GitHub - dat/svm-tools: This collection of utilities that complements LIBSVM provides tools to convert ARFF file to SVM file format, shuffling, remapping, and distributed grid searching across multiple computers for support vector machine training.

dat / svm-tools Public

This collection of utilities that complements LIBSVM provides tools to convert ARFF file to SVM file format, shuffling, remapping, and distributed grid searching across multiple computers for support vector machine training.

11 stars 8 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README		README
aliasing.sh		aliasing.sh
arff2svm.py		arff2svm.py
grid-search.py		grid-search.py
report.py		report.py
shuffle.sh		shuffle.sh
svm-remap.py		svm-remap.py

Repository files navigation

Support Vector Machine Training Utilities

Author: dth

Description:

    This collection of utilities provide support to convert ARFF file to SVM file
    format, shuffling, remapping, and distributed grid searching across multiple computers.

Requirements:
    - ssh client
    - bash
    - Python 3.0+
    - svm-train, svm-predict, svm-scale (provided by LIBSVM library)

Utilities:
    All utilities provide help/usage message when invoked with -h or --help flags.
    - aliasing.py : alias all the utilities to your shell 
    - arff2svm.py : convert ARFF file format to SVM file format
    - grid-search.py : distributed grid searching across (c,g)
    - report.py : generate accuracy report with confusion matrices
    - shuffle.sh : shuffle lines of a file
    - svm-remap.py : remapping SVM files to each other to matching fields

ARFF Format:
    The ARFF (Attribute-Relation File Format) format was devised by the Machine
    Learning Project at the Department of Computer Science of The University of
    Waikato for use with the Weka machine learning software. More information
    can be found here (http://www.cs.waikato.ac.nz/~ml/weka/arff.html).

    The data section of of an ARFF file obeys the following Backus-Naur Form:

    <line> ::= (<value> ",")+ <category>
    <value> ::= <float>
    <category> ::= <string>

SVM Format:
    SVM format obeys the following Backus-Naur Form, originated from this
    (http://www.cs.cornell.edu/People/tj/svm_light/):

    <line> ::= <target> " " (<feature> ":" <value>)+
    <target> ::= <positive-integer>
    <feature> ::= <positive-integer>
    <value> ::= <float>

    Features with value zero can be skipped.

Instructions:

a) Set up remoteless SSH.

b) Setting up the utilities.

    1. Edit grid-search.py to the desired number of processes to spawn, SSH servers to use,
        location of svm-train on each server, and additional arguments to svm-train.
    2. Edit report.py to the location of svm-predict on each server.

c) Example:
    Suppose you want to conduct grid searching on 5,000 random data points of 10,000-point dataset
    train.arff file using 5-fold cross validation. Then use the best parameters found to create a 
    model with the entire 10,000-point dataset and test that on 500-point test.arff file.

    1. arff2svm train.fields train.arff train.svm
    2. arff2svm test.fields test.arff tmp.svm
    3. svm-remap test.fields train.fields tmp.svm test.svm
    4. svm-scale -l 0 -u 1 -s scaling_parameters train.svm > trains.svm
    5. svm-scale -l 0 -u 1 -r scaling_parameters test.svm > tests.svm
    4. shuffle trains.svm
    3. head -n 5000 trains.svm > grid.svm
    4. Distribute grid.svm across all the computing nodes. We will assume as "~/shared/grid.svm".
    5. grid-search --log2c -5 5 1 --log2g -5 5 1 --fold 5 "~/shared/grid.svm" gridscore.file
    6. Examine gridscore.file to find the best (log2c, log2g) parameters.
        Suppose they are log2c=1, log2g=0; thus c = 2^log2c = 2, g = 2^log2g = 1.
    7. svm-train -h 0 -t 2 -c 2 -g 1 trains.svm model_t2-c2-g1
    8. report --fields train.fields model_t2-c2-g1 tests.svm

Recommendations:
    A good way to automatically distribute files across the servers is to run an NFS server
    on the head node serving the directory ~/shared. Put svm-train, svm-predict, svm-scale,
    and all training/testing data into that directory. Create "~/shared" directory on each
    computing node and mount the head node's "~/shared" folder onto that directory via NFS.
    This way any changes made to the data files and binaries will be propagated to each node.

    To ensure that your SSH session don't automatically log out when it is idle during a long
    calculation, add the following line to ~/.ssh/config or to /etc/ssh_config on the client:

        ServerAliveInterval 60

    This will send a "keep alive" signal to the server every 60 seconds.