Thursday, February 02, 2012

Cross-validation on SVM-Light

Before reading this blog post, you should read: the practical guide to libSVM, the Wikipedia entries on SVM and cross-validation, and the chapter on SVMs in Bishop's PRML. In this post I will not explain how to use open-source SVM tools (like libsvm or svmlight); rather, I assume that you have already used them and want a script to automate the cross-validation runs. Also note that the script below uses the default parameters, so if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to ./svm_learn (a sketch of such a loop follows).
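
For example, here is a minimal sketch of such an outer loop over the -c parameter (the trade-off between training error and margin). The file names are placeholders; in practice you would splice the loop around the svm_learn calls inside the script below.

#!/bin/bash
# Sketch only: try a few values of C and report the test accuracy for each.
# TRAIN_FILE, TEST_FILE, MODEL_FILE and PRED_FILE are assumed to be set already.
for c in 0.01 0.1 1 10 100
do
 echo "C =" $c
 ./svm_learn -c $c $TRAIN_FILE $MODEL_FILE > /dev/null
 ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE | grep "Accuracy"
done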

To perform cross-validation on libsvm:
  1. Download and extract it
  2. You need gnuplot installed on the machine where you will run libsvm. If you want to run it on a remote server, use 'ssh -X myUserName@myMachineName'
  3. make
  4. Go to the tools folder and run 'python easy.py my_training_file my_test_file' (see the command sketch below)
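
Putting these steps together, a minimal command-line sketch (the archive name and the extracted folder name are assumptions; adjust them to the version you download):

tar xzf libsvm.tar.gz       # step 1: extract the downloaded archive
cd libsvm-*                 # folder name depends on the version
make                        # step 3: build the binaries
cd tools
python easy.py my_training_file my_test_file   # step 4: scaling, grid search and cross-validation
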
To perform cross-validation on svmlight:
  1. Download and extract it. There you will find two important binaries: svm_learn and svm_classify
  2. Unfortunately, gathering cross-validation results with svmlight is not as easy, so I have written the following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5

echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4 > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"

TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: \([0-9]*\.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/' # sed prefix that captures the accuracy; the replacement part is appended where it is used below
for k in `seq 1 $NUM_CYCLES`
do
 sort -R $DATA_FILE > $TEMP_DATA_FILE
 head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
 tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
 
 echo "------------------------------------------"  >> $RESULT_FILE
 echo "Cross-validation cycle:" $k >> $RESULT_FILE
 
 # first run svm with default parameters
 echo ""  >> $RESULT_FILE
 echo "Polynomial SVM with default parameters" >> $RESULT_FILE
 for i in 1 2 3 4 5 6 7 8 9 10
 do
  echo "order:" $i >> $RESULT_FILE
  ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

 echo ""  >> $RESULT_FILE
 echo "RBF SVM with default parameters" >> $RESULT_FILE
 for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
 do
  echo "gamma:" $g >> $RESULT_FILE
  ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

done

rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TRAIN_FILE $TEST_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE

Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt

Each line of myDataInSVMLightFormat.txt looks like "binary class label (-1 or 1) featureNum:value ...". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06 
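
If your data is instead in a CSV file with the label in the first column, a one-line awk sketch can convert it to this format (assuming purely numeric features with no missing values; mydata.csv is a placeholder name):

awk -F',' '{printf "%s", $1; for (i = 2; i <= NF; i++) printf " %d:%s", i-1, $i; printf "\n"}' mydata.csv > myDataInSVMLightFormat.txt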

After running this script you will get two important files: MyResults.txt and betterMyResults.txt (a space-separated file with the format 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt into R or Matlab and make some pretty graphs :)
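
For a quick look before switching to R or Matlab, a small awk sketch (assuming the four-column format above; the NF == 4 guard skips any stray svm_classify lines that slipped into the file) prints the mean accuracy for each kernel/parameter pair:

awk 'NF == 4 {key = $2 " " $3; sum[key] += $4; cnt[key]++} END {for (k in sum) print k, sum[k]/cnt[k]}' betterMyResults.txt | sort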

7 comments:

Freddy Lopez said...

This is a great contribution. My honest thanks.

Unknown said...

Thanks for your instructions. But in which folder should I run the svm_light_cross_validation script?

Niketan said...

I used this script to test a handful of files. If you have a lot of data files to test, I would either change the variables starting from line 7 (e.g., MODEL_FILE="./someDir/model."$RANDOM".txt" and so on) or simply write a wrapper shell script around it (a minimal sketch follows) :)
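
For instance, a minimal wrapper sketch (the ./data/*.txt glob and the train/test/cycle counts are placeholders):

for f in ./data/*.txt
do
 ./svm_light_cross_validation "$f" 5000 2000 10 "results_"$(basename "$f")
done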

Unknown said...

I may not have made myself clear. I have only one file for the experiment. I placed your script under the svm_light folder, which is at the same level as my file, but errors like "./svm_light_cross_validation: line 46: 3465 Floating point exception: 8 ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE" keep repeating. Can you shed some light on how to make it work?

Niketan said...

Have you downloaded svmlight and placed the svm_learn and svm_classify binaries in the same folder? Also, to double-check, please run "./svm_learn" to see if the binaries work on your OS.

Unknown said...

SVM itself works fine; I can use it to perform default training and testing.
"./svm_light_cross_validation ../feature_set 200 36 10 MyResults.txt" is how I run it, from the SVM folder.

Niketan said...

Just to confirm: in your MyResults.txt file, do the entries after "Polynomial SVM with default parameters" look good? Also, I suspect you are not getting entries after the line "RBF SVM with default parameters" except "gamma: 0.00001", "gamma: 0.0001" and so on. Am I correct? If yes, your shell probably has a problem interpreting the variable "g" in line 46. Try putting a floating-point operation (like an addition) after line 48 and see if we are heading in the right direction.