Before reading this blog post you should read: the Practical guide to libSVM, the Wikipedia entries on SVM and on cross-validation, and the chapter on SVMs in Bishop's PRML. In this post I will not explain how to use open-source SVM tools (like libsvm or svmlight); instead I assume you have already used them and want a clever script to automate them. Also note that the script below uses default parameters, so if you get poor results you can add an outer for loop that iterates through your choice of parameters and supplies them to the program ./svm_learn (see the sketch right after this paragraph).
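Such an outer loop might look like the following minimal sketch. Here -c is SVMlight's trade-off between training error and margin; the file names and the list of candidate values are placeholders you would substitute with your own.

# Hypothetical outer loop over svm_learn's -c (error/margin trade-off) parameter.
# myTrain.txt, myTest.txt and the candidate values are placeholders, not recommendations.
for c in 0.01 0.1 1 10 100
do
    echo "C =" $c
    ./svm_learn -c $c myTrain.txt myModel.txt > /dev/null
    ./svm_classify myTest.txt myModel.txt myPredictions.txt
done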
To perform cross-validation on libsvm:
- Download and extract it
- You need gnuplot installed on the machine where you will run libsvm. If you want to run it on a remote server, connect with 'ssh -X myuserName@myMachineName' so gnuplot can display its plots
- make
- Go to Tools folder and run 'python easy.py my_training_file my_test_file'
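If you would rather skip easy.py, the svm-train binary itself has a -v flag for built-in k-fold cross-validation. A minimal sketch (heart_scale is the sample file that ships with libsvm; the gamma and cost values are just placeholders, not tuned recommendations):

# 10-fold cross-validation with an RBF kernel; prints the cross-validation accuracy
./svm-train -v 10 -t 2 -g 0.5 -c 8 heart_scale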
To perform cross-validation on svmlight:
- Download and extract it. There you will find two important binaries: svm_learn and svm_classify
- Unfortunately, gathering the cross-validation data with svmlight is not as easy, so I have written the following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file

RESULT_FILE=$5
echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4 ")" > $RESULT_FILE

# temporary files for the model, training output, and predictions
MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"

DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"
TEMP_RESULT=$RESULT_FILE".temp"

# sed pattern that extracts the accuracy from svm_classify's "Accuracy on test set" line
SVM_PATTERN='s/Accuracy on test set: \([0-9]*.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'

for k in `seq 1 $NUM_CYCLES`
do
  # shuffle the data, then split it into a training set and a test set
  sort -R $DATA_FILE > $TEMP_DATA_FILE
  head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
  tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE

  echo "------------------------------------------" >> $RESULT_FILE
  echo "Cross-validation cycle:" $k >> $RESULT_FILE

  # first run svm with default parameters, varying the polynomial degree
  echo "" >> $RESULT_FILE
  echo "Polynomial SVM with default parameters" >> $RESULT_FILE
  for i in 1 2 3 4 5 6 7 8 9 10
  do
    echo "order:" $i >> $RESULT_FILE
    ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
    ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
    cat $TEMP_RESULT >> $RESULT_FILE
    # strip the "Reading model" and "Precision" lines, keeping only the accuracy line
    sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
    sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
    sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
  done

  # now vary gamma for the RBF kernel
  echo "" >> $RESULT_FILE
  echo "RBF SVM with default parameters" >> $RESULT_FILE
  for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
  do
    echo "gamma:" $g >> $RESULT_FILE
    ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
    ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
    cat $TEMP_RESULT >> $RESULT_FILE
    sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
    sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
    sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
  done
done

# clean up the temporary files (including the train/test splits)
rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TRAIN_FILE $TEST_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE
Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt
Each line of myDataInSVMLightFormat.txt looks like "binaryClassLabel(-1 or 1) featureNum:value ...". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06
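If your data is not already in this format, a small awk one-liner can usually produce it. This sketch assumes a comma-separated file (myData.csv, a hypothetical name) whose first column is the -1/1 label and whose remaining columns are feature values, numbered starting at 1:

# convert label,f1,f2,... into "label 1:f1 2:f2 ..." (SVMlight/libsvm format)
awk -F',' '{ printf "%s", $1; for (i = 2; i <= NF; i++) printf " %d:%s", i-1, $i; printf "\n" }' myData.csv > myDataInSVMLightFormat.txt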
After running this script you will get two important files: MyResults.txt and betterMyResults.txt (a space-separated file with the format 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt into R or Matlab and get some pretty graphs :)
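Before plotting, it can also be handy to average the accuracy for each kernel/parameter pair across all cross-validation cycles straight from the shell; a quick awk sketch over betterMyResults.txt (assuming it has exactly the four-column format above):

# average column 4 (accuracy) for every kernel/parameter pair across all cycles
awk '{ sum[$2" "$3] += $4; cnt[$2" "$3]++ } END { for (k in sum) print k, sum[k]/cnt[k] }' betterMyResults.txt | sort -k1,1 -k2,2n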