Before reading this blog post you should read: the Practical Guide to libSVM, the Wikipedia entries on SVMs and cross-validation, and the SVM chapter of Bishop's PRML. In this post I will not explain how to use open-source SVM tools (like libsvm or svmlight); rather, I assume that you have already used them and want a script to automate them. Also note that the script below uses default parameters, so if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to ./svm_learn.
To perform cross-validation on libsvm:
- Download libsvm and extract it
- You need gnuplot installed on the machine where you will run libsvm. If you want to run it on a server, do 'ssh -X myUserName@myMachineName'
- Run 'make'
- Go to the tools folder and run 'python easy.py my_training_file my_test_file'
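Since easy.py shells out to gnuplot to draw its parameter-search contour plot, a quick check that gnuplot is actually on your PATH can save a confusing failure later. A minimal sketch (the echoed messages are my own wording):

```shell
# 'command -v' resolves a program name; it exits non-zero when the program is absent
have_gnuplot=$(command -v gnuplot > /dev/null 2>&1 && echo yes || echo no)
if [ "$have_gnuplot" = "yes" ]
then
  echo "gnuplot found"
else
  echo "gnuplot missing: install it (or ssh -X as above) before running easy.py"
fi
```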
- Download svmlight and extract it. There you will find two important binaries: svm_learn and svm_classify
- Unfortunately, gathering cross-validation data with svmlight is not as easy, so I have written the following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file

RESULT_FILE=$5
echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4")" > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4
TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"
TEMP_RESULT=$RESULT_FILE".temp"

# sed pattern that pulls the accuracy number out of svm_classify's output
SVM_PATTERN='s/Accuracy on test set: \([0-9]*\.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'

for k in `seq 1 $NUM_CYCLES`
do
  # shuffle the data, then split it into training and test sets
  sort -R $DATA_FILE > $TEMP_DATA_FILE
  head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
  tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
  echo "------------------------------------------" >> $RESULT_FILE
  echo "Cross-validation cycle:" $k >> $RESULT_FILE

  # first run svm with default parameters
  echo "" >> $RESULT_FILE
  echo "Polynomial SVM with default parameters" >> $RESULT_FILE
  for i in 1 2 3 4 5 6 7 8 9 10
  do
    echo "order:" $i >> $RESULT_FILE
    ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
    ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
    cat $TEMP_RESULT >> $RESULT_FILE
    # keep only the accuracy line and reformat it for "better"$RESULT_FILE
    sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
    sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
    sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
  done

  echo "" >> $RESULT_FILE
  echo "RBF SVM with default parameters" >> $RESULT_FILE
  for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
  do
    echo "gamma:" $g >> $RESULT_FILE
    ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
    # note: redirect to $TEMP_RESULT (not append to $TEMP_FILE) so the accuracy line is captured
    ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
    cat $TEMP_RESULT >> $RESULT_FILE
    sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
    sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
    sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
  done
done

rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TRAIN_FILE $TEST_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE
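One portability note: the script leans on GNU sort's -R flag to shuffle the data before each split, and some systems (e.g. BSD/macOS sort) do not have -R. An awk-based shuffle does the same job; a minimal sketch on a hypothetical toy_data.txt (all file names here are illustrative only):

```shell
# toy data in svmlight format, standing in for the real data file
cat > toy_data.txt <<'EOF'
-1 1:0.1
1 1:0.9
1 1:0.7
-1 1:0.2
EOF

# prefix every line with a random key, sort on the key, then strip the key off
awk 'BEGIN { srand() } { printf "%.6f\t%s\n", rand(), $0 }' toy_data.txt \
  | sort -n | cut -f 2- > shuffled.txt

# split exactly as the script does (3 training lines, 1 test line here)
head -n 3 shuffled.txt > train.txt
tail -n 1 shuffled.txt > test.txt
```

Swapping this in for the sort -R line changes nothing else in the script, since head and tail only care about line order.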
Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt
Each line of myDataInSVMLightFormat.txt looks like "label(-1 or 1) featureNum:value ...". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06
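svm_learn tends to fail in opaque ways when a line deviates from this format, so a quick validation pass can be worthwhile before training. A minimal sketch; the file name and the regular expression are my own illustration:

```shell
# two well-formed example lines in the format shown above
cat > format_check_toy.txt <<'EOF'
1 1:1.75814e-14 2:1.74821e-05
-1 3:0.00600121 10:0.002669
EOF

# count lines that do NOT look like "label featureNum:value ..."
bad=$(awk '!/^[+-]?[0-9]+( [0-9]+:[0-9.eE+-]+)+ *$/ { n++ } END { print n+0 }' format_check_toy.txt)
echo "malformed lines: $bad"
```

Run it against your real data file and investigate any non-zero count before blaming the SVM.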
After running this script you will get two important files: MyResults.txt and betterMyResults.txt (a space-separated file with the format 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt into R or Matlab and get some pretty graphs :)
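If you just want per-parameter averages without leaving the shell, awk can aggregate the four-column output directly. A minimal sketch on a toy file in the same format (toy_betterMyResults.txt is a made-up stand-in for the real betterMyResults.txt produced by the script):

```shell
# toy file: validation_cycle_num [poly|rbf] parameterValue accuracy
cat > toy_betterMyResults.txt <<'EOF'
1 rbf 0.001 81.5
2 rbf 0.001 82.5
1 rbf 1 90.0
2 rbf 1 92.0
1 poly 2 85.0
EOF

# mean accuracy for each rbf gamma, sorted by gamma
awk '$2 == "rbf" { sum[$3] += $4; n[$3]++ }
     END { for (g in sum) printf "%s %.2f\n", g, sum[g] / n[g] }' toy_betterMyResults.txt \
  | sort -n > rbf_means.txt
cat rbf_means.txt
```

The same pattern with $2 == "poly" averages the polynomial runs instead.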
7 comments:
This is a great contribution. My honest thanks.
Thanks for your instructions. But in which folder should I run the svm_light_cross_validation script?
I used this script for testing on a handful of files. If you have tons of data files to test, I would either change the variables starting from line 7 (e.g. MODEL_FILE="./someDir/model."$RANDOM".txt" and so on) or simply write a wrapper shell script over it :)
I might not have made myself well understood. I have only one file for the experiment. Also, I placed your script under the svm_light folder, which is at the same level as my file, but errors like "./svm_light_cross_validation: line 46: 3465 Floating point exception: 8 ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE" keep repeating. Can you shed some light on how to make it work?
Have you downloaded svmlight and placed the svm_learn and svm_classify binaries in the same folder? Also, to double-check, please run "./svm_learn" to see if the binaries work on your OS.
SVM itself is working. I could use it to perform default training and testing.
"./svm_light_cross_validation ../feature_set 200 36 10 MyResults.txt" is how I run it, in SVM folder.
Just to confirm: in your MyResults.txt file, do the entries after "Polynomial SVM with default parameters" look good? Also, I suspect you are not getting entries after the line "RBF SVM with default parameters" except "gamma: 0.00001", "gamma: 0.0001" and so on. Am I correct? If yes, your shell is probably having trouble interpreting the variable "g" on line 46. Try putting a floating-point operation (like an addition) after line 48 and see if we are heading in the right direction.