Wednesday, February 01, 2012

Using VIM for "fast" data transformation

Recently, I had to give my data to a lab-mate. I had done some pre-processing on 20 Newsgroup dataset and converted it to following format:
- Each line represented a document.
- Words were replaced by an integer that denoted it's index in vocabulary.txt file
- The line was append with a binary classifier (1 or -1) based on whether the document was in "comp" newsgroup or "sci" newsgroup.
- The words were represented along with their frequency.
For example, let's assume for now the word "technology" is 1002th word in the vocabulary. Then, if the word "technology" appeared 10 times in 12th document (in the folder comp.graphics), then 12th line in my file will be:
1 1002:10 ...

Here is a sample line in the file:
1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1
1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1
Now, I had to convert this to following format:
fileNum | wordIndex | frequency
For example:
3|3798|1
3|9450|1
3|12429|1
3|13352|1
3|14155|1
I used following VIM tricks:
1. First delete the binary classifiers (and also blank spaces at the end of the line):

:%s/^-1//g
:%s/^1//g
:%s/ $//g
2. Then, comes the most important part: Insert line number before every word.
To do this, I used following trick: - First replace spaces by some arbitrary character, in my case '='
- Then, replace that character by the line number !!!
:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"
The last command needs some explanation: 
It says, execute a command globally whenever you see '=' character: :g/=/exec some-command
In our case the command is replacing that character by the line number. The line number of a given line can be found using VIM's internal function: line(".")
Hence, the input file is converted into following format:
3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
  4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1
3. Now, to some garbage cleaning and trivial replacement:
:%s/:/|/g
:%s/  / /g
:%s/^ //g
4. Finally, replace spaces by a newline character:
:%s/ /!!!!!!/g where !!!!!! is press Control+V and then press Enter 
Control-V is the special character escape key. Remember the character '\n' won't work here.

And voila you are done !!!

No comments: