## Tuesday, December 25, 2012

### My views on recent protest in New Delhi

According to 2011 census in India1, there are 586469174 women in India, i.e. 48.46% of India's population and 8.38% of total world's population (which is roughly around the total number of English and Spanish speaking people in the world).

Out of these only 65.46% are literate2 (as compared to 82.14% literacy rate in males), with the definition of literacy being very narrow.

Crimes against women3 include rape (20737 cases in the year 2007), dowry deaths (8093 cases), molestation (38734 cases), sexual harassment (10950 cases) and cruelty by husbands/relatives (75930 cases). To add to that, there is human trafficking (for sexual slavery or domestic servitude), female infanticide, discrimination, acid attacks, lack of health care and forced marriage. The number of rape cases has increased by 80% in 7 years from 2000 and is still increasing. In the year 2011, 26000 rapes were reported4 throughout India; in fact it is the fastest growing crime in India. According to Thomas Reuters Foundation5, India is the fourth most dangerous country in the world for women behind Afghanistan, Congo and Pakistan.

According to many, there are three main reasons behind this:
1. Social taboo:
We Indians (including me) are hypocrites when it comes to dealing with rape survivors6. They are outcasted to protect family's so-called honor and hence most of the rape cases aren't reported or are dropped. Also, in event of such crimes, most of us prefer not to intervene and help the victim (be it eve-teasing, dowry, rape, etc) and her family. This indifference in our attitude provides additional incentives to the rapist to commit such heinous crime.

2. Severity of punishment:
An unenforced law means nothing however stringent they are. So, I don't think severity of punishment acts as a deterrent to rapists. In fact, considering how our law enforcement agencies (both judicial and policing) are, the rapists might not be wrong to think they won't be apprehended at all. With such an alarming increase in rape cases, one might think it's reasonable to support increasing severity in special circumstances7 and especially for repeat offenders. But, the point to note is, in India, rape is punishable up to life imprisonment9 which I think is severe enough. The main problem is that it doesn't get enforced or it takes way too long.

3. Lack of speedy and fair trials with severe punishment to rape shield violators:
There already are provisions for speedy trials10 and rape shields11, but don't get executed due to lack of resources and infrastructure. First, there are 26.3 million pending cases8. Then there aren't enough women officers (I believe there should be an option for the victim to talk directly to women officers in every police chowky, for reporting rape, molestation, etc). Lastly, even though India has one of the strongest rape shield laws, Chaudhry11 says they are routinely violated or reduced to technicality in practice. It is imperative to punish any authority that violates the rape shield.

Let me reiterate my point: an unenforced law will solve nothing. I see the protestors (which btw I salute for their efforts, barring the ones that are doing it for political gains and media attention/ratings) are focusing on the wrong issue: asking for death penalty law for rape.

Why ? Researchers have shown again and again12, certainty and not severity of punishment proves to be true deterrent for the crime. In fact, increasing severity can turn to be counterproductive (i.e. may lead to more false negatives).

Sure, increasing severity might help us, as the society, deal with the shock, but it will do nothing to help improve the safety of women. The law will be passed, authorities will keep on violating rape shield, there will still be little or no resources to handle the case quickly and we shall go to living our life once again forgetting this incident ever happened and at the same time congratulating ourselves for the job well done. If you do truly care about the situation, let's forget the quick fix and do a more difficult thing: Take a step back and dispassionately think how we can solve this problem once and for all. Here is what I could think of:
- Punishment or/and suspension of authority breaking rape shield (or discouraging victims for registering complaints).
- Increase in forensic and judiciary budgets to expedite trials.
- More women cops and psychological experts to help deal with the trauma. Also, create a special department for dealing with these crimes or train existing police officers to deal with such scenarios.
- Incentive programs for improving the female literacy rate.
- Standing up for the crimes against women when we see it, while still keeping our paternal instinct in check so that every women gets to enjoy her liberty.

It took me half a day to write this blogpost and I humbly admit I am missing lot of points. But, I am sure a more experienced panel can come up with better solution if they are not sidelined by politics or their own self-interest.

References:
1. Census of India: http://www.censusindia.gov.in/
2. http://en.wikipedia.org/wiki/2011_census_of_India
3. National Crime Records Bureau: http://ncrb.nic.in/cii2007/cii-2007/1953-2007.pdf
4. http://en.wikipedia.org/wiki/Rape_in_India
5. http://www.trust.org/documents/womens-rights/resources/2011WomenPollResults.pdf
6. http://www.ndtv.com/video/player/news/determined-not-to-hide-her-identity-rape-survivor-bravely-tells-her-story/259248
7. http://en.wikipedia.org/wiki/2012_Delhi_gang_rape_case
8. http://www.hindustantimes.com/News-Feed/india/Nearly-30-million-cases-pending-in-courts/Article1-224578.aspx
9. http://indiankanoon.org/doc/1279834/
10. http://defensewiki.ibj.org/index.php/Right_to_a_Speedy_Trial#India
12. http://www.sentencingproject.org/doc/deterrence%20briefing%20.pdf

## Sunday, August 12, 2012

### Education: whether to skip school or not ?

Couple of days back, one of my friend (Deepak) posted a post, which lead to somewhat heated debate. Before, I go any further, I must admit I have not read the blog writer's (Nikhil's) book and judging from the title "One Size Does Not Fit All: A Student’s Assessment of School", he seems to be radical thinker (as suggested by Deepak). Also, the above mentioned post might have assumed that the reader has already read his book and hence the post itself could be out of context for me.

Personally, I felt the post was just a collage of motivational/self-help pieces, without a direct and measurable theme/message. People don't need cheer-leaders that say "here are pictures/quotes of Einstein, Jobs, Ford ... yaay ... now, go and become like them it"; they need a coach ... a teacher, who gives them a clearly defined way or approach to achieve it ... stating how much effort one needs to put into it and what are pros and cons of following that approach. For example, 10,000 hour rule of Gladwell's Outliers or Allen's GTD. Sure there are gray-areas, situations where people only need motivational speeches and not solutions (for example: Notes to myself) ... and may be the above post could have been intended for exactly that.

Just a side-note (and not directing directly to the author of the post), I think writers of self-help books (or any non-fiction book for that matter), should try to aspire for journalistic integrity ... which means checking facts before you make a statement, validating your propositions and yes, being direct and clear. If you don't have time to research something (for example: if you are writing a time sensitive news article), or are making some assumptions (say, about the audience or the context of the article), add a disclaimer and make a reasonable statement based on Occam's razor. And most importantly, be very critical of the implications of what you write. Sure, you are allowed to be wrong (in few if not all scenarios, for example: if you track scientific research, every once in a while an idea proposed few years back is discarded). In fact there is a mathematical argument, that you cannot prove anything to absolute certainty. But it should at least pass a sniff test of critical thinking: If you make your case to a rational person without any prejudice on that particular subject, is he/she going to dismiss it outright ? An example of totally ridiculous argument that does not pass this sniff test is Zakir Naik on NDTV about Osama bin Laden, which might not be violation of first amendment (i.e. freedom of speech); but clearly can have damaging consequences, especially if audience is ill-informed and trusts the speaker.

Now coming back to the debate (given in the below image ... the first line that is cut off is "Sorry Deepak, I don't see any tangible idea in his blog ... may be i am missing something ... sure, ") with my friend, the sentence in the blog that ticked me off was "Drop the textbooks. Get out and do it":
After this debate/conversation, I felt like I was just a critic, bashing the post and nitpicking on a single possibly unintended sentence, rather than contributing to the subject. Hence, I decided to write this post to give my views on education.

I am doing PhD, so there is going to be bias in support of education. So I ask you to consider the points discussed below with respect to your situation, taking into account my personal bias.

I honestly believe formal education is easiest and fastest way to success (though many students don't feel like that's true due to amount of competition they see around). You can either work really hard during four years of graduation or struggle rest of forty years of your career to compensate for those four years. There are exceptions of course, for example, athletes who prefer/need to horn their skills than study (like Sachin Tendulkar, Usain Bolt), entrepreneurs for whom opportunity cost is higher than education (like Bill Gates, Steve Jobs, Richard Branson) and people who are forced to discontinue their formal education due to certain circumstances (Nobel laureate Doris Lessing). Lot of books and post by these entrepreneurs advocate to try to be these exceptions, but I suggest caution. They are called "exceptions" for a reason; count number of people (either in your locality or in the world) who left their schools versus number of people who attended the schools, and then use your measure of success to classify them as successul or unsuccessful ... you will find your answer :)

So, why do some smart people (like authors of books mentioned in above paragraph) suggest on leaving the schools ? Well, you need to read between the lines. If you have fairly concrete idea of your product/service (along with somewhat reasonable plan) and believe waiting to finish your education will cost you the competitive edge, then it might be a good idea to take a sabbatical and be an entrepreneur. Also, read advices and books on startup, understand the novelty, amount of hard work and dedication required in building a successful startup before you make your decision of leaving the school.

Here is a high level intuition of why formal education seems tedious and boring: Society wants to filter out people who cannot survive the dip. (The dip is a period of long struggle that is necessary to achieve the goal and be differentiated from general public). For example, if PhDs or medicine were not as difficult as they are, there would be abundance of doctors and PhDs, and the elite status they enjoy now would be gone; so the society wants to ensure that these so-called elites really have to struggle before they can be labelled into this category. Sure, there is other reasons too: Some degrees (PhDs, MDs) requires knowing lot of stuff, and hence the long graduation period. You can't become a doctor by just knowing how to apply a bandage or by memorizing prescriptions for common diseases. For sake of this argument, say if a person can, would you want your family member treated by such person ?

That leads to the question of what are rational reasons to quit your studies (other than the case discussed above): If you think (or can forecast that) after surviving the dip, you don't like what's end of it, then it might be a good idea chose a different path. For example: during first year of medical school, if you don't like the life of doctor, it might be a good idea to drop out of medical school and do something that you are really passionate about. But, it is clearly a bad idea to drop out just because you feel medical school is hard or there is too much competition. Similarly, there are also rational reasons to quit PhD, but all of them should be accompanied by careful thinking and introspection. The book also covers other scenarios such as dead-end, where you have no hope of moving ahead and are stuck in routine ... but I don't think education fits into that scenario.

Note, if you believe you have dyslexia, bipolar or any such health issues (either physical or mental), then you need to completely ignore my advices and consult your family and/or therapist, as they will give you better suggestions than this blogpost.

## Monday, July 16, 2012

### Improving speech recognition using clustering-based adaptive model assignment

I came up with this idea while working on Voca. Since I don't intend to include this as part of my PhD thesis, write a research paper (since I don't have time to perform experimental evaluation) or make money out of it by patenting it (thanks to extremely slow Rice OTT process), I thought it might at least make a good blog entry.

In this blogpost, I will discuss a novel approach to improve speech recognition that in-turn will help improve the accuracy of Voca desktop app (if and when I decide to work on it). This approach can be applied to any speech-to-text software that is to be distributed to masses (eg: Siri).

## 1. Background

1.1 Introduction
Automatic speech recognition (ASR) involves predicting the most likely sequence of words W for an input audio data A. To do so, the speech recognizers use the following optimization function[2]:
\begin{align*} \hat{W}&= {arg\,max}_{W} P(W | A) & \\ &= {arg\,max}_{W} \dfrac{P(W) P(A | W)}{P(A)} & \dots \text{ Bayes theorem} \\ &\propto {arg\,max}_{W} P(W) P(A | W) & \dots \text{ The denominator normalizes to a constant} \end{align*}

The term P (W ) means the probability of occurrence of given sequence of words and is defined by the language model. At a high level, the language model captures the inherent restrictions imposed by the english language. For example: a language model might imply that the sentence “it is raining heavily in houston” is more likely sentence that “raining it is in heavily houston”.

The term P(A|W) refers to the probability of an audio data given the sequence of word and is captured by the acoustic model, which at a high level is probabilistic mapping of the acoustic sound to the words. The acoustic model is usually found by training hidden markov model on a small subset of users. The users that train the acoustic model are referred as trainers in subsequent sections.

To summarize, the speech recognizer searches for the sequence of words, that maximizes the above function using the acoustic and the language model. The output of this search procedure is a tuple < words W, confidence >, where confidence refers to the value of the above function for W. For example: <“it is raining heavily in houston”, 0.9 >
Thus, accuracy of speech recognizer thus highly depends on these two models. Since only acoustic model varies with respect to the user, in our further discussion we will only refer to acoustic model and assume the use of a standardized language model (eg: HUB4).
1.2 Roles
There are three types of people involved in speech recognition process:
• Developers write the code for the speech recognition software.
• Trainers record their voice and give the audio files to the developers. Developers then use these audio files to train the acoustic model.
• Users are the general audience who uses the speech recognition software. The number of trainers is usually much less than number of users.
1.3 Steps in speech recognition process
The steps in typical speech recognition API/software is as follows:
1. Before releasing the API/software:
1. (a)  Developers encourages trainers to record their voice.
2. (b)  Developers use the audio files to train a universal model for all the users. The developers may also use information provided by experts to improve language model.
3. (c)  Developers bundle together the model and the search algorithm and release the API/software.
2. After releasing the API/software:
(a) The search algorithm in the API/software tries to find what the user is saying by comparing the user’s acoustic signal with its universal model using various statistical methods.

1.4 Challenges
However, speech recognition is inherently a hard problem due to issues like:
1. Differences in speaking style due to accent, regional dialect, sex, age, etc (Eg: the acoustic signature of a young chinese girl saying some words will probably differ from an older german person speaking those exact same words).
2. Lack of natural pauses to detect word boundaries.
3. Computational issues like search space, homophones (words that sounds same), echo effect due to surroundings, background noise.
Lot of research effort has been put to tackle issues due to point 2 and 3. To tackle issue 1, the most common solution is to train on as many people as possible to create an somewhat universal acoustic model that captures these differences (as mentioned in step 1b of section 1.3). This is because most people do not want to spend hours training the acoustic model and would prefer an out-of-the-box solution. The problem is this approach is that it always creates a biased acoustic model that works well for some people, but is highly inaccurate with many.

## 2. Summary

Instead of using one universal model that works for everyone, we propose the following approach:
Train k different models in step 1b of section 1.3 (where is k is reasonably small number, for example: 100) such that each of these models try to capture the speaking style of a particular group of people. Then, recommend the model that best suits the user depending on his/her acoustic similarity with each of the above groups.

2.1 Why this idea will work in practice ?
1. It is a well known fact that the accuracy of the speech recognizer is higher if the trainer and the user are the same person.
2. As pointed out earlier, most users do not want to spend hours training the model.
3. Some people are more acoustically similar than others. For eg: model trained on young native american student will suit well for a similar student, but might not suit well for an old Indian lady. Hence, if the trainer and the user speak very similarly, the model will have higher accuracy than an universal model.
4. We like to point out that users can be clustered into groups by the way they speak and the number of these groups are finite. This is basically a soft assignment, for eg: I would fall in the category American accent with probability 0.2, but in category Indian accent with probability 0.7. Thus this knowledge can be incorporated in the recognition process.
2.2 Related Works
An area that is close to our invention is speaker-adaptive modeling[1], where the universal model is adapted to the user based on training on speaker- dependent features. This means, the parameters of acoustic models are split into two parts:
• Speaker independent parameters
• Speaker dependent parameters
This division requires user to supply relatively smaller training set (which need not be done during initial setup). However, it still has following difficulties
• Training is significantly compute-intensive task and might not be a viable option for smart phones, netbooks or even laptops. Offloading the training on to server will involve more network usage (which in turn will increase the cost of speech recognition for user or/and developer).
• Training might require expert-intervention.
• It still is not an out-of-the-box solution as an off-line solution will require some training to make it speaker-adaptive.
For n users (where n k), Speaker-adaptive systems in effect will have n models and our clustering-based model assignment approach will have k models. Since k models are trained on the server, model training will also benefit from expert-intervention without any additional cost to the user. It is important to note that model assignment does not require any expert- intervention and is not same as speaker-adaptive training.

## 3. Detail discussion

3.1 Key conditions
1. Every user can be represented as a mixture model over the acoustic model. Example: Let’s say that user U1 is sampled with mixture proportion {0.9, 0.1} over model m1 and m2 respectively. This means that if we run recognition on the audio file generated by the user U1 on both the models, it is likely that model m1 will produce output that is going to be 9 times more accurate that model m2.
2. When we create a model, the job of the expert is to make sure that:
1. (a)  Each model is representative of a distinct accent.
(b)  Every user (on average) is expected to have high affinity to only 1 model.
This means if we have a two sets mixture models {0.002, 0.99, 0.003} and {0.33, 0.33, 0.33} for a random user, the former configuration is preferable.
3. Corollaries based on above points:
(a) A model is simply a cluster of users.
(b) A user can belong to multiple cluster.

3.2 Modified steps for speech recognition
We propose a modified methodology that will improve the accuracy of the speech recognition:
1. Before releasing the API/software:
1. (a)  Developers encourages trainers to record their voice.
2. (b)  Using the model described in section 3.3, we get k models (For example: k = 100).
3. (c)  Chose one of these k model as a default model. Also, set a value for the parameter n (For example: n = 3).
4. (d)  for i = 1 to n do:
- Select a new phrase pi such that all k models have at least one audio file that have phrase pi or one very similar to the phrase pi. Hint: During training process, ask every trainer to speak the phrase pi. Use edit-distance and some tolerance delta (= 5) to find similar phrases.

5. (e) Developers bundle together the default model and the search algorithm and release the API/software.
2. After releasing the API/software:
1. (a) During installation (or during periodic re-tuning), for i = 1 to n do:
1. Ask user to speak the phrase pi and send the audio file to the server.
2. For each model j, do:
1. Do speech-to-text transcription for the given audio file.
2. Let p′i,j be the output phrase of the transcription.
Let εi,j be a measure that calculates the error in transcription of phrase pj on model j.For example: εi,j = Edit-Distance(p′i,j,pi,j). Note that, depending on the training data, an expert might chose to use a different metric (before releasing the API) such as acoustic similarity between a random audio file in the model and the audio file outputted by the use.
2. (b)  Return the model to the user that has minimum error for all the phrases
3. (c)  The user uses the model that is send by the server rather than a universal model.
3.3. Model for clustering the trainers
Step 1: Create two sets:
• Set 1 (Set of models): Models that have been found to contain similar users. Initial value = {}.
• Set 2 (Set of trainers not yet clustered): Put all the trainers in Set 2
Step 2: Train an acoustic model M using audio files of trainers in set 2.
Step 3: Run recognition for the audio files of the trainers in set 2 You get εi (i.e. Edit-distance between output phrase of recognizer and the true phrase). This means some of the audio files are used as held-out. It might also be interesting to have no held-out and see the error when the training and test data are the same.
Step 4: Find maximal subset of the trainers whose Zi matches the trainers for all the phrases Train the model using these trainers and add it to Set 2 (after a sanity test i.e. this model should be able to predict a random set of audio files that it was trained on)
Step 5: Put the remaining in Set 2 and goto Step 2.

## 4. References

1. [1]  X. Huang and K.F. Lee. On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. Speech and Audio Processing, IEEE Transactions on, 1(2):150 –157, apr 1993.
2. [2]  Geoffrey Zweig and Michael Picheny. Advances in large vocabulary continuous speech recognition. volume 60 of Advances in Computers, pages 249 – 291. Elsevier, 2004.

## Thursday, February 02, 2012

### Cross-validation on SVM-Light

Before reading this blogpost you should read: Practical guide to libSVM, Wikipedia entry on SVM, Wikipedia entry on cross-validation, and PRML - Bishop's chapter on SVM.  In this post, I will not explain how to use open source SVM tools (like libsvm or svmlight), but would rather assume that you have used them and want some clever script to automate them. Also note that, the below script uses default parameters. So if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to the program ./svm_learn.

To perform cross-validation on libsvm:
2. You need gnuplot installed on the machine you want to run libsvm. If you want to run it on the server, do 'ssh -X myuserName@myMachineName'
3. make
4. Go to Tools folder and run 'python easy.py my_training_file my_test_file'
To perform cross-validation on svmlight:
1. Download and extract it. There you will find two important binaries: svm_learn and svm_classify
2. Unfortunately the data gathering using svmlight is not easy, so I have coded following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5 echo "Running SVM-Light via cross validation on"$1 "by using" $2 "training items and"$3 "test items (Total number of cross-validation cycles:" $4 >$RESULT_FILE

MODEL_FILE="model."$RANDOM".txt" TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt" DATA_FILE=$1
NUM_TRAIN=$2 NUM_TEST=$3
NUM_CYCLES=$4 TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp" TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test" TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: $$[0-9]*.[0-9]*$$% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'
for k in seq 1 $NUM_CYCLES do sort -R$DATA_FILE > $TEMP_DATA_FILE head -n$NUM_TRAIN $TEMP_DATA_FILE >$TRAIN_FILE
tail -n $NUM_TEST$TEMP_DATA_FILE > $TEST_FILE echo "------------------------------------------" >>$RESULT_FILE
echo "Cross-validation cycle:" $k >>$RESULT_FILE

# first run svm with default parameters
echo ""  >> $RESULT_FILE echo "Polynomial SVM with default parameters" >>$RESULT_FILE
for i in 1 2 3 4 5 6 7 8 9 10
do
echo "order:" $i >>$RESULT_FILE
./svm_learn -t 1 -d $i$TRAIN_FILE $MODEL_FILE >$TEMP_FILE
./svm_classify -v 1 $TEST_FILE$MODEL_FILE $PRED_FILE >$TEMP_RESULT
cat $TEMP_RESULT >>$RESULT_FILE
sed '/^Reading model/d' $TEMP_RESULT >$TEMP_RESULT"1"
sed '/^Precision/d' $TEMP_RESULT"1" >$TEMP_RESULT
sed "$SVM_PATTERN$k poly $i \1/g"$TEMP_RESULT >> "better"$RESULT_FILE done echo "" >>$RESULT_FILE
echo "RBF SVM with default parameters" >> $RESULT_FILE for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000 do echo "gamma:"$g >> $RESULT_FILE ./svm_learn -t 2 -g$g $TRAIN_FILE$MODEL_FILE > $TEMP_FILE ./svm_classify -v 1$TEST_FILE $MODEL_FILE$PRED_FILE >> $TEMP_FILE cat$TEMP_RESULT >> $RESULT_FILE sed '/^Reading model/d'$TEMP_RESULT > $TEMP_RESULT"1" sed '/^Precision/d'$TEMP_RESULT"1" > $TEMP_RESULT sed "$SVM_PATTERN$k rbf$g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
done

done

rm $MODEL_FILE$TEMP_FILE $PRED_FILE$TEMP_DATA_FILE $TEMP_RESULT$TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE  Here is how you run this script: ./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt  myDataInSVMLightFormat.txt looks like "binaryClassifier(-1 or 1) featureNum:value ....". Example: 1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06  After running this script you will get two important files: MyResults.txt and betterMyResults.txt (which is a space separated file with format: 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt in R or Matlab and get some pretty graphs :) ## Wednesday, February 01, 2012 ### Using VIM for "fast" data transformation Recently, I had to give my data to a lab-mate. I had done some pre-processing on 20 Newsgroup dataset and converted it to following format: - Each line represented a document. - Words were replaced by an integer that denoted it's index in vocabulary.txt file - The line was append with a binary classifier (1 or -1) based on whether the document was in "comp" newsgroup or "sci" newsgroup. - The words were represented along with their frequency. For example, let's assume for now the word "technology" is 1002th word in the vocabulary. Then, if the word "technology" appeared 10 times in 12th document (in the folder comp.graphics), then 12th line in my file will be: 1 1002:10 ... Here is a sample line in the file: 1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1 1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1  Now, I had to convert this to following format: fileNum | wordIndex | frequency For example: 3|3798|1 3|9450|1 3|12429|1 3|13352|1 3|14155|1  I used following VIM tricks: 1. First delete the binary classifiers (and also blank spaces at the end of the line): :%s/^-1//g :%s/^1//g :%s/$//g

2. Then, comes the most important part: Insert line number before every word.
To do this, I used following trick: - First replace spaces by some arbitrary character, in my case '='
- Then, replace that character by the line number !!!
:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"

The last command needs some explanation:
It says, execute a command globally whenever you see '=' character: :g/=/exec some-command
In our case the command is replacing that character by the line number. The line number of a given line can be found using VIM's internal function: line(".")
Hence, the input file is converted into following format:
3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1

3. Now, to some garbage cleaning and trivial replacement:
:%s/:/|/g
:%s/  / /g
:%s/^ //g

4. Finally, replace spaces by a newline character:
:%s/ /!!!!!!/g where !!!!!! is press Control+V and then press Enter

Control-V is the special character escape key. Remember the character '\n' won't work here.

And voila you are done !!!

## Tuesday, January 17, 2012

### Iris - siri like software

Recently I have started a pet project Iris1 while doing background study for my recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.

Portable:
Since it is written in Java, it can run on most OS, including Mac, Windows and Linux. I have only tested it on Mac Lion, so I welcome any beta testers using other OS. For now, some of the features like play itunes, open email, etc are only available on Mac. I plan to add these features on other OS as well.

Configurable:
Most of the features such as use of VOICE or TEXT as feedback mechanism, as well as feedback message are configurable through the configuration file ''user_scripts/iris.config". You can also modify any of the predefined apple scripts (for eg: playSong.scpt) in that folder which will be called by Iris on the voice cammand. Also, for advance users, there are ten function commands that he/she can use to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run following command:

java -Xmx200m -jar bin/Iris.jar

The above command asks Java to run the program Iris.jar (in bin folder) and also specifies that it should not use more than 200 MB of memory.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if have no internet connection. This also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators that would help me with following features:
1. User-friendly: Improving voice commands
2. More accuracy: Expanding dictionary to include more words and also to add different 'user accents' for a given word.
3. Expanding grammar to include arbitrary spoken words.
4. Apple scripts to accessing contacts and recognizing contact names to assist users with various functionality.
5. Adding GUI interface rather than command line. Since all the responses go through OutputUtility.java, this is going to be easy.
6. New features: Support for RSS, Top News, Weather, Popular city names, Dictionary/Theasurus, Web/Wikipedia search, Dictation (to text/messages/events/todo item/alarm/reminders), Jokes, Motivational quotes, More Personalization.
If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
1 Iris is name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also the anagram of the word 'Siri' (a popular Apple's software for IPhone 4s).

## Monday, January 02, 2012

### C++ workflow

Recently I had a memory bug that took me 2 days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to completely random place that led me to believe STL's stringstream was the culprit. Since there was no way I could have definitively say it's not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of code using a less elegant solution using arrays. Now gdb started pointing to STL's vector.clear() method and believe it or not I was using std::string to store in it, not some user defined class …. I had it … there is no way in the world I am going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector < str::string >). But, not knowing the bug in a 'single threaded' 'command-line based' C++ code with access to gdb and pretty printing was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me, due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is it is one of the few memory bugs that valgrind does not report in its final assessment and my best guess is that I was writing over vector's header location (and vector only complains about that during its next operation (in my case was clear()) ... which for all intensive purposes is random).

Embarrassed by this stupid mistake that wasted my two days, I felt humbled to revisit my C++ work-flow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your makefile that does static analysis (make my_proj).
- Test it on local machines using 'make my_test'
- Run valgrind if necessary and check against below given valgrind checklist
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

## Automate whenever possible (with emphasis on reusability)

1. Never ever start coding without an SVN (or cvs or git, etc.). It will save you lot of headaches if your machine crashes or if you want to revert back to a previous version.
If the project has multiple coders/testers/collaborators, it helps if you have some kind of project-management tool like trac or jira. Also, if you are going to release your code to public, you might want to consider project hosting sites like Google Code or source forge.
Though this might seem obvious, don't host the svn server on personal machine or at least make sure to take regular backups if you do.
2. Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc) and a user-friendly README file. (Advance coders can consider automate). Here is a sample Makefile of my current project:

CC = g++
DEBUGFLAGS = -g -Wall -Wextra
MINUITFLAGS = root-config --cflags root-config --libs -lMinuit2 -fopenmp
LIBS = -lgsl -lgslcblas -lm -lstdc++

HEADER = Helper.h MathHelper.h LDA.h Configuration.h
SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
OUTPUT = spoken_web

spoken_web:     $(SRC)$(HEADER)
$(CC)$(SRC) -o $(OUTPUT)$(DEBUGFLAGS) $(LIBS) doc:$(HEADER) spokenweb_doxygen_configuration
doxygen spokenweb_doxygen_configuration

r_data: GenerateData.R
R --slave < GenerateData.R

style: $(SRC)$(HEADER)
astyle --options=spokenweb_astyle_rules $(SRC)$(HEADER)

clean:
rm -f *.o \$(OUTPUT) *~ *.data parameters.txt metaFile.txt *.data

3. Use a document generator like doxygen. It forces you to write much cleaner documentation and more importantly cleaner interface. Trust me it will help if you had to reuse your code in future.
make doc

4. Use code beautifier like astyle that not only shields you against inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading' (Don't you clean gym equipment after workout, well, why not do the same with your code). Here is sample style rules that I love:

--style=java
--indent=spaces=2
--indent-classes
--indent-switches
--indent-preprocessor
--min-conditional-indent=0
--max-instatement-indent=80
--break-closing-brackets
--keep-one-line-blocks
--keep-one-line-statements
--convert-tabs
--align-pointer=type
--align-reference=type
--lineend=linux
--suffix=none

If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
make style

5. Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
• g++'s flags (See DEBUGFLAGS in my makefile)
• Static analysis tools like cppcheck (For more cases, I am satisfied with the above g++'s flags)
• Runtime memory analyzer tool like valgrind. (Please see below given Valgrind checklist, and also my previous blogpost).
• gdb: For stepping through the code for more comprehensive debugging
There are several paid tools that do much better job of analyzing your code that those mentioned above. There are
6. I like to follow standard directory structure:
./           Makefile and configure scripts
./src       cc , h files (Use subdirectories and namespaces if necessary)
./bin      build directory
./data    data required for testing/running the code
./test      test drivers
./doc     documentation generated by doxygen
7. Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for bayesian programs, which has very ugly interface and requires some amount of book-keeping. So, I have coded a wrapper class that gives an R like interface which makes my code extremely readable, less error prone and I don't need to worry about any book-keeping or other numerical issues while calling them.

/// \defgroup StatDistribution R-like API for family of probability distributions
/// Naming convention:
/// I have followed the naming conventions of R programming language.
/// Please see its documentation for further details
///
/// Here is a brief intro:
/// Every distribution that R handles has four functions. There is a root name,
/// for example, the root name for the normal distribution is norm.
/// This root is prefixed by one of the letters
///   - d for "density", the density function
///   - r for "random", a random variable having the specified distribution
/// Example: For the normal distribution, these functions are dnorm, and rnorm.
///
/// Important note: Since most bayesian calculation is done in log space, (almost)
/// every density function is capable of outputing 'numerically robust' logarithmic value.
/// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
///
/// Suported continuous distribution:
/// - Uniform (unif - r,d)
/// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
/// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
/// - Beta (beta - r,d)
/// - Inverse Gamma (invgamma - r,d)
///
/// Suported discrete distribution:
/// - Discrete Uniform (dunif - r,d)
/// - Dirichlet (dirichlet - r,d)
/// - Multinomial (multinom - r,d)
/// - Poisson (pois - r,d)
/// @{

/// @name Continuous Uniform Distribution
///@{
/// Returns uniform variable [min, max)
double runif(double min, double max);
double dunif(double x, double min, double max, bool returnLog = false);
///@}

/// @name Discrete Uniform Distribution
///@{
/// Returns uniform variable [min, max) --> i.e. [min, max-1]
int rdunif(int min, int max);
double ddunif(int x, int min, int max, bool returnLog = false);
///@}

/// @name Univariate Normal Distribution
///@{
double rnorm(double mean, double sd);
double dnorm(double x, double mean, double sd, bool returnLog = false);
///@}

/// @name Multinomial distribution
/// Both prob and retVal should be of length sizeOfProb
/// size, say N, specifying the total number of objects that are put into
/// K (or sizeOfProb) boxes in the typical multinomial experiment.
/// If the array prob is not normalized then its entries will be treated as
/// weights and normalized appropriately. Furthermore, this function also allows
/// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
/// Eg: double prob[3] = {0.1, 0.2, 0.7}; double retVal[3]; rmultinom(1, 3, prob, retVal, false);
///@{
void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
///@}

So to sample N(10,2), instead of having code like:

gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
gsl_rng_free(rng_obj);

I have to use following code (which if you have coded in R is more readable):

StatisticsHelper myStatHelper;
double sampledVal = myStatHelper.rnorm(10, 2);

## Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:
1. The heap summary (only returns memory leaks, not memory corruption errors) Eg:

==19691== HEAP SUMMARY:
==19691==     in use at exit: 0 bytes in 0 blocks
==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
==19691==
==19691== All heap blocks were freed -- no leaks are possible

If number of allocs and the number of frees will differ in o/p, the try
valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

Steps to avoid memory leaks:
- If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
- Delete every object created by new (especially in case of exceptions).
2. Intermediate warning/error messages irrespective of the heap summary. Eg:

==15633== Invalid write of size 4
==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
==15633==    by 0x402E26: main (main.cpp:34)

Note, sometime one error can propagate and throw error messages at different places, so don't panic.

The intermediate error messages can detect:
• Illegal read / Illegal write errors (like the message 'Invalid read of size 4' above): It can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with '--read-var-info=yes' flag.
Common memory corruption bugs like these are:
- Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
- Using an unallocated object

my_class* obj_ptr;
obj_ptr -> var1 = val; // or val = obj_ptr -> var1;

- Using a freed object
Note: Memcheck (tool of valgrind) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there is no memory corruption errors.
• Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
• Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
• Illegal free/deletions: (Invalid free())
- Calling free() (or delete) on the object that is already freed (or deleted).
- Deleting object that is created on stack.
• Mismatched operators: (Mismatched free() / delete / delete [])
- Calling free() on the object that is created by new operator.
- Calling delete on the object that is created by malloc().
- Calling delete[] on the object that is created by the new operator.
• Overlapping source and destination blocks: (Source and destination overlap in ...)
- Look out functions like memcpy, strcpy, strncpy, strcat and strncat.

References: