
Saturday, October 12, 2013

Notes on entrepreneurship (Part 2)

In the first class, we divided ourselves into random groups of four people each (mostly from different backgrounds). The idea was to pitch a startup idea within 10 minutes based on the means of our group members. This exercise was then repeated with a different set of people. One thing I noticed after this exercise was that even though it seems like a nice idea to start with your means and build a startup based on them, people usually revert to the idea that they are absolutely passionate about, irrespective of their means [8].

As an assignment that week, we were supposed to take $5 and, in the span of two hours, make as much money as we could. My group decided to go with "Grocery delivery service for professors", and each of us was supposed to ask professors in our department to help with that. I got to speak to only a few professors in the Computer Science department (as it was Friday), and only Swarat was on board with the idea. My team-mates were unable to get any professors from their respective departments, so they decided to go with another idea: "Personalized cards and their delivery". We made $10 in tips with the first idea and ~$13 with the second. The lesson from this exercise was that sales is hard, but probably the most important part of a startup [9].

In the next class, each of us gave an elevator pitch of our idea for a startup. Let me describe my idea in terms of the previous post:
1. Means:
- What do I know ? Background in software development, research experience in building large-scale systems, and knowledge of embedded systems.
- What do I have ? Very low capital ... student salary :(
- Whom do I know ? Software professionals in India and the US (from my bachelors, masters and job experience), marketing/advertising professionals in India (my father's advertising firm and my MBA friends), a trustworthy partner in India (my best friend and brother Rohan), research scientists (my advisor, my colleagues at Rice University [3], my collaborators, contacts from internships and also from my experience as President of the Rice Computer Science Graduate Student Association). Other than a couple of exceptions, until now I have been lucky enough to be surrounded by really nice people, which is why I believe this to be my strongest means [4].

2. Ideation:
In the previous blogpost, I vented my frustration over sexual assault cases in India and also provided some high-level suggestions (which, I must admit, I had no control over). So, I decided to use my means to develop something that might help improve the situation (in whatever little way possible).
Unlike in the US, the commonly accepted safety net in India is not the government/police, but the social structure (i.e. your friends and family). Many personal safety solutions like pepper spray, tasers or guns require state licenses in many countries and are even prohibited in some countries, like Canada, China, Bangladesh, Singapore, Belgium, Denmark, Finland, Greece, Hungary, etc. Not to mention they are expensive and can often escalate the situation. This is why many people follow a self-imposed curfew, that is, either not leaving home after dark or having someone accompany them.
With the advent of smartphones, an obvious solution seems to be "have a personal safety app". Most of these apps have cool features, such as sending your GPS location to your friends and family, but they have a major flaw that makes them non-functional: they require you to take out your phone, open the app and press an SOS button in it. Clearly, this is not what an average person would do in a time of danger. Here is some of the feedback on such apps [1]:
- I think a "dangerous environment" is the last place I would want an expensive item like an iPhone prominently displayed.
- ... app (might not be) readily available to access, on screen #8 somewhere on the 2nd or third row ...
- Pulling out your iPhone in the face of an attacker? That's one sure way to escalate the situation. And he now knows you have a nice phone too.
Reality check: not only am I passionate about this problem, there also does not exist a solution that works satisfactorily (i.e. there are real "pain points") ... to me, the existing apps are just cosmetic, the kind you buy as a keepsake but that serve no real purpose to society.

Before discussing the next point, let me take a small detour and tell you what metric I use to determine the success of this startup [2]: at the end of the course (or maybe a few months past that), I should be able to make an app that works seamlessly in real-life situations and that I can recommend to my loved ones without any hesitation.

3. Opportunity evaluation:
Like many computer programmers, I have a habit of developing software that does exactly what I want but not what my audience would need. To ensure that didn't happen, I sent out a survey asking people what they would like during an emergency. The demographics of the participants were as follows:
- 53% males and 47% females
- 82% of participants were between 20-30 years old and 14% between 30-40 years old.
- 59% from US, 16% from India and 8% from China
- 24% had Bachelors degree, 47% had Masters degree, 27% were PhD students.
- 32% earned between $10K-50K, 28% earned between $50K-100K, 16% earned above $100K and 20% were not currently earning.

Here is the summary of responses:

The above figures show that the largest group of respondents (46%) wanted an external device, like a Bluetooth clicker, that a victim can press when he/she is in danger. This was a good indicator that I should go ahead and spend some time building such an app.

The next step was to understand what features a user would want/need in times of emergency:

Since some of these features require backend services (for sending emails/SMS, managing accounts/the app), which need capital (I am not rich enough to maintain this kind of service on my own), I asked how much people were willing to pay for this kind of service. It was clear that most people preferred a buy-once model:


Now, the checklist of opportunity evaluation:
a. Unique value proposition of my idea: an "affordable" and "hassle-free" way to connect to your friends, family and the authorities in times of emergency ... with just the click of a button.
b. Is it defensible: Nope, anyone with strong programming experience can replicate the features of my app. In fact, in the long run, that is exactly what I want ... lots of good apps and competition that eventually help reduce the number of sexual assaults and make the world a little safer place.
c. Is it profitable/sustainable: I really don't know :( ... It could be a product like Dropbox, which people never thought they would want ... but once they got it, they couldn't imagine their lives without it ... or it could be a total flop. The only way to be completely sure is by implementing it :)
d. Clearly defined customer: young women traveling abroad or working late, senior citizens and frequent travelers.
e. Is it feasible and scalable: Yes. I used my experience in building large-scale, cost-effective systems to build the backend. Also, my experience in C/C++ programming, knowledge of design patterns and the user-friendly Apple documentation helped me: (a) learn the basics of Objective-C in less than a week and (b) build version 1.0 of the iOS app in less than a month.

Finally, I must reiterate the core principle of opportunity evaluation for a startup: it is not possible to know a priori whether an idea will turn out to be a good business or not. So, I decided to stick to the "affordable loss" principle and develop the app at as low a cost as I could. A few of the hacks I used to ensure "affordable loss" are as follows:
- Moving most of the computing to the clients rather than the server (so as not to buy overly expensive servers).
- Using pay-as-you-use services wherever I felt it absolutely necessary (for example: Amazon Web Services).
- Using GIMP to develop my own logo (which I had to learn, btw :P) rather than hiring a designer [6].
- Using WordPress for the startup website rather than spending days perfecting the CSS to make it mobile-compatible or hiring a web developer.
- Focusing on a minimum viable product (using Texas Instruments' SensorTag) rather than prematurely buying tons of Bluetooth clickers from China [12].
- Using in-app purchases, rather than setting up a credit-card system on my website, to pay for features that other services charge me for (for example: SMS/email) [10].
- Not running after patents early on in the venture [5].
- Buying a ready-made icon set rather than designing the icons yourself [7].
- Choosing a hosting plan that has no hidden fees and that supports your choice of backend services.
- Using Gmail as the support email (rather than one provided by the hosting service), and adding a feedback button in the app as well as a contact form on the website (with some kind of captcha [13]). One tip I have for new developers: try to minimize the number of clicks/keystrokes needed to send feedback in the app, for example by pre-filling the "to" address as well as the subject line.
- Utilizing the membership benefits of the Apple/Google development programs, i.e. off-loading testing [11], advertising/cross-promotion, expert feedback, reliable delivery of your software and version management.

References / footnotes:
[1] http://reviews.cnet.com/8301-19512_7-57452393-233/iwitness-app-claims-to-be-ultimate-deterrent-to-crime/
[2] Rather than using metrics like expected profit/revenue after the first year, a public offering, or something along those lines.
[3] I have already got four other PhD students from Rice University on board to develop the Android/Windows versions of the app. This is in line with another principle of effectual entrepreneurship: form partnerships.
[4] The intent of this statement is not flattery, just an observation. Here is why I think so: though I never deliberately tried to enforce it, my circle of influence consists of three types of people: those who are genuinely nice, those who are smarter than me, and non-self-destructive people. Of course, the categories are not mutually exclusive, and in fact people who are smarter are usually genuinely nice (probably because they focus only on self-upliftment, not on pulling others down). Here is a layman's example most people can relate to: even in a course with relative grading, a smart person knows that he/she is a student of global class and is not threatened by his/her classmates' progress ... which is probably why you will rarely find such a student shying away from group discussions (so as to gain an advantage) or deliberately spreading false information to sabotage others' grades. Maybe things are not so black-and-white in other fields, which is one of the reasons why I absolutely love research and development.
[5] A patent attorney who attended our class thought one of the features of our app is patentable, which might be true. But there are two problems with going that way: (a) the real cost of a patent is not in getting it, but in defending it; (b) it would limit the features that other programmers who are smarter than me can introduce in their personal safety apps (which goes against my success criterion for this particular problem). To be completely honest, there are ways around the low-capital issue for startups that really want a patent: (1) file a provisional patent yourself for under $200 (just read about how to define the scope of your patent and also about court dates), (2) ask your parent organization or an angel investor to file a patent for you in exchange for royalties or a stake in the startup (for example, Rice University's OTT office), (3) contract a patent troll to defend your patent.
[6] For people with a little more budget, there are websites like http://99designs.com/ and https://www.elance.com/ that allow you to hire freelance designers/developers. A similar website for building mock prototypes of your product to show to investors is http://www.protomold.com/.
[7] There is always a tradeoff between time and money; you just have to figure out your exchange rate for time ;) ... for example, if someone is providing a service that will save you one hour, how much are you willing to pay for it?
[8] Whether building your startup "based on your existing means" is better than "based on your passion, expanding your means as you go", or vice versa, I really don't know. There are obvious advantages to both, and also obvious disadvantages when either principle is pushed to the extreme.
[9] Even though you might think an idea is pretty good (and will benefit the customer), people are not willing to pay as much as you think ... probably because they suspect you are trying to dupe them, or because you have valued the product/market incorrectly, or something else.
[10] Though it might seem simple to just plug in an existing credit-card library and set up a PHP page with a MySQL backend, things get a little complicated when you start thinking about transactional semantics and the fact that people can buy new devices, delete the app, and so on.
[11] In a traditional company, a developer would be evaluated based on stupid metrics, such as bugs assigned, bugs solved, etc. Though on paper they seem apt, they can have harmful side effects for a startup, such as spending too much time perfecting a feature without any customer validation/feedback. So, instead of spending significant resources on testing, submit your app to Apple as and when you add a new feature and indirectly ask them to test it :) ... Another way to get cheap testing is to use crowd-sourcing websites such as Amazon's Mechanical Turk (which I will ignore for this post).
[12] Since the SensorTag took a while to be delivered to my home address (thanks to my apartment complex rejecting the package), I decided to use an accessory that I already owned (i.e. headphones) for version 1.0 and introduce the Bluetooth feature in the next version.
[13] There are lots of WordPress plugins that let you add a captcha to your website in just a few minutes.

Thursday, March 14, 2013

Video tutorial on reinforcement learning

This video tutorial is based on a talk I gave in my Comp 650 class and hence assumes that the audience knows a little about Markov Decision Processes and Partially Observable Markov Decision Processes. If you are unfamiliar with them, I recommend that you at least read through their Wikipedia pages before going through this tutorial. Also, I suggest that you view this video on YouTube, because the embedded video below is cropped.


If you have any suggestions, comments or questions, please feel free to email me :)

PDF slides

Thursday, February 02, 2012

Cross-validation on SVM-Light

Before reading this blogpost you should read: the practical guide to libSVM, the Wikipedia entry on SVM, the Wikipedia entry on cross-validation, and Bishop's chapter on SVM in PRML. In this post, I will not explain how to use open-source SVM tools (like libsvm or svmlight); rather, I assume that you have used them and want some clever script to automate them. Also note that the script below uses default parameters, so if you get poor results, you can add an outer for-loop that iterates through your choice of parameters and supplies them to ./svm_learn.

To perform cross-validation on libsvm:
  1. Download and extract it
  2. You need gnuplot installed on the machine on which you want to run libsvm. If you want to run it on a server, do 'ssh -X myuserName@myMachineName'
  3. make
  4. Go to the tools folder and run 'python easy.py my_training_file my_test_file'
To perform cross-validation on svmlight:
  1. Download and extract it. There you will find two important binaries: svm_learn and svm_classify
  2. Unfortunately, gathering cross-validation data with svmlight is not as easy, so I have written the following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5

echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4 > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"

TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: \([0-9]*.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'
for k in `seq 1 $NUM_CYCLES`
do
 sort -R $DATA_FILE > $TEMP_DATA_FILE
 head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
 tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
 
 echo "------------------------------------------"  >> $RESULT_FILE
 echo "Cross-validation cycle:" $k >> $RESULT_FILE
 
 # first run svm with default parameters
 echo ""  >> $RESULT_FILE
 echo "Polynomial SVM with default parameters" >> $RESULT_FILE
 for i in 1 2 3 4 5 6 7 8 9 10
 do
  echo "order:" $i >> $RESULT_FILE
  ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

 echo ""  >> $RESULT_FILE
 echo "RBF SVM with default parameters" >> $RESULT_FILE
 for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
 do
  echo "gamma:" $g >> $RESULT_FILE
  ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

done

rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TRAIN_FILE $TEST_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE

Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt

myDataInSVMLightFormat.txt looks like "binaryClassifier(-1 or 1) featureNum:value ....". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06 

After running this script you will get two important files: MyResults.txt and betterMyResults.txt (a space-separated file with the format 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt into R or MATLAB and get some pretty graphs :)

Wednesday, February 01, 2012

Using VIM for "fast" data transformation

Recently, I had to give my data to a lab-mate. I had done some pre-processing on the 20 Newsgroups dataset and converted it to the following format:
- Each line represented a document.
- Words were replaced by an integer denoting their index in the vocabulary.txt file.
- Each line began with a binary class label (1 or -1) based on whether the document was in a "comp" newsgroup or a "sci" newsgroup.
- The words were represented along with their frequency.
For example, let's assume for now that the word "technology" is the 1002nd word in the vocabulary. Then, if the word "technology" appeared 10 times in the 12th document (in the folder comp.graphics), the 12th line in my file would be:
1 1002:10 ...

Here are two sample lines from the file:
1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1
1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1
Now, I had to convert this to the following format:
fileNum | wordIndex | frequency
For example:
3|3798|1
3|9450|1
3|12429|1
3|13352|1
3|14155|1
I used the following VIM tricks:
1. First, delete the binary class labels (and also the blank spaces at the end of each line):

:%s/^-1//g
:%s/^1//g
:%s/ $//g
2. Then comes the most important part: insert the line number before every word.
To do this, I used the following trick: - First, replace spaces with some arbitrary character, in my case '='
- Then, replace that character with the line number !!!
:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"
The last command needs some explanation: 
It says: execute a command globally whenever you see the '=' character: :g/=/exec some-command
In our case, the command replaces that character with the line number. The line number of the current line can be found using VIM's internal function line(".").
Hence, the input file is converted into the following format:
3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
  4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1
3. Now, some garbage cleaning and trivial replacements:
:%s/:/|/g
:%s/  / /g
:%s/^ //g
4. Finally, replace spaces by a newline character:
:%s/ /<Ctrl-V><Enter>/g   (i.e. in place of the replacement text, press Control+V and then press Enter, which inserts a literal ^M)
Control-V is the escape key for entering special characters. Remember that the character '\n' won't work here; in the replacement part of a substitution, use the Control-V trick (or '\r').

And voila you are done !!!

Tuesday, January 17, 2012

Iris - siri like software

Recently I started a pet project, Iris [1], while doing background study for my recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.


Portable:
Since it is written in Java, it can run on most operating systems, including Mac, Windows and Linux. I have only tested it on Mac OS X Lion, so I welcome beta testers on other operating systems. For now, some features, like playing iTunes and opening email, are only available on Mac. I plan to add these features on other operating systems as well.

Configurable:
Most features, such as the use of VOICE or TEXT as the feedback mechanism, as well as the feedback messages, are configurable through the configuration file "user_scripts/iris.config". You can also modify any of the predefined AppleScripts (e.g. playSong.scpt) in that folder, which Iris calls on the corresponding voice command. Also, for advanced users, there are ten function commands that can be used to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run the following command:
 
java -Xmx200m -jar bin/Iris.jar
The above command asks Java to run the program Iris.jar (in the bin folder) and also specifies that it should not use more than 200 MB of memory.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if you have no internet connection. It also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators who would help me with the following features:
  1. User-friendliness: improving the voice commands
  2. More accuracy: expanding the dictionary to include more words and adding different 'user accents' for a given word.
  3. Expanding the grammar to include arbitrary spoken words.
  4. AppleScripts for accessing contacts and recognizing contact names to assist users with various functionality.
  5. Adding a GUI rather than the command line. Since all the responses go through OutputUtility.java, this is going to be easy.
  6. New features: support for RSS, top news, weather, popular city names, dictionary/thesaurus, web/Wikipedia search, dictation (to text/messages/events/todo items/alarms/reminders), jokes, motivational quotes, more personalization.
If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
[1] Iris is the name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also an anagram of the word 'Siri' (Apple's popular software for the iPhone 4S).

Monday, January 02, 2012

C++ workflow

Recently I had a memory bug that took me two days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to a completely random place, which led me to believe STL's stringstream was the culprit. Since there was no way I could definitively say it was not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of the code with a less elegant solution using arrays. Now gdb started pointing to STL's vector::clear() method, and believe it or not, the vector was storing std::string, not some user-defined class ... I'd had it ... there was no way in the world I was going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector<std::string>). But not being able to find the bug in single-threaded, command-line C++ code, with access to gdb and pretty printing, was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and a hot cup of tea, I decided to ignore the bug and beautify my code with better comments ... geek alert !!! That's when it hit me: due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is that this is one of the few memory bugs that valgrind does not report in its final assessment, and my best guess is that I was writing over the vector's header location (and the vector only complains about that during its next operation, which in my case was clear() ... which for all intents and purposes is random).
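Here is a hypothetical reconstruction of that kind of bug (a minimal sketch with made-up names; the exact symptom depends on how the compiler lays out the stack, and memcheck typically stays silent because the array lives on the stack):

#include <string>
#include <vector>

int main() {
  // A std::vector that happens to live next to a fixed-size stack array.
  std::vector<std::string> words(3, "hello");

  int counts[4];
  // Off-by-one loop: valid indices are 0..3, but i runs up to 4, so the
  // last iteration writes one int past the end of 'counts'. Depending on
  // stack layout, this can silently corrupt the vector's internal state.
  for (int i = 0; i <= 4; ++i) {
    counts[i] = i;
  }

  // The crash (if any) shows up here, far away from the actual bug.
  words.clear();
  return 0;
}

Tools that instrument stack accesses (e.g. AddressSanitizer) are more likely to catch this class of bug than memcheck is.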

Embarrassed by this stupid mistake that wasted two of my days, I felt humbled enough to revisit my C++ workflow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your makefile that does static analysis (make my_proj).
- Test it on local machines using 'make my_test'
- Run valgrind if necessary and check against the valgrind checklist given below
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

  1. Never ever start coding without version control (SVN, CVS, git, etc.). It will save you a lot of headaches if your machine crashes or if you want to revert to a previous version.
    If the project has multiple coders/testers/collaborators, it helps to have some kind of project-management tool like trac or jira. Also, if you are going to release your code to the public, you might want to consider project hosting sites like Google Code or SourceForge.
    Though this might seem obvious, don't host the svn server on a personal machine, or at least make sure to take regular backups if you do.
  2. Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc.) and a user-friendly README file. (Advanced coders can consider automated build tools such as automake.) Here is a sample Makefile from my current project:
     
    CC = g++
    DEBUGFLAGS = -g -Wall -Wextra
    MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
    LIBS = -lgsl -lgslcblas -lm -lstdc++
    
    HEADER = Helper.h MathHelper.h LDA.h Configuration.h
    SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
    OUTPUT = spoken_web
    
    spoken_web:     $(SRC) $(HEADER)
            $(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)
    
    doc:    $(HEADER) spokenweb_doxygen_configuration
            doxygen spokenweb_doxygen_configuration 
    
    r_data: GenerateData.R
            R --slave < GenerateData.R
    
    style: $(SRC) $(HEADER)
            astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)
    
    clean:
            rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt *.data
    
  3. Use a document generator like doxygen. It forces you to write much cleaner documentation and, more importantly, cleaner interfaces. Trust me, it will help if you have to reuse your code in the future.
    make doc
    
  4. Use a code beautifier like astyle; it not only shields you against an inelegant colleague/coder, but also personalizes the code if you like to do 'code reading' (don't you clean gym equipment after a workout? Well, why not do the same with your code). Here are sample style rules that I love:
     
    --style=java
    --indent=spaces=2
    --indent-classes
    --indent-switches
    --indent-preprocessor
    --indent-col1-comments
    --min-conditional-indent=0
    --max-instatement-indent=80
    --unpad-paren
    --break-closing-brackets
    --add-brackets
    --keep-one-line-blocks
    --keep-one-line-statements
    --convert-tabs
    --align-pointer=type
    --align-reference=type
    --lineend=linux
    --suffix=none
    
    If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
    make style
    
  5. Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
    • g++'s flags (See DEBUGFLAGS in my makefile)
    • Static analysis tools like cppcheck (for most cases, I am satisfied with the above g++ flags)
    • Runtime memory analysis tools like valgrind (please see the valgrind checklist given below, and also my previous blogpost)
    • gdb: For stepping through the code for more comprehensive debugging
    There are several paid tools that do a much better job of analyzing your code than those mentioned above.
  6. I like to follow standard directory structure:
    ./           Makefile and configure scripts
    ./src       cc , h files (Use subdirectories and namespaces if necessary)
    ./bin      build directory
    ./data    data required for testing/running the code
    ./test      test drivers
    ./doc     documentation generated by doxygen
  7. Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have a good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for Bayesian programs; it has a very ugly interface and requires some amount of book-keeping. So, I have written a wrapper class that gives an R-like interface, which makes my code extremely readable and less error-prone, and I don't need to worry about any book-keeping or other numerical issues while calling it.
     
    /// \defgroup StatDistribution R-like API for family of probability distributions
    /// Naming convention:
    /// I have followed the naming conventions of R programming language. 
    /// Please see its documentation for further details
    ///
    /// Here is a brief intro:
    /// Every distribution that R handles has four functions. There is a root name, 
    /// for example, the root name for the normal distribution is norm. 
    /// This root is prefixed by one of the letters
    ///   - d for "density", the density function
    ///   - r for "random", a random variable having the specified distribution
    /// Example: For the normal distribution, these functions are dnorm, and rnorm.
    ///
    /// Important note: Since most bayesian calculation is done in log space, (almost) 
    /// every density function is capable of outputting a 'numerically robust' logarithmic value. 
    /// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
    ///
    /// Supported continuous distributions:
    /// - Uniform (unif - r,d)
    /// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
    /// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
    /// - Beta (beta - r,d)
    /// - Inverse Gamma (invgamma - r,d)
    ///
    /// Supported discrete distributions:
    /// - Discrete Uniform (dunif - r,d)
    /// - Dirichlet (dirichlet - r,d)
    /// - Multinomial (multinom - r,d)
    /// - Poisson (pois - r,d)
    /// @{
    
    /// @name Continuous Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max)
    double runif(double min, double max);
    double dunif(double x, double min, double max, bool returnLog = false);
    ///@}
    
    /// @name Discrete Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max) --> i.e. [min, max-1]
    int rdunif(int min, int max);
    double ddunif(int x, int min, int max, bool returnLog = false);
    ///@}
    
    /// @name Univariate Normal Distribution
    ///@{
    double rnorm(double mean, double sd);
    double dnorm(double x, double mean, double sd, bool returnLog = false);
    ///@}
    
    /// @name Multinomial distribution
    /// Both prob and retVal should be of length sizeOfProb
    /// size, say N, specifying the total number of objects that are put into 
    /// K (or sizeOfProb) boxes in the typical multinomial experiment.
    /// If the array prob is not normalized then its entries will be treated as 
    /// weights and normalized appropriately. Furthermore, this function also allows 
    /// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
    /// Eg: double prob[3] = {0.1, 0.2, 0.7}; double retVal[3]; rmultinom(1, 3, prob, retVal, false);
    ///@{
    void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
    double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
    ///@}
    
    So to sample N(10,2), instead of having code like:
     
    gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
    double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
    gsl_rng_free(rng_obj);
    
    I only have to use the following code (which, if you have coded in R, is more readable):
     
    StatisticsHelper myStatHelper;
    double sampledVal = myStatHelper.rnorm(10, 2);

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:
  1. The heap summary (this only reports memory leaks, not memory corruption errors), e.g.:
     
    ==19691== HEAP SUMMARY:
    ==19691==     in use at exit: 0 bytes in 0 blocks
    ==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
    ==19691== 
    ==19691== All heap blocks were freed -- no leaks are possible
    
    If the number of allocs and the number of frees differ in the output, then try
    valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

    Steps to avoid memory leaks:
    - If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
    - Delete every object created by new (especially in case of exceptions).
  2. Intermediate warning/error messages irrespective of the heap summary. Eg:
     
    ==15633== Invalid write of size 4
    ==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
    ==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
    ==15633==    by 0x402E26: main (main.cpp:34)
    
    Note: sometimes one error can propagate and produce error messages at several different places, so don't panic.

    The intermediate error messages can detect the following (a small, deliberately buggy example that triggers several of these is sketched after this list):
    • Illegal read / Illegal write errors (like the message 'Invalid write of size 4' above): this can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with the '--read-var-info=yes' flag.
      Common memory corruption bugs like these are:
      - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
      - Using an unallocated object
       
      my_class* obj_ptr;
      obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
      
      - Using a freed object
      Note: Memcheck (a valgrind tool) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there are no memory corruption errors.
    • Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
    • Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
    • Illegal free/deletions: (Invalid free())
      - Calling free() (or delete) on an object that has already been freed (or deleted).
      - Deleting an object that was created on the stack.
    • Mismatched operators: (Mismatched free() / delete / delete [])
      - Calling free() on an object that was created by the new operator.
      - Calling delete on an object that was created by malloc().
      - Calling delete[] on an object that was created by the (non-array) new operator.
    • Overlapping source and destination blocks: (Source and destination overlap in ...)
      - Look out for functions like memcpy, strcpy, strncpy, strcat and strncat.
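To make the checklist concrete, here is a small, deliberately buggy sketch (my own illustration, not code from the original post) whose lines map to some of the error classes above when run under valgrind --tool=memcheck:

#include <cstdlib>
#include <cstring>

int main() {
  // Illegal write: "hello" needs 6 bytes (including '\0'), but only
  // 5 were allocated, so strcpy writes one byte past the block.
  char* greeting = static_cast<char*>(std::malloc(5));
  std::strcpy(greeting, "hello");

  // Use of an uninitialised value: *flag is read before being set.
  int* flag = new int;
  if (*flag > 0) {
    *flag = 0;
  }

  // Mismatched operators: allocated with new[], released with delete.
  double* samples = new double[10];
  delete samples;

  // Leak: 'greeting' and 'flag' are never freed, which shows up in the
  // heap summary (number of allocs != number of frees).
  return 0;
}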

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

Tuesday, October 11, 2011

How to analyze algorithms ?

Important assumptions:

  •  This process of analysis should be independent of programming language and machine architecture. So you can't just write your algorithm in C++ and compare it, using running time, against some other algorithm written in Java. 
  • Also, since each algorithm can take an arbitrary data structure as input (e.g. graph, array, binary object, tree, ...), there has to be some standard way to specify it. For the sake of our analysis, we use the input size (n) as the measure, e.g. the length of an array, the number of vertices or edges of a graph, the number of nodes of a tree, etc.
Detailed explanations of these two points are given in Introduction to Algorithms by Cormen, Leiserson, Rivest et al. (CLR)

Setup:

Input of size n ==> Algorithm that takes T(n) steps ==> some output

The number of steps in the algorithm often depends on factors other than the input size. For example, the number of steps quicksort takes for the input [1, 2, 3, 4, 5] is different from that for the input [5, 4, 3, 2, 1], even though both have the same input size (and, in this case, the same elements).
Hence, we will try to find three functions (an upper bound, a lower bound and a Theta bound) that can capture the effects of these factors: $$ g_1(n), g_2(n) \text{ and } g_3(n) \text{ such that } $$ $$ T(n) \in O(g_1(n)), T(n) \in \Omega(g_2(n)) \text{ and } T(n) \in \Theta(g_3(n))$$ Many authors also write the above equations as: $$ T(n) = O(g_1(n)), T(n) = \Omega(g_2(n)) \text{ and } T(n) = \Theta(g_3(n))$$

Big-O notations:

Now, let's understand the definitions of the above notations: \begin{align} \text{To prove that } & g_1(n) \text{ is an upper bound of T(n), i.e. } T(n) \in O(g_1(n)), \\ \text{ find two constants } & c, n_0 \text{ such that } \\ & \forall n \geq n_0, T(n) \leq c g_1(n) \end{align} For a lower bound, change the ≤ sign to ≥, and for a Theta bound, find both a lower and an upper bound: $$ T(n) \in \Theta(g_3(n)) \text{ iff } T(n) \in O(g_3(n)) \text{ and } T(n) \in \Omega(g_3(n))$$
To understand these in more detail, I suggest you solve problems from CLR chapter 3. Here are some examples you can start with: \begin{align} 2n^2 \in O(n^3) & \text{ with } c=1 \text{ and } n_0 = 2 \\ \sqrt{n} \in \Omega(lg n) & \text{ with } c=1 \text{ and } n_0 = 16 \\ \frac{n^2}{2} - 2n \in \Theta(n^2) & \text{ with } c_1=\frac{1}{4}, c_2=\frac{1}{2} \text{ and } n_0 = 8 \\ \end{align} (Note: trivially loose bounds always exist: an extremely fast-growing function is always a valid upper bound, and an extremely slow-growing one is always a valid lower bound. So, it is necessary to supply as tight a bound as possible for any given algorithm. Mathematically, the notation for an upper bound that is not tight is little o and for a lower bound that is not tight is little omega, but we will ignore them for the sake of this discussion.)
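To see how such a claim is verified from the definition, take the first example: $$ \forall n \geq 2: \quad 2n^2 \leq n \cdot n^2 = n^3, \quad \text{i.e. } T(n) \leq c\,g_1(n) \text{ with } c=1 \text{ and } n_0=2 $$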

Common measures for algorithms:

The above three notations are often replaced by the terms 'worst case', 'best case' and 'average case' complexity. More precise definitions are as follows: \begin{align} \text{Worst case: }& T(n) = \max \{ T(x) \mid \text{x is an instance of size n} \} \\ \text{Best case: }& T(n) = \min \{ T(x) \mid \text{x is an instance of size n} \} \\ \text{Average case: }& T(n) = \sum_{\text{input }x} Prob(x) T(x) \end{align} This means that to find the average-case complexity of an algorithm, you need to find the probability of each input of size n and use it to compute the expectation of T(n).
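For example, consider sequential search when the key is equally likely to be at any of the n positions (and one element is examined per step): $$ \text{Average case } T(n) = \sum_{i=1}^{n} \frac{1}{n} \cdot i = \frac{n+1}{2} \in \Theta(n) $$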

Pseudo code to Big-O:

Now, let's address the key question: given pseudo code, how do we find its lower bound (or upper bound, or ...)? Here is one possible way to find those bounds:
  • First, express the algorithm in recursive form
  • Determine appropriate metric for input size 'n'
  • Then, find recurrence relation for the algorithm
  • Solve the recurrence relation using the substitution method, the recursion-tree method or the master theorem

For example, the recurrence relations for popular algorithms are: \begin{align} \text{Binary search: } & T(n) & = T(\frac{n}{2}) + O(1) \\ \text{Sequential search: } & T(n) & = T(n-1) + O(1) \\ \text{Tree traversal: } & T(n) & = 2T(\frac{n}{2}) + O(1) \\ \text{Selection sort: } & T(n) & = T(n-1) + O(n) \\ \text{Merge sort: } & T(n) & = 2T(\frac{n}{2}) + O(n) \end{align} Try writing the pseudo code for any of the above algorithms and verify the recurrence relations (a sketch for binary search is given below).
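As an illustration, here is a minimal recursive binary search sketch (the names are my own): each call does O(1) work and then recurses on half of the range, which is exactly the recurrence T(n) = T(n/2) + O(1).

#include <cstdio>
#include <vector>

// Recursive binary search over a sorted vector: each call does constant
// work and recurses on half of the range, so T(n) = T(n/2) + O(1).
int binary_search(const std::vector<int>& a, int key, int lo, int hi) {
  if (lo > hi) return -1;            // empty range: key not found
  int mid = lo + (hi - lo) / 2;      // avoids overflow of (lo + hi)
  if (a[mid] == key) return mid;
  if (a[mid] < key) return binary_search(a, key, mid + 1, hi);  // right half
  return binary_search(a, key, lo, mid - 1);                    // left half
}

int main() {
  std::vector<int> a = {1, 3, 5, 7, 9, 11};
  std::printf("%d\n", binary_search(a, 7, 0, static_cast<int>(a.size()) - 1));  // prints 3
  return 0;
}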

Master theorem:

The master theorem can only solve recurrence relations of the form T(n) = a T(n/b) + f(n). Here is my modus operandi for the master theorem: \begin{align} \text{If } f(n)=\Theta(n^{log_{b}a}lg^{k}n) \text{ for some } k \geq 0, & \text{ then } T(n) = \Theta(n^{log_{b}a} lg^{k+1}n) \\ \text{Else if } n^{log_{b}a} \text{ is polynomially greater than } f(n), & \text{ then } T(n) = \Theta(n^{log_{b}a}) \\ \text{Else if } f(n) \text{ is polynomially greater than } n^{log_{b}a}, & \text{ then } T(n) = \Theta(f(n)) \\ \text{Else use the substitution method}& \end{align}
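For example, applying the first case to two of the recurrences listed earlier: for merge sort, a = 2 and b = 2, so n^(log_b a) = n and f(n) matches the first case with k = 0; for binary search, a = 1 and b = 2, so n^(log_b a) = 1 and f(n) again matches the first case with k = 0: \begin{align} \text{Merge sort: } & T(n) = 2T(\frac{n}{2}) + O(n) & \Rightarrow T(n) = \Theta(n \; lg \; n) \\ \text{Binary search: } & T(n) = T(\frac{n}{2}) + O(1) & \Rightarrow T(n) = \Theta(lg \; n) \end{align}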
Here are solutions to two important recurrence relations that cannot be solved using the master theorem: \begin{align} \text{Recurrence relation} & & \text{Complexity} \\ T(n) = T(n-1) + b n^k & \Rightarrow & O(n^{k+1})\\ T(n) = cT(n-1) + b n^k \text{ (for } c > 1) & \Rightarrow & O(c^n) \end{align}
For recurrence relations of other forms, refer to CLR for the substitution and recursion-tree methods.
(In short, for the substitution method, you guess the solution and then use induction to find the constants and to prove that the guess is correct.
The recursion-tree method is often used to find the guess for the substitution method.)

Some useful approximations:

For non-negative real c and d (and b, c > 1 where they appear as the base of an exponential):
\begin{align} \text{Polynomial growth rate:}& \sum_{i=1}^n i^c &= \Theta(n^{c+1}) \\ \text{Logarithmic growth rate:}& \sum_{i=1}^n \dfrac{1}{i} &= \Theta(log(n)) \\ \text{Exponential growth rate:}& \sum_{i=1}^n c^i &= \Theta(c^n) \\ \text{n-Log(n) growth rate:}& \sum_{i=1}^n log(i)^c &= \Theta(n log(n)^c) \\ \text{Combination of above:}& \sum_{i=1}^n log(i)^c i^{d} &= \Theta(n^{d+1} log(n)^c) \\ &\sum_{i=1}^n log(i)^c i^d b^{i} &= \Theta(n^{d} log(n)^c b^{n}) \\ \text{Stirling's approximation:}& n! &= \sqrt{2 \pi n} (\frac{n}{e})^n (1 + \Theta(\frac{1}{n})) \end{align}

Thursday, September 29, 2011

Presentation on Online Aggregation for Large MapReduce Jobs

This post has two purposes:
1. Give information about my Comp 600 presentation at Rice University
2. Enumerate the lessons learnt while preparing and giving the presentation.

(If the above videos do not work, click on: Dailymotion or Vimeo)

Also, if you have time, go through my slides or better still read my paper :)
- Slides
- Paper


FAQs during the presentation at VLDB:


1. What if the blocks are equally sized and spread across uniformly ?
- We used exactly that configuration. The input blocks of the Wikipedia logs are approximately the same size, and HDFS does a good job of spreading the data uniformly. Still, we found biased results in the beginning.

2. Does your system replace general database ?
- Nope. As of now, it only performs single-table aggregation with group-by. For a general online database, see the DBO paper by our group.

3. Will you release your code ?
- Yes, we intend to release our code as open-source. For current version, see http://code.google.com/p/hyracks-online-aggregation/

4. Who do you see as potential user of the system ?
- Any company that has massive data stored in HDFS (logically as a single table) and wants to perform aggregation in an online fashion, or only up to a certain accuracy. Note that the table containing the data need not be pre-randomized.

5. What does user of the system needs to know ?
- Just how to write and run a Hadoop job ... All he/she has to do is write a Hadoop mapper and set appropriate parameters in the job configuration. For details, see the appendix of our paper.

6. You didn't explain too much statistics in the talk ... I want to know more
- Yes, you are absolutely right. In fact, that was intentional :) ... If I had to give a one-line summary of the statistics, here it is: sample from a multivariate normal over value, processing time and scheduling time (with correlation factors associated between each pair of them), taking into account the 'lower bound' information on the processing time of the blocks that are currently being processed ... rather than simply sampling from the distribution over the data.

7. You don't talk about performance ... Isn't randomized scheduling policy bad for performance ?
- We didn't talk about performance, mainly because we did not provide any comparative performance study in our paper. Note that our policy is not the same as a strict randomized policy. We expect the master to maintain a randomized queue and start the timer when a worker asks for a block. If for some reason (say, locality) the worker does not want that block, or the master has some internal (locality-based) scheduling algorithm, it can leave the timer on and go for the next block in the randomized queue. This is necessary machinery for the sampling theory. This will yield performance similar to locality-based scheduling. However, in our current implementation, we have not used this mechanism. We intend to do the performance study in future work.


Presentation tips:

Clearly my presentation has a lot of room for improvement, for example: I need to talk slower, use a more uniform layout for the slides, interact more with the audience, etc. However, I learnt a great deal while preparing and giving this presentation. I will summarize a few points that I (and probably a lot of graduate students) need to be careful about. For a more detailed analysis of these points, I suggest you read Presentation Zen, Slide:ology and When the Scientist Presents.

Step 1: Planning your presentation
1. Create a relaxing environment
- Go to a library, coffee place, garden (place where no one will disturb you)
- Take time to relax and forget about daily chores.
- Don't bring your laptop. In this step, you will create a rough sketch of the presentation on paper (not in slideware).
- A lot of the time, changing location helps you be more creative.

2. Make sure you understand the big picture of your presentation.
Can you find a 30-45 second answer to the following questions:
- What is the single most important "key" point that I want my audience to remember
- Why should that point matter to the audience

Also, make sure that your presentation title has keywords that highlight the big picture (as many people (if not all) will come to your talk having read just the title (and abstract) or your name). This does not mean the title has to be long; in fact, shorter, thought-provoking titles are preferred.

3. Now, create a sketch of the presentation. This involves enumerating the key ideas you will be presenting and also organizing/ordering them. You can use any or all of the following tools:
- Mindmap on a whiteboard or paper
- Post-it notes (use a sharpie so that you don't cram too much onto a single note)
- Brainstorming a list of ideas + grouping them
- Sketching or drawing 
Nancy Duarte says the most effective presentations involve telling a story that the audience can connect and relate to, rather than just presenting the information/data. Also, the best professors are good storytellers. They don't just go through the material in the book, but put their own personality, character and experience into the material in the form of a narrative, which is illuminating, engaging and memorable.

The most useful point I have learnt from Presentation Zen is that the presentation consists of three parts:
- Slides the audience will see (which you will create in Step 2)
- Notes only you will see (which you have created in this Step)
- Handouts that the audience can take away (so that they don't have to jot down key definitions or formulas). In the worst case, this could simply be a copy of your paper.
This essentially means that the slides are not a teleprompter that you read aloud from throughout your presentation.

Step 2: Design the slides
To succeed as a presenter, you need to think like a designer (from Slide:ology).

1. The first few slides are very, very important !!!
- I like to use them to tell the audience the big picture of the presentation.
- They also have to catch the audience's attention; as we all know, the attention span of an audience decreases exponentially after the first few minutes. Don't use the first few minutes reading through the 'Outline' slide.
- Use the following principle: tell them what you are going to present, present it, then tell them again what you presented :)

2. Slides should be as simple as possible.
- Don't use fancy, distracting background images. I prefer sticking to a simple white background.
- Good design has plenty of empty spaces.
- Avoid typical bullet point presentations. They are boring and they do more harm than good.
- Don't use 3D figures when 2D figures can suffice
- Never ever clutter the slides with useless data, words, figures, images, etc

3. 1 message per slide ... not more !!
- Do you understand that message ?
- Does your slide reflect that message ? You don't have to write the message out on the slide.
- Can I make this slide simpler and still make sure that the message is conveyed ? In most cases, this is done by replacing a bunch of text with an appropriate image or animation.
- A lot of the time, the entire message can be conveyed just by a quote from a credible or important person.

4. Try to find a good figure or animation to convey a complex idea, algorithm or system. Here are some tips on creating animation:
- Using the same slide background, create many slides (by copying). Change only one detail per slide. When you keep moving from one slide to the next, it gives the effect of an animation, something like a flip-book. However, this process creates a lot of slides (which is problematic in many cases). So, if you want to fit everything into one slide, you can create a flash object, gif file or avi file and embed it into the slide. Most people use Adobe tools to create professional-looking flash files, but here is the poor man's solution:
- Use the flip-book style slide-transition technique and a desktop recorder (e.g. recordmydesktop) to create a movie. It usually creates the movie in ogv format.
- Also, a lot of the time you need to create an animation from a statistical distribution. In that case, use the animation library of R and create the ogv movie using a desktop recorder. (This library also allows you to store the animation as a gif, but I am having some problems with it.)

# R code for animation
library(animation)
ani.options(interval = 1, nmax = 20)
sd1 <- 41
for (i in 1:ani.options("nmax")) {
    x=seq(900, 1100,length=200)
    y=dnorm(x,mean=1000,sd=sd1)
    plot(x,y,type="l",lwd=2,col="red")
    ani.pause() ## pause for a while ('interval')
    sd1 <- sd1 - 2
}
a. Convert OGV to AVI:
mencoder input.ogv -ovc xvid -oac mp3lame -xvidencopts pass=1 -speed 0.5 -o output.avi
b. Convert OGV to GIF (or swf):
ffmpeg -i input.ogv -loop_output 0 -pix_fmt rgb24 -r 10 -s 320x240 output.gif
"-loop_output 0" means loop forever.


5. Is the content tailored to meet the audience's expectations:
- Employers/future collaborators will try to evaluate you rather than the topic.
- Students (undergrads/first-year grads/...) or people from a different area want to understand the problem and the intuition behind the solution. Hence, they are more interested in the motivation section.
- Co-authors will look for recognition.
- Sponsors will look for acknowledgement.
- People who work in the same area will try to see how your system/solution compares to theirs (i.e. related work), or may be there to look for a new problem (i.e. your future work).

6. Other design tips:
- Use contrast (font type, font color, font size, shape, color, images) to accentuate differences. (If it is different, make it very different - Garr Reynolds). It helps the audience identify the message of the slide quickly.
- Repetition: Use the same image (or anchor) to represent the same concept throughout the presentation.
- Alignment: Use grids and invisible lines. The audience needs to know the order in which to process the information.
- Proximity
- Hierarchy, Unity
- Use high-quality graphics. There are websites that create and sell them, like www.istockphoto.com, or even give them away for free, like http://www.flickr.com/creativecommons, etc.
- Nothing distracts the audience more than a cluttered slide.
- Make sure your font size is greater than 24 points (so that the audience can read it from the projector).
- Use comics ... they not only convey information efficiently, but also make the presentation playful. Though many of my colleagues disagree with this, I believe most audiences want to have fun even if it is a research presentation ... so lighten the mood, be creative and "artsy" with your slides :)

"The more strikingly visual your presentation is, the more people will remember it. And more importantly they will remember you" - Paul Arden.
Here are some slides that are well designed:
http://www.slideshare.net/jbrenman
http://www.slideshare.net/chrislandry
http://www.slideshare.net/gkawasaki
http://www.slideshare.net/garr

Step 3: Rehearse and re-rehearse
- Use both notes and slides.
- Rehearse until you can remember the slides and are comfortable with them. Imagine the worst-case scenario: you lose your laptop, slides and all your notes ... and you still have to give your presentation !!! ... If you can do that, you are ready :)
- This is also where you do the job of editing: if you feel certain slides are just fillers and can be removed ... be brutal and delete them.
- Listen to http://www.ted.com/talks for motivation on how to present.

Step 4: Delivery
- Go to the presentation room 15 minutes before the scheduled time and make sure your slides are compatible, the projector works, ... Also, bring a bottle of water, just in case.
- Leave the lights on and face the audience. People are there to see you, not to read through the slides. Use a wireless presenter so that you don't need to hide behind the podium/monitor.
- Slides are not a teleprompter ... so don't keep staring at the monitor ... connect with the audience.

Remember these two quotes:
- Every presentation has potential to be great; every presentation is high stakes; and every audience deserves the absolute best - Nancy Duarte.
- You are presenting yourself rather than the topic. So, negative impact of poor presentation far exceeds failure of getting your point across ... Jean-Luc Lebrun

Wednesday, November 03, 2010

An Attempt to understand Machine Learning (Part 1)

I still remember the excitement when I took the "Theory of Automata" course in my undergrad and was introduced to the Turing machine. A few days later, after digging a little, I read about the Church-Turing thesis, which (roughly) states that every algorithm can be expressed in terms of a Turing machine. This was like saying ... well, you have a framework for understanding everything you will eventually learn in your Masters or PhD. However, mapping everything to a Turing machine kept getting more difficult as I started doing research, and particularly when I took a Machine Learning course. Most people come away with a different perception when they take a Machine Learning or Artificial Intelligence course.

Before I begin, let me make you aware that machine "learning" is not the same as human "learning". Humans learn in a way that is not fully understood by the scientific community. Let's say two people, X and Y, see a mango tree. X has never seen a mango tree before, so he will create a mental picture of it as a medium-sized tree with green leaves. Y, on the other hand, will create a mental image of the tree and associate it with a mental image of a mango, or maybe even with other senses like "it tastes like a mango shake". Any further discussion between X and Y will be based on their mental images and not reality. Note: the information in a mental image is less than the information in reality, and the reasons for this information loss are:
- We cannot store or process all the information in the world (due to lack of 'infinite' focus)
- Everyone filters information from reality according to their past experiences or beliefs.

Since it is virtually impossible for two people to have exactly the same experiences, it is impossible for two people to have exactly the same mental images or information. If we treat the mental image as an output (or even as an intermediate variable used to produce an output), there is no "common" theory/algorithm that will generate a common output for all humans on the same problem. Hence, this huge diversity (which makes life interesting) makes humans difficult to understand.

Most scientists (at least neurologists/psychiatrists/psychologists) use abstraction (i.e. weed out details and rely on a central theme, e.g. "tree") to specify a theory/algorithm that applies to everyone. Yet there is another problem with human learning: humans don't learn or process objects but patterns. Furthermore, humans use association (and not an exhaustive search through all patterns) to interpret a situation. This process of association is somewhat random, or at least extremely diverse (i.e. every person associates object A with object B using different sets of patterns, and these patterns change over time). Hence, it is extremely difficult to capture such diversity with a single unifying theory. So, machine learning is not about building a machine that will always pass the "Turing test" for all humans and in all situations (though that is the ultimate goal of Machine Learning). (Btw, if you don't know about the Turing test, read http://en.wikipedia.org/wiki/Turing_test.)

Having said that, it is also important not to be too humble while defining what Machine Learning is, especially if you compare every machine learning setting to a Turing machine. Remember, the assumptions for a Turing machine (which are not necessarily applicable to all Machine Learning problems) are:
- The entire input is available at the beginning of the computation.
- No state is maintained between executions of two (same or different) programs.
A particularly interesting case where it is very difficult to define the problem in terms of a Turing machine is Reinforcement Learning, which we will study later. The paper by van Leeuwen and Wiedermann that compares the Turing machine to contemporary computing (reference 1 below) would also be a nice read.

By now I have made two key points (or mental images :)):
1. Machine Learning != Human Learning
2. Machine Learning != Turing machine

References:
1. J. van Leeuwen and J. Wiedermann, "The Turing Machine Paradigm in Contemporary Computing".

Sunday, June 21, 2009

Shuffle functions

A very frequently asked question is "Write a program to shuffle a deck of cards". There are many approaches to this problem.

Almost all of them treat the deck of cards as a list and shuffle the list.

One approach is to emulate human shuffling: the Hindu shuffle or the riffle shuffle.

Other techniques are discussed in more detail on Wikipedia.

Here is an interesting but not optimal way (O(n log n), and not uniformly random) to shuffle a list:
In OCaml:
(* Sorts with a comparator that ignores its arguments and answers -1, 0 or 1
   at random. This does not give a uniform shuffle and relies on List.sort
   tolerating an inconsistent comparator; call Random.self_init () first if
   you want a different result on every run. *)
let shuffle l1 =
  let sortComparator _x _y =
    match Random.int 3 with
    | 0 -> 0
    | 1 -> -1
    | 2 -> 1
    | _ -> failwith "Error in shuffle"
  in
  List.sort sortComparator l1
;;

C++ Code (mirrors the OCaml version; note that a random comparator violates std::sort's strict-weak-ordering requirement, so this is only a curiosity -- prefer the Fisher-Yates sketch below):

#include <cstdlib>
#include <vector>
#include <algorithm>

template <class T>
bool sortComparator(T x, T y) {
    return (rand() % 2 == 0);
}

template <class T>
std::vector<T> shuffle(std::vector<T> cards) {
    std::sort(cards.begin(), cards.end(), sortComparator<T>);
    return cards;
}
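Since the interview question usually expects a uniform, linear-time shuffle, here is a minimal sketch of the standard Fisher-Yates (Knuth) shuffle; the function name and the use of rand() are my own choices for illustration:

#include <cstddef>
#include <cstdlib>
#include <vector>
#include <utility>

// Fisher-Yates (Knuth) shuffle: O(n), and every permutation is equally
// likely assuming an unbiased random source (rand() % i has a slight
// modulo bias, which is fine for a demo).
template <class T>
void fisherYatesShuffle(std::vector<T>& cards) {
    for (std::size_t i = cards.size(); i > 1; --i) {
        std::size_t j = rand() % i;        // pick an index in [0, i-1]
        std::swap(cards[i - 1], cards[j]); // fix position i-1, shrink the range
    }
}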

Saturday, May 30, 2009

Research Notebook

After experimenting a bit with Evernote, I decided to move to OneNote for the following reasons:
1. Evernote has no drawing tools. I like to draw and scribble a lot.
2. I get a free subscription to OneNote as a student, whereas I would need to pay for Evernote to get some additional features like image recognition, etc. Though I don't need them right now, it would be handy to have them.
3. OneNote allows you to integrate audio and video notes, whereas Evernote is simply text-based.
4. OneNote allows your notes to be arranged in hierarchical folders, whereas Evernote is one huge collection of notes with tags. Though I like the tagging feature (which OneNote also has), I don't think it is enough to organize my notes.

Though I think the synchronize feature in Evernote is awesome, I use the web export + publish-as-PDF features in OneNote maybe once a week. However, the web-editing feature in Evernote has no counterpart in OneNote, so I will have to live without it until Office 2010. (I do not think LiveMesh solves this problem, as I use Ubuntu on my lab machines.)

I have used the OneNote Web Exporter plugin to export my research notes for web viewing.

For now (until UF closes my CISE account), I will maintain my notes at:
http://www.cise.ufl.edu/~npansare/webview.html and PDF at http://www.cise.ufl.edu/~npansare/ResearchNotebook.pdf

If you are using Firefox, you need the Mozilla Archive Format add-on (since the export is in MHT format). However, it still does not display properly. I recommend IE for the web view; if you are a Linux user, view the PDF instead.

Sunday, April 26, 2009

My bookmarks, research papers, books

I use xmarks to synchronize my bookmarks since I have both Windows and Ubuntu machines: http://share.xmarks.com/folder/bookmarks/W6kXue0nMH

I am experimenting with Zotero: https://www.zotero.org/niketanpansare/21978/items

If you have missed my previous post:
I am using CiteULike for managing the papers I read: http://www.citeulike.org/user/niketan

Here is a list of books I would like to read: http://spreadsheets.google.com/pub?key=pRiihxKnNDBmNK4-OrOY0ZA

Friday, March 06, 2009

How to access the machine in CISE labs @ UF

Here is CISE's official webpage for accessing CISE public machines: http://www.cise.ufl.edu/help/access/index.shtml

Here is a quick summary:
To access the Linux or Solaris CISE machines, you can
SSH using "PuTTY" to thunder.cise.ufl.edu (or sand, rain, bay, shine, storm, ...)
or SCP using "WinSCP".

To access a Windows machine, you can Remote Desktop (or rdesktop) to new-mercury.cise.ufl.edu.

If you have an office in the CISE labs, you can access your machine using the following steps:
Note: You will need a CISE account and the password to the local machine. :)

You can replace 10.227.170.147 with your machine's IP address.
 
To be done once: install a VNC server and OpenSSH on your lab machine, and a VNC client on your home machine.

Start the VNC server on the lab machine (in the CISE lab):
1. Locally, using the command: vnc4server :n (where n is some number, e.g. :2)
or remotely, using these steps:
1. Open PuTTY
2. SSH to rain
3. SSH to 10.227.170.147 from rain
4. Run vnc4server :n (where n is some number, e.g. :2)

Then, on your home machine:
1. Open PuTTY
2. Go to Connections > SSH > Tunnels
   Source port: 5901
   Destination: 10.227.170.147:5901
3. Connect to rain.cise.ufl.edu and leave the session as it is
4. Open your VNC client and connect to 127.0.0.1:1
5. Enter the password in the small dialog box
6. Run gnome-session (if you get a bare terminal; you can skip this step by configuring the VNC server settings)
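If you are connecting from a Linux or Mac machine instead of using PuTTY, the same tunnel can be created with a single ssh command. This is only a sketch assuming the host names and ports above; your_cise_username is a placeholder:

# Forward local port 5901 through rain to the lab machine's VNC port
ssh -L 5901:10.227.170.147:5901 your_cise_username@rain.cise.ufl.edu
# Then point your VNC client at 127.0.0.1:1 (i.e., local port 5901)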

Sunday, September 14, 2008

GatorMemo version 3.0 is ready !!!

After I decided to do a PhD, I always wanted to improve my study process (especially my notes management system). After several discussions with my internship manager Raman Iyer and my friends Raja Nadar and Kishore Malani, I decided to go ahead with GatorMemo. It is built on the principles of SuperMemo (spaced repetition) and mind maps.

It started out as a Windows Forms application, which resulted in version 2.0. I felt the user interface was too complicated for a normal user, so I decided to build it again from scratch using WPF instead of WinForms. Functionally, there is no difference between the two versions (except that the WPF version cannot yet play audio/video files; I might add that as soon as I get time to work on it).

Suggestions:
1. Read a book, research paper, or go through lectures/podcast/videos, etc ...
2. Then, create a mind map after you have completely understood the topic. I usually refer to 3 or more books and then organize my thoughts into a mind map. I strongly suggest putting a huge amount of effort into creating a good mind map.
3. Now, create a card for each node. The front side of the card can be a question about the node (or simply the node name), and the back side can be the answer or a description of the topic that the node represents.
4. I recommend adding mnemonics and a bibliography to the mind map and cards. Mnemonics will help you recollect the subject matter, and the bibliography will be extremely useful if you are planning to write a research paper.
5. Periodically, revise the cards using Autolearn under the Action tab (the software uses the SuperMemo spaced-repetition algorithm developed by Piotr Wozniak; a rough sketch of the spacing rule is shown after this list).
6. I also suggest using lateral thinking, especially if you are planning to write research papers.
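For the curious, here is a minimal sketch of the SM-2 style spacing rule that SuperMemo-like tools use. This is my own simplified reconstruction for illustration, not GatorMemo's actual code; quality is the usual 0-5 self-rated recall score:

#include <algorithm>
#include <cmath>

// Simplified SM-2 style scheduling: given the recall quality (0-5), update
// the card's easiness factor and compute the next review interval in days.
struct CardState {
    double easiness = 2.5;  // SM-2 starts every card at EF = 2.5
    int repetition = 0;     // number of successful reviews in a row
    int interval = 1;       // days until the next review
};

void scheduleNextReview(CardState& card, int quality) {
    if (quality < 3) {
        // Failed recall: restart the repetition sequence.
        card.repetition = 0;
        card.interval = 1;
    } else {
        card.repetition += 1;
        if (card.repetition == 1)      card.interval = 1;
        else if (card.repetition == 2) card.interval = 6;
        else card.interval = (int) std::ceil(card.interval * card.easiness);
    }
    // Update the easiness factor (never below 1.3, as in SM-2).
    card.easiness = std::max(1.3,
        card.easiness + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02));
}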

Here is the link to the setup for version 3.0 (I encourage you to use this version):
http://niketan2.googlepages.com/GatorMemoSetup.msi

You can import the following file (see the Action tab) to see how the software works :)
http://niketan2.googlepages.com/OSMindMap.zip

It is still in beta, so expect bugs/exceptions :) ... (In fact, I am so busy with my studies/assignments that I did not have time for any corner-case testing.) Please mail me if you find bugs and help me improve the software.

If you are interested in improving the software, go ahead and download the code and modify it.

Friday, May 09, 2008

P versus NP

Never heard about it ?

It's an unsolved problem in computer science. In fact, if you can find a solution to it, you will get a million dollars. I am not kidding (http://www.claymath.org/millennium/).
Now aren't you curious ?

Here is the official problem description: http://www.claymath.org/millennium/P_vs_NP/Official_Problem_Description.pdf

I will try to explain it in a simpler way. It is an extremely tantalizing problem because it is very simple to understand, and the first time you hear it, it seems you could crack it within a day or two.

I will assume a few things: you know what an algorithm is and what worst, average and best case complexities (asymptotic notations) are. You may want to refer to Chapters 1, 2 and 3 of Introduction to Algorithms by Cormen et al.

Section 1: Lower bound
Say you are given a problem and you find an algorithm that solves it in O(n^2) time. What is the lower bound of your algorithm (or rather, of the problem -- hmmm, I will use "algorithm" and "problem" interchangeably)?
If your answer is O(n^2), you may want to read this section further; otherwise you can skip it and go to the next section.

A lower bound is not about a particular algorithm, but about all possible algorithms for solving that problem.
I know of two ways to find the lower bound of a problem:

1. Decision tree based approach:
Eg: Lower bound of Comparison based Sorting (See this: http://www.ics.uci.edu/~eppstein/161/960116.html)

In short, a comparison sort can be mapped to a decision tree (a binary tree of yes/no comparisons). A binary tree with k leaves has height at least log k. Since n numbers can be arranged in n! ways, the decision tree must have n! leaves. Therefore, the number of comparisons is at least log(n!) = n log(n) - O(n), using Stirling's formula (a short worked version of this step appears after this list).

2. Reduction:
Reduce a problem whose lower bound you already know to your problem (that's right ... not the other way round).
E.g., say you somehow prove that Sorting <= (O(n)) YourProblem, i.e. sorting is reducible to your problem with only O(n) extra work. Then you can state that the lower bound of your problem is n log n. This in essence means that your problem is at least as difficult as sorting. So, how do you prove this? Show, by some clever trick, that "sorting can be done using your algorithm". A classic example: n numbers can be sorted by mapping each number x to the point (x, x^2) and computing the convex hull of the resulting points, which is how the n log n lower bound for convex hull is usually proved.
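For reference, the Stirling step from the decision-tree argument above can be written out as follows (a standard derivation, not specific to this post):

\log_2(n!) = \sum_{i=1}^{n} \log_2 i \ge \int_1^n \log_2 x \, dx = n \log_2 n - \frac{n-1}{\ln 2} = n \log_2 n - O(n)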


Section 2: Definitions
Decision problems: problems whose output is either "YES" or "NO", depending on the values of some input parameters.

P: Class of decision problems that are "solvable" in polynomial time.

NP: Class of decision problems whose "YES" answers can be "verified" in polynomial time.
E.g.: (the decision version of) sorting is in NP, since given a list we can verify whether it is sorted in polynomial time. Sorting is also in P, since it is solvable in polynomial time (O(n log n)).
In fact, every P problem is also in NP (to verify a P problem, we can simply solve it in polynomial time and compare the answers), i.e. P is contained in NP. Whether there exists any NP problem that is not in P is yet unproven.
Trivial solution: every NP problem can be solved in exponential time, i.e. by simply enumerating and checking all candidate certificates.
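To make "verified in polynomial time" concrete, here is a minimal sketch (my own illustration, not from the post) of a verifier for the Subset-Sum problem: the certificate is simply a proposed subset, and checking it takes roughly linear time even though finding it is the hard part:

#include <vector>
#include <unordered_map>

// Subset-Sum verifier: does the proposed subset (the certificate) use only
// elements of 'numbers' (respecting multiplicity) and sum to 'target'?
bool verifySubsetSum(const std::vector<int>& numbers,
                     const std::vector<int>& certificate,
                     long long target) {
    std::unordered_map<int, int> available;
    for (int x : numbers) available[x]++;

    long long sum = 0;
    for (int x : certificate) {
        if (available[x] == 0) return false;   // element not in the input set
        available[x]--;
        sum += x;
    }
    return sum == target;
}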

NP-hard: If every problem L in NP is reducible to L1 in polynomial time, then L1 is NP-hard.
NP-Complete: A problem is NP-Complete iff it is both NP and NP-hard.

A Venn diagram of P, NP, NP-hard and NP-complete gives the big picture.

co-NP: If the complement of your problem is in NP, then your problem is in co-NP.
E.g.: NP - given a graph, is there a path of length less than k from some source node s to a destination node t? co-NP - given a graph, are all paths from s to t of length at least k?
Again, whether NP = co-NP is unknown.

Oracle: Consider the Oracle to be a GOD who knows everything and whom you can trust (so computer scientists do believe in GOD ... maybe not in the religious sense, but perhaps to some extent).

We "pray" to GOD to give guidance. Similarly, we ask Oracle to help us with the problem.

GOD gives us omens. The Oracle gives us a certificate (an input sequence that will probably reveal the "TRUTH"). So you try to verify the certificate with your polynomial-time verification algorithm (assuming the problem is in NP).

Now here is a trick:

If the answer to that certificate is "YES" ... cool ... you have solved the problem !!!

If the answer is "NO" .... you ask for a new certificate from the Oracle.

(You could argue that since we trust the Oracle, we should just output "NO". But we don't, because the Oracle may hand us any one of many possible certificates; this guessing is nondeterministic, hence NP => nondeterministic polynomial.)

Section 3: The real stuff
The biggest breakthrough in this area: Stephen Cook proved that the Boolean Satisfiability problem (SAT) is NP-complete.

So, if you can prove either of these things, you probably don't have to work any more for your thesis or a Fields Medal:
1. SAT has a polynomial-time algorithm
2. SAT has a super-polynomial (i.e. non-polynomial) lower bound

If you prove 1 => you have proven P = NP
If you prove 2 => you have proven P != NP

OK, let me make things a little easier for you. Instead of SAT, you can do the above for any one of the many known NP-complete problems.

And yippee ... you will be more popular than Donald Knuth in the field of Computer Science.

I would encourage you to read "Computers and Intractability: A Guide to the Theory of NP-Completeness" (Garey and Johnson), or at least the NP-completeness chapter in "Introduction to Algorithms" by Cormen et al., to see how reductions actually work. Some of them are really cool and mind-boggling techniques.


Monday, April 07, 2008

Random C++ tricks - GDB, Valgrind and Makefile

1. Perfect programming is hard - even the best programmers make mistakes.
2. A line of code that you haven't tested is a line of code with a bug.

Cardinal rule for debugging: If you are using gcc or g++, compile your program with the -g option.
(Btw, to link a library with gcc, you use the -l option. E.g., to link the math library, use:
gcc filename -o outputfile -lm)

Segmentation Faults:

Any access outside the memory area assigned to the program will cause a segmentation fault.

Common causes of segmentation faults:
1. Dereferencing
a. NULL
b. an uninitialized pointer
c. a pointer that has been freed (or deleted, in C++) or that has gone out of scope (in the case of arrays declared in functions)
2. Writing off the end of an array
3. Calling a recursive function that uses up all of the stack space (on some systems, this is reported as a "stack overflow")

Attacking segfaults:

1.a Dereferencing NULL

$gdb myPgm core
The core file contains all the information needed by GDB to reconstruct the state of execution when the invalid operation caused a segmentation fault.

It will output something like:
Some copyright info
Core was generated by `myPgm'.
Program terminated with signal 11, Segmentation fault.
Some information about loading symbols
#0 0x0804838c in foo() () at myPgm.cpp:4
4 *x = 3;
This tells us that the execution stopped inside the function called foo() on line 4, which happened to be the assignment of the number 3 to the location pointed to by x.
(gdb) list
1 void foo()
2 {
3 char *x = 0;
4 *x = 3;
5 }
6
7 int main()
8 {
9 foo();
10 return 0;
(gdb) print x
$1 = 0x0

A slightly more complicated example: the execution crashes inside a system call or library function (perhaps because we passed an uninitialized pointer):
#0 0x40194f93 in strcat () from /lib/tls/libc.so.6
(gdb) bt
#0 0x40194f93 in strcat () from /lib/tls/libc.so.6
#1 0x080483c9 in foo() () at t.cpp:6
#2 0x080483e3 in main () at t.cpp:11
So now we are in strcat's stack frame, but we want to see the variable that was passed to strcat:
(gdb) up
#1 0x080483c9 in foo() () at t.cpp:6
6 strcat(x, "end");
(gdb) print x
$1 = 0x0
A common mistake is not checking the return value of malloc to make sure that the system isn't out of memory. Another common mistake is to assume that a function that calls malloc never returns NULL, even though it returns the result of malloc. Note that in C++, when you call new, it will throw a bad_alloc exception if sufficient memory cannot be allocated. Your code should be prepared to handle this situation cleanly, and if you choose to catch the exception and return NULL inside a function that ordinarily returns a newly allocated pointer, this advice still holds.

1.b Dereferencing an uninitialized pointer
Figuring out whether or not a pointer has been initialized is a bit harder than figuring out whether a pointer is NULL. The best way to avoid using an uninitialized pointer is to set your pointers to NULL when you declare them (or immediately initialize them).

You might need to figure out whether an address like 0x4025e800 is valid memory. One way you can get a sense of this in GDB is by printing out the addresses stored in other pointers you've allocated. If they're fairly close together, you've probably correctly allocated memory. Of course, there's no guarantee that this rule of thumb will hold on all systems.

In some cases, your debugger can tell you that an address is invalid based on the value stored in the pointer.
(gdb) print x
$1 = 0x1e <out of bounds>
(gdb) print *x
Cannot access memory at address 0x1e
1.c Dereferencing Freed Memory
This is another tricky bug to find because you're working with memory addresses that look valid. The best way to handle such a situation is again preventative: set your pointer to point to NULL as soon as you've freed it.

Another form of this bug is the problem of dealing with memory that has gone out of scope.
char *return_buffer()
{
    char x[10];
    strncpy(x, "a string", sizeof(x));
    return x;   /* BUG: x is a local array that goes out of scope here */
}
This is a really tricky bug to find because once again the memory address will look valid when you print it out in GDB.

If that pointer is causing you trouble, check the function and look at whether the pointer points to a local variable in the function. Note that it is perfectly fine to return a pointer to memory allocated in the function using new or malloc (a sketch of such a fix follows), but not to return a pointer to a statically declared array (e.g., char x[10]).
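For completeness, here is one possible heap-allocated fix of the return_buffer example above (just a sketch; the caller becomes responsible for freeing the result):

#include <cstring>
#include <cstdlib>

// Allocate on the heap so the buffer outlives the function.
char *return_buffer()
{
    char *x = (char *) malloc(10);
    if (x != NULL)
        strncpy(x, "a string", 10);   // "a string" plus '\0' is 9 bytes, fits in 10
    return x;                         // caller must free() this pointer
}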

2. Writing off the end of an array
Valgrind will help you catch this bug for dynamically allocated arrays. See point 2 in the Valgrind section below.

Trick:
If you notice that some of your variables' values are changing periodically and unexpectedly, this is a tough bug to crack; one option is to set up your debugger to watch a variable for changes and run your program until the variable's value changes. Your debugger will break on that instruction, and you can poke around to figure out whether that behavior is unexpected.
(gdb) watch [variable name]
Hardware watchpoint 1: [variable name]
(gdb) continue
...
Hardware watchpoint 1: [variable name]

Old value = [value1]
New value = [value2]
This approach can get tricky when you're dealing with a lot of dynamically allocated memory and it's not entirely clear what you should watch.

3. Stack overflow:
To diagnose a stack overflow in GDB, typically you just need to do a backtrace. If you find a single function call piling up an awfully large number of times, this is a good indication of a stack overflow.
(gdb) backtrace
#0 foo() () at t.cpp:5
#1 0x08048404 in foo() () at t.cpp:5
#2 0x08048404 in foo() () at t.cpp:5
#3 0x08048404 in foo() () at t.cpp:5
[...]
#20 0x08048404 in foo() () at t.cpp:5
#21 0x08048404 in foo() () at t.cpp:5
#22 0x08048404 in foo() () at t.cpp:5
---Type <return> to continue, or q <return> to quit---
Typically, you need to analyze your recursive function to make sure that all the base cases (the cases in which the function should not call itself) are covered correctly.

Other cool tricks with gdb:
We discussed bt, p (print), list, up

To set a breakpoint, type break <line number>
<line number> can be <sourcefile>:<line number>

Use n (or next) to step through the code one line at a time, r to run the program, and c to continue.

There is also another set of commands, although they are not very commonly used; these commands actually modify the program as it is running. The call command, for example, will call a particular function; the set variable command will set a variable in the program to a particular value, for example, set variable i = 0. The return command will make the current function return to the caller.
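For example, a short interactive session combining these commands might look like the following (the program name and line number are placeholders):

$ gdb myPgm
(gdb) break myPgm.cpp:10
(gdb) run
(gdb) next
(gdb) print x
(gdb) set variable x = 0
(gdb) continue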

Using valgrind:

1. Finding memory leaks:
Compile your program with -g option.
$valgrind --tool=memcheck program_name
If the number of allocs and the number of frees differ in the output, then try

$valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

The output looks something like:
==pid== 100 bytes in 1 blocks are definitely lost in loss record 1 of 1
==pid== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==pid== by 0x804840F: main (myPgm.c:5)
This means that there is a memory leak at line 5 in myPgm.c
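For reference, here is a minimal program (a hypothetical example in the spirit of myPgm.c, not from the original post) that would produce a report like the one above:

#include <stdlib.h>

int main()
{
    char *buffer = (char *) malloc(100);   /* 100 bytes allocated ...            */
    buffer[0] = 'a';                       /* ... used ...                       */
    return 0;                              /* ... but never freed: 100 bytes leaked */
}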

2. Finding invalid pointers:
Warnings of the form:
'Invalid read of size X' or 'Invalid write of size X'
where X is the number of bytes we try to read or write.

Eg: if you allocate an array with malloc or new and then try to access a location past the end of the array:
#include <stdlib.h>

int main()
{
    char *x = malloc(10);
    x[10] = 'a';   /* valid indices are 0..9 */
    return 0;
}

$valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name
This will give output like:
==pid== Invalid write of size 1
==pid== at 0x804841E: main (myPgm2.c:6)
==pid== Address 0x1BA3607A is 0 bytes after a block of size 10 alloc'd
==pid== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==pid== by 0x804840F: main (myPgm2.c:5)
3. Detecting the use of uninitialized variables:
Warning: 'uninitialised' value(s)

Eg:
#include <stdio.h>

void foo(int x)
{
    if (x < 10)
    {
        printf("x is less than 10\n");
    }
}

int main()
{
    int y;
    foo(y);   /* y is used before being initialized */
    return 0;
}

Output:
==pid== Conditional jump or move depends on uninitialised value(s)
==pid== at 0x8048366: foo (myPgm3.c:5)
==pid== by 0x8048394: main (myPgm3.c:14)
4. Other improper use of memory:

4.a: If you call free twice on the same pointer value, Valgrind will detect this for you; you'll get an error:
Invalid free()
This is also known as a "double free" (the basis of double-free attacks):
free(x);
/* code */
free(x);
The easiest way to avoid it is simply to set your pointer to point to NULL once you've freed it:
free(x); x = NULL;
/* code */
free(x);

4.b: Every free should be matched with a corresponding malloc. Similarly, delete with new and delete[] with new[].
If you mismatch them, you will get the following error:
Mismatched free() / delete / delete []
What won't Valgrind find?
Valgrind doesn't perform bounds checking on static arrays (allocated on the stack).
Eg:
int main()
{
char x[10];
x[11] = 'a';
}
Valgrind won't alert you!

Also, if you don't test for buffer overflows by using long input strings, Valgrind won't tell you that your code is capable of writing over memory that it shouldn't be touching.

------------------------

Some common bugs in C/C++:
1. Uninitialized variables
2. Using a single equal sign to check equality:
while(x='Y') { ....}   /* assigns 'Y' to x (always true) instead of comparing */
3. Extra semicolons:
int x;
for(x=0; x<100; x++);   /* this stray semicolon is the entire loop body */
Do something with x;
4. Overstepping array boundaries (especially if it is a static array):
int array[10];
//...
for(int x=1; x<=10; x++) do something with array[x];   // array[10] is out of bounds (valid indices are 0..9)

-------------------------
Makefile: Simple example (note that each recipe line must begin with a tab):
CC = gcc
FILES = in_one.c in_two.c
OUT_EXE = out_executable

build: $(FILES)
	$(CC) -o $(OUT_EXE) $(FILES)

clean:
	rm -f *.o core

rebuild: clean build

Advanced examples:
1. $@ is the name of the target.
client: client.c
	$(CC) client.c -o $@

server: server.c
	$(CC) server.c -o $@


Automatic variables for the list of dependencies:
$? - the dependencies that are more recent than the target (i.e., those that have changed since the last time make was invoked for the given target)
$^ - all dependencies, regardless of whether they are more recent than the target, with duplicate names removed
$+ - like $^, but it keeps duplicates
$< - the first dependency

Wildcard Matching in Targets (percent sign, %)
When a % appears in a target and its dependency list (a pattern rule), it matches the same stem of text in both. If you wish to use the matched text in the commands, use the special variable $*.
Eg:
%.c:
	gcc -o $* $*.c
Implicit Targets
.c:
	$(CC) -o $@ $@.c
This rule says that for any target that corresponds to a .c file, make should compile it using the name of the implicit target as the output, and the name plus the .c extension as the file to compile.

Replacing Text
To specify that OBJ is equivalent to SRC, except with the .c extension replaced with a .o extension:
OBJ = $(SRC:.c=.o)
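Putting these pieces together, a small hypothetical makefile combining the substitution, a pattern rule and the automatic variables might look like this (again, recipe lines must start with a tab):

SRC = main.c util.c
OBJ = $(SRC:.c=.o)
CC = gcc
CFLAGS = -g -Wall

# $@ is the target (app), $^ is the full list of dependencies
app: $(OBJ)
	$(CC) -o $@ $^

# Pattern rule: $< is the first dependency (the matching .c file)
%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

clean:
	rm -f app $(OBJ)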

Visit CProgramming.com for more details