Showing posts with label Programming. Show all posts

Thursday, November 30, 2017

How to debug the unicode error in Py4J

Many Java-based big data systems (SystemML, Spark's MLlib, CaffeOnSpark, etc.) that use the Py4J bridge in their Python APIs often throw a cryptic error instead of the detailed Java stack trace:
py4j.protocol.Py4JJavaError: < exception str() failed >
For more details on this issue, please refer to Python's unicode documentation.

Luckily, there is a very simple hack to extract the stack trace in such a situation. Let's assume that SystemML's MLContext is throwing the above Py4J error when you call ml.execute(script). In this case, simply wrap that code in the following try/except block:
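A minimal sketch of that try/except block (hedged: `ml` and `script` are the SystemML objects from the text; `java_exception`, `toString()` and `getStackTrace()` come from Py4J's `Py4JJavaError` and the underlying `java.lang.Throwable`; the helper name is mine):

```python
def java_stacktrace(py4j_error):
    """Pull the Java stack trace out of a Py4JJavaError without calling
    str() on it -- str() is exactly what fails with the unicode error."""
    jexc = py4j_error.java_exception              # the underlying java.lang.Throwable
    lines = [jexc.toString()]                     # e.g. "java.lang.RuntimeException: ..."
    lines += ['\tat ' + f.toString() for f in jexc.getStackTrace()]
    return '\n'.join(lines)

# Wrap the failing call:
# try:
#     ml.execute(script)
# except py4j.protocol.Py4JJavaError as e:
#     print(java_stacktrace(e))
```

Since the helper only calls methods on the Java exception object, it sidesteps Python's attempt to str()-convert a message containing non-ASCII characters.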

Monday, October 16, 2017

Git notes for SystemML committers

Replace niketanpansare below with your GitHub username:

1. Fork and then clone the repository:
git clone https://github.com/niketanpansare/systemml.git

2. Update the configuration file: systemml/.git/config
[core]
symlinks = false
repositoryformatversion = 0
filemode = false
logallrefupdates = true
[remote "origin"]
url = https://github.com/niketanpansare/systemml.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[remote "upstream"]
url = https://github.com/apache/systemml.git
fetch = +refs/heads/*:refs/remotes/upstream/*
[remote "apache"]
url = https://gitbox.apache.org/repos/asf/systemml.git
fetch = +refs/heads/*:refs/remotes/apache/*

In the above configuration, "origin" points to your fork, "upstream" points to the GitHub mirror of the main repository, and "apache" points to the canonical Apache-hosted repository.

Optional but recommended: Update the ~/.gitconfig file:
[user]
    name = Niketan Pansare
    email = npansar@us.ibm.com
[alias]
    systemml-pr = "!f() { git fetch ${2:-upstream} pull/$1/head:pr-$1 && git checkout pr-$1; }; f"

3. Now, let's say you have created a branch 'foo' with 3 commits. You can create a PR via the GitHub website.

4. Once the PR is approved, and if you are a committer, you can merge it:
git checkout foo
git reset --soft HEAD~3 && git commit
git pull --rebase apache master
git checkout master
git pull apache master
git merge foo
gitk
git log apache/master..
git commit --amend --date="$(date +%s)"
git push apache master
Let's discuss the above commands in more detail:
- git checkout foo: ensures that you are working on the right branch.
- git reset --soft HEAD~3 && git commit: squashes all your commits into one single commit in preparation for the merge.
- git pull --rebase apache master: rebases your squashed commit on top of the latest apache master.
- git checkout master: switches to the local master branch.
- git pull apache master: brings your local master branch up to date with apache.
- git merge foo: merges the branch foo into local master.
- gitk: lets you double-check visually that local and remote branches are in sync and that the commit is not creating weird loops.
- git log apache/master..: double-check that the log makes sense, that you have added "Closes #PR." at the end of the message, and that the correct JIRA number is in the header.
- git commit --amend --date="$(date +%s)": if not, this lets you modify the commit message (and refresh the commit date).
- git push apache master: go ahead and push your commit to apache :)

5. If you are a committer and want to merge a PR (let's assume PR 100), you can create and check out a local branch using the following commands:
git systemml-pr 100
git checkout pr-100
Also, if you had to squash the commits as described above, please use the below command to amend the author:

git commit --amend --author="FirstName LastName <contributor@blah.com>"

6. If you have modified the docs/ folder, you can update the documentation:
git subtree push --prefix docs apache gh-pages

Before updating the documentation, you can sanity-test on your branch by:
git subtree push --prefix docs origin gh-pages

If the above command fails, you can delete the gh-pages branch of your fork and re-execute the command:
git push origin --delete gh-pages

Advanced commands:
1. Cherry-picking:
git log --pretty=format:"%h - %an, %ar : %s" | grep Niketan
git checkout gpu
git cherry-pick <commit-hash>


Notes from other committers:
Deron: https://gist.github.com/deroneriksson/e0d6d0634f3388f0df5e#integrate-pull-request-with-one-commit-into-apache-master
Mike: https://gist.github.com/dusenberrymw/78eb31b101c1b1b236e5

Tuesday, February 17, 2015

Plotting as a useful debugging tool

While writing code for Bayesian modeling, you will have to test the distributions involved (prior, likelihood or posterior). Here are two common scenarios that you might encounter:
  1. You want to test a function that generates random deviates. For example: you have derived a conjugate formula for a parameter of your model and want to test whether it is correct or not.
  2. You want to test a probability density function. For example: a likelihood or posterior function that you might want to run a rejection sampler on.
In both these cases, you will start by making sure the properties of the distribution (for example: range, mean, variance) are correct. For example: if the parameter you are sampling is a variance, then you will have “assert(returnedVariance > 0)” in your code. The next obvious test should be visual inspection (trust me, it has caught more bugs for me than traditional debugging techniques/tools): plot the distribution and see if the output of your code makes sense.
We will start by simplifying the above two cases by assuming a standard normal distribution. So, in the first case we have access to R's “rnorm” function, and in the second case to its “dnorm” function.
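Before turning to the plots, the numeric property checks from the first step can be sketched as follows (Python/NumPy here purely for illustration — the examples below use R; the seed and tolerances are arbitrary choices of mine):

```python
import numpy as np

# Draw 10,000 standard-normal deviates (analogous to rnorm(10000) below)
rng = np.random.default_rng(42)
vals = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Assert the distribution's basic properties before eyeballing any plot
assert abs(vals.mean()) < 0.05               # mean of N(0,1) is 0
assert abs(vals.var() - 1.0) < 0.05          # variance of N(0,1) is 1
assert -6 < vals.min() and vals.max() < 6    # range sanity check
```

If your sampler for a derived conjugate distribution fails such cheap checks, there is no point plotting it yet.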
Case 1: In this case, we first collect random deviates (in “vals”) and then use ggplot to plot them:
library(ggplot2)
vals = rnorm(10000, mean=0, sd=1)
df = data.frame(xVals=vals)
ggplot(df, aes(x=xVals)) + geom_density()

The output of the above R script is an estimated density curve of the standard normal (figure omitted).
Case 2: In this case, we have to assume a bounding box and sample inside that to get “x_vals” and “y_vals” (just like rejection sampling):
library(ggplot2)
x_vals=runif(10000,min=-4,max=4)
y_vals=dnorm(x_vals, mean=0, sd=1)
df = data.frame(xVals=x_vals, yVals=y_vals)
ggplot(df, aes(x=xVals, y=yVals)) + geom_line()

The output of the above R script is the standard normal density traced over [-4, 4] (figure omitted).
As a teaser for a later post: we can use a script similar to that of case 1 to study the characteristics of a distribution.

How to setup “passwordless ssh” on Amazon EC2 cluster

Often, for running distributed applications, you may want to set up a new cluster or tweak an existing one (running on Amazon EC2) to support passwordless ssh. For creating a new cluster from scratch, there are a lot of cluster management tools (which are beyond the scope of this blogpost). However, if all you want to do is set up “passwordless ssh” between nodes, then this post might be worth your read.
The script below assumes that you have completed the following three steps:

Step 1. Created an RSA public keypair on each of the machines:
cd ~
ssh-keygen -t rsa

Do not enter any passphrase; instead just press [enter].

Step 2. Suppressed warning flags in the ssh config file:
sudo vim /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
 
Step 3. Copied the key pair file “MyKeyPair.pem” to the master's home directory:
scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@ec2-master-public-address.compute-1.amazonaws.com:~

Assuming that the above three steps have been completed, run this script on the master to enable passwordless ssh between master-slave and/or slave-slave nodes:


    #!/bin/bash
    # Author: Niketan R. Pansare
     
    # Colors used in the interactive prompts below
    green=$(tput setaf 2)
    endColor=$(tput sgr0)
     
    # Password-less ssh
    # Make sure you have transferred your key-pair to master
    if [ ! -f ~/.ssh/id_rsa.pub ]; then
      echo "Expects ~/.ssh/id_rsa.pub to be created. Run ssh-keygen -t rsa from home directory"
      exit
    fi
     
    if [ ! -f ~/MyKeyPair.pem ]; then
      echo "For enabling password-less ssh, transfer MyKeyPair.pem to master's home folder (e.g.: scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@master_public_dns:~)"
      exit
    fi
     
    echo "Provide following ip-addresses"
    echo -n -e "${green}Public${endColor} dns address of master:"
    read MASTER_IP
    echo ""
     
    # Assumption here is that you want to create a small cluster ~ 10 nodes
    echo -n -e "${green}Public${endColor} dns addresses of slaves (separated by space):"
    read SLAVE_IPS
    echo "" 
     
    echo -n -e "Do you want to enable password-less ssh between ${green}master-slaves${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH" == "y" ]; then
      # Copy master's public key to itself
      #cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     
      for SLAVE_IP in $SLAVE_IPS
      do
          echo "Checking passwordless ssh between master -> "$SLAVE_IP
        IS_PASSWORD_LESS="n"
        ssh -o PasswordAuthentication=no $SLAVE_IP /bin/true
        if [ $? -eq 0 ]; then
            echo "Passwordless ssh has been setup between master -> "$SLAVE_IP
            echo "Now checking passwordless ssh between "$SLAVE_IP" -> master"
     
          ssh $SLAVE_IP 'ssh -o PasswordAuthentication=no' $MASTER_IP '/bin/true'  
          if [ $? -eq 0 ]; then
              echo "Passwordless ssh has been setup between "$SLAVE_IP" -> master"
            IS_PASSWORD_LESS="y"
          fi
        fi
     
        if [ "$IS_PASSWORD_LESS" == "n" ]; then
          # ssh-copy-id gave me lot of issues, so will use below commands instead
          echo "Enabling passwordless ssh between master and "$SLAVE_IP
     
          # Copy master's public key to slave
          cat ~/.ssh/id_rsa.pub | ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'mkdir -p ~/.ssh ; cat >> ~/.ssh/authorized_keys' 
          # Copy slave's public key to master
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub' >> ~/.ssh/authorized_keys
          # Copy slave's public key to itself
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
     
        fi    
      done
     
      echo ""
      echo "---------------------------------------------"
      echo "Testing password-less ssh on master -> slave"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP  uname -a
      done
     
      echo ""
      echo "Testing password-less ssh on slave -> master"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP 'ssh ' $MASTER_IP 'uname -a'
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: It's quite possible the slave does not contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi
     
    echo -n -e "Do you want to enable password-less ssh between ${green}slave-slave${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH1
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH1" == "y" ]; then
      if [ "$ENABLE_PASSWORDLESS_SSH" == "n" ]; then
        echo -n -e "In this part, the key assumption is that password-less ssh between ${green}master-slave${endColor} is enabled. Do you still want to continue (y/n):"
        read ANS1
        if [ "$ANS1" == "n" ]; then 
          exit
        fi
        echo ""
     
      fi
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          if [ "$SLAVE_IP1" != "$SLAVE_IP2" ]; then
            # Checking assumes passwordless ssh has already been setup between master and slaves
              echo "[Warning:] Skipping checking passwordless ssh between "$SLAVE_IP1" -> "$SLAVE_IP2
            IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
     
            # This will be true because ssh $SLAVE_IP1 is true
            #ssh $SLAVE_IP1 ssh -o PasswordAuthentication=no $SLAVE_IP2 /bin/true
            #if [ $? -eq 0 ]; then
     
     
            if [ "$IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET" == "n" ]; then
              echo "Enabling passwordless ssh between "$SLAVE_IP1" and "$SLAVE_IP2
              # Note you are on master now, which we assume to have ~/MyKeyPair.pem
              ssh -i ~/MyKeyPair.pem $SLAVE_IP1 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/MyKeyPair.pem $SLAVE_IP2 'cat >> ~/.ssh/authorized_keys' 
            else
              echo "Passwordless ssh has been setup between "$SLAVE_IP1" -> "$SLAVE_IP2
            fi
          fi
        done
     
      done
     
      echo "---------------------------------------------"
      echo "Testing password-less ssh on slave <-> slave"
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          # Also, test password-less ssh on the current slave machine
          ssh $SLAVE_IP1 'ssh ' $SLAVE_IP2 'uname -a'
        done
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: It's quite possible the slave does not contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi


Here is a sample output obtained by running the above script (src code available via link):
ubuntu@ip-XXX:~$ ./enablePasswordlessSSH.sh
Provide following ip-addresses

Public dns address of master:ec2-XXX-29.compute-1.amazonaws.com

Public dns addresses of slaves (separated by space):ec2-XXX-191.compute-1.amazonaws.com ec2-XXX-240.compute-1.amazonaws.com ec2-XXX-215.compute-1.amazonaws.com ec2-XXX-192.compute-1.amazonaws.com ec2-XXX-197.compute-1.amazonaws.com

Do you want to enable password-less ssh between master-slaves (y/n):y

Checking passwordless ssh between master -> ec2-XXX-YYY.compute-1.amazonaws.com
a bunch of warnings and possibly a few "Permission denied (publickey)." messages (Ignore these !!!)

---------------------------------------------
Testing password-less ssh on master -> slave
Don't ignore any error here !!!
---------------------------------------------
Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n

Do you want to enable password-less ssh between slave-slave (y/n):y

---------------------------------------------
Testing password-less ssh on slave <-> slave
Don't ignore any error here !!!
---------------------------------------------

Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n


Warning: The above code does not check for passwordless ssh in the slave-slave configuration; it sets it up blindly even if passwordless ssh is already enabled. That might not be a big deal if you call this script a few times, but it won't be ideal if the cluster needs to be modified dynamically over and over again. To change this behavior, look at the line: IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
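One hedged way to replace that blind assignment with an actual probe (a sketch, assuming the same variable names as the script; the `-o ConnectTimeout=5` option is my addition so that an unreachable host fails fast instead of hanging):

```shell
# Probe slave1 -> slave2 using key-based auth only; success => already set up.
if ssh -i ~/MyKeyPair.pem "ubuntu@$SLAVE_IP1" \
     "ssh -o PasswordAuthentication=no -o ConnectTimeout=5 $SLAVE_IP2 /bin/true" 2>/dev/null
then
  IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="y"
else
  IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
fi
```

With this in place, the inner loop would only append keys for slave pairs where the probe actually fails.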

Saturday, October 12, 2013

Notes on entrepreneurship (Part 2)

In the first class, we divided ourselves into random groups of four people (mostly from different backgrounds). The idea was to pitch a startup idea within 10 minutes based on the means of our group members. This exercise was then repeated with a different set of people. One thing I noticed after this exercise was that even though it seems like a nice idea to start with your means and build a startup based on that, people usually revert to the idea that they are absolutely passionate about, irrespective of their means [8].

As an assignment that week, we were supposed to take $5 and, in the span of two hours, make as much money as we could. My group decided to go with "Grocery delivery service for professors", and each of us was supposed to ask professors in our department to help with that. I got to speak to only a few professors in the Computer Science department (as it was Friday) and only Swarat was on board with the idea. My team-mates were unable to get any professors from their respective departments, so they decided to go with another idea: "Personalized cards and their delivery". We made $10 in tips with the first idea and ~$13 with the second. The lesson from this exercise was that sales is hard, but probably the most important part of a startup [9].

In the next class, each of us gave an elevator pitch of our idea for a startup. Let me put my idea in the context of the previous post:
1. Means:
- What do I know? Background in software development, research experience in building large-scale systems, and knowledge of embedded systems.
- What do I have? Very low capital ... student salary :(
- Whom do I know? Software professionals in India and the US (from my bachelors, masters and job experience), marketing/advertising professionals in India (my father's advertising firm and my MBA friends), a trustworthy partner in India (my best friend and brother Rohan), research scientists (my advisor, my colleagues at Rice University [3], my collaborators, contacts from internships and also from my experience as President of the Rice Computer Science Graduate Student Association). Other than a couple of exceptions, I have so far been lucky enough to be surrounded by really nice people, which is why I believe this to be my strongest means [4].

2. Ideation:
In the previous blogpost, I vented my frustration over sexual assault cases in India as well as provided high-level suggestions (which I must admit I had no control over). So, I decided to use my means to develop something that might help improve the situation (in whatever little way possible).
Unlike the US, the commonly-accepted safety net in India is not the government/police but the social structure (i.e., your friends and family). Many personal safety solutions like pepper spray, tasers or guns require state licenses in many countries and are even prohibited in a few countries like Canada, China, Bangladesh, Singapore, Belgium, Denmark, Finland, Greece, Hungary, etc. Not to mention they are expensive and can often escalate the situation. This is why many people follow a self-imposed curfew, that is, either not leaving home after dark or having someone accompany them.
With the advent of smartphones, an obvious solution seems to be "have a personal safety app". Most of these apps have cool features, such as sending your GPS location to your friends and family, but have a major flaw that makes them non-functional: they require you to take out your phone, open the app and press an SOS button in it. Clearly, this is not what an average person would do in times of danger. Here is some of the feedback on such apps [1]:
- I think a "dangerous environment" is the last place I would want an expensive item like an iPhone prominently displayed.
- ... app (might not be) readily available to access, on screen #8 somewhere on the 2nd or third row ...
- Pulling out your iPhone in the face of an attacker? That's one sure way to escalate the situation. And he now knows you have a nice phone too.
Reality check: Not only am I passionate about this problem, there does not exist a solution that works satisfactorily (i.e., there are real "pain points") ... to me, the existing options are just cosmetic apps, the ones you buy for keepsake but that do not serve any purpose to society.

Before discussing the next point, let me take a small detour and tell you what metric I use to determine the success of this startup [2]: at the end of the course (or maybe a few months past that), I am able to make an app that works seamlessly in real-life situations and which I can recommend to my loved ones without any hesitancy.

3. Opportunity evaluation:
Like many computer programmers, I have a habit of developing software that does exactly what I want but not what my audience would need. To ensure that didn't happen here, I sent out a survey asking people what they would like during times of emergency. The demographics of the participants were as follows:
- 53% males and 47% females
- 82% of participants were between 20-30 years old and 14% between 30-40 years old.
- 59% from US, 16% from India and 8% from China
- 24% had Bachelors degree, 47% had Masters degree, 27% were PhD students.
- 32% earned between $10K-50K, 28% earned between $50K-100K, 16% earned above $100K and 20% were not currently earning.

Here is the summary of responses:

The above figures show that the largest group of people (46%) wanted an external device like a bluetooth clicker that the victim can press when he/she is in danger. This was a good indicator that I should go ahead and spend some time building such an app.

The next step was to understand what features a user would want/need in times of emergency:

Since some of these features require backend services (for sending emails/SMS, managing accounts/the app), which need capital (I am not rich enough to maintain this kind of service on my own), I asked how much people were willing to pay for this kind of service. It was clear that most people preferred a buy-once kind of model:


Now, the checklist of opportunity evaluation:
a. Unique value proposition of my idea: "Affordable" and "hassle-free" way to connect to your friends, family and authorities in times of emergency ... with just a click of button.
b. Is it defensible: Nope, anyone with strong programming experience can replicate the features of my app. In fact, in the long run, that is exactly what I want ... lots of good apps and competition that eventually help reduce the number of sexual assaults and make the world a little safer place.
c. Is it profitable/sustainable: I really don't know :( ... It could be a product like dropbox, which people never thought they would want ... but once they got it, they can't imagine their life without it ... or it could be a total flop. The only way to be completely sure is by implementing it :)
d. Clearly defined customer: Young women traveling abroad or working late, senior citizens and frequent travelers.
e. Is it feasible and scalable: Yes, I used my experience in building large-scale cost-effective systems to build the backend. Also, my experience in C/C++ programming, knowledge of design patterns and user-friendly Apple documentation helped me: (a) learn the basics of Objective C in less than a week and (b) build the version 1.0 of the iOS app in less than a month.

Finally, I must reiterate the core principle of opportunity evaluation for a startup: it is not possible to know a priori whether an idea will turn out to be a good business or not. So, I decided to stick to the "affordable loss" principle and develop the app at as low a cost as I could. A few of the hacks I used to ensure "affordable loss" are as follows:
- Moving most of the computing to clients rather than server (so as not to buy overly expensive servers).
- Using pay-as-you-use services wherever I felt absolutely necessary (for example: Amazon web-services).
- Using GIMP to develop my own logo (which I had to learn btw :P) rather than hiring a designer [6].
- Using wordpress for the startup website rather than spending days perfecting the CSS to make it mobile-compatible or hiring a web-developer.
- Focusing on minimum viable product (using Texas Instrument's SensorTag) rather than prematurely buying tons of bluetooth clicker from China [12].
- Using in-app purchases rather than setting up credit-card system in my website to provide features which other services charge me (for example: SMS/Email) [10].
- Not running after patents early on in the venture [5].
- Buying a ready-made icon set rather than designing the icons yourself [7].
- Choosing a hosting plan that has no hidden fees and that supports your choice of backend services.
- Using gmail as the support email (rather than one provided by the hosting service) and adding a feedback button in the app as well as a contact form on the website (with some kind of captcha [13]). One tip I have for new developers: try to minimize the number of clicks/keystrokes needed to send feedback, for example by pre-filling the "to-" address as well as the subject line.
- Utilizing the membership benefits of Apple/Google development program, i.e. off-loading testing [11], advertising/cross-promotion, expert feedback, reliable delivery of your software and version management.

References / footnotes:
[1] http://reviews.cnet.com/8301-19512_7-57452393-233/iwitness-app-claims-to-be-ultimate-deterrent-to-crime/
[2] Rather than use metrics like expected profit/revenue after first year or public offering or something on similar lines.
[3] I have already got 4 other PhD students from Rice University on board to develop the Android/Windows versions of the app. This is in line with another principle of Effectual Entrepreneurship: form partnerships.
[4] The intent of this statement is not flattery, just an observation. Here is why I think so: though I never deliberately tried to enforce it, my circle of influence consists of three types of people: those who are genuinely nice, those who are smarter than me, and non-self-destructive people. Of course, the categories are not mutually exclusive, and in fact people who are smarter are usually genuinely nice (probably because they only focus on self-upliftment, not on pulling others down). Here is a layman example most people can relate to: even in a course with relative grading, a smart person knows that he/she is a student of global class and is not threatened by his/her class-mates' progress ... which is probably why you will rarely find such a student shying away from group discussion (so as to gain advantage) or deliberately spreading false information to sabotage others' grades. Maybe things are not so black-and-white in other fields, which is one of the reasons why I absolutely love research and development.
[5] A patent attorney who attended our class thought one of the features of our app is patentable, which might be true. But there are 2 problems with going that way: (a) the real cost of a patent is not in getting it, but in defending it; (b) it will limit the features that other programmers who are smarter than me can introduce in their personal safety apps (which goes against my success criteria for this particular problem). To be completely honest, there are ways around the low-capital issue for those startups who really want to get a patent: (1) file a provisional patent yourself for under $200 (just read about how to define the scope of your patent and also about court dates), (2) ask your parent organization or an angel investor to file a patent for you in exchange for royalties or a stake in the startup (for example: Rice university's OTT office), (3) contact a patent troll to defend your patent.
[6] For people with a little more budget, there are websites like http://99designs.com/ and https://www.elance.com/ that allow you to hire free-lance designers/developers. A similar website for building mock prototypes of your products to show to investors is http://www.protomold.com/.
[7] There is always a tradeoff between time and money; you just have to figure out your exchange rate for time ;) ... for example: if someone is providing a service that will save you 1 hour, how much are you willing to pay for it?
[8] Whether building your startup "based on your existing means" is better than "based on your passion and then expanding your means as you go" or vice versa, I really don't know. There are obvious advantages for both and also obvious disadvantages when pushing the respective principle to extreme.
[9] Even though you might think an idea is pretty good (and will benefit the customer), people are not willing to pay as much as you think ... probably because of either suspicion that you are trying to dupe them, or incorrect valuation of the product/market on your end, or something else.
[10] Though it might seem simple to just plug in an existing credit-card library and set up a php page with a mysql backend, things get a little complicated when you start thinking about transactional semantics and the fact that people can buy new devices or delete the app, and similar situations.
[11] In a traditional company, a developer would be evaluated based on stupid metrics, such as bugs assigned, solved, etc. Though on paper they seem apt, they can have harmful side-effects for a startup, such as spending too much time perfecting a feature without any customer validation/feedback. So, instead of spending significant resources on testing, submit your app to Apple as and when you add a new feature and indirectly ask them to test it :) ... Another way to test cheaply is by using crowd-sourcing websites such as Amazon's Mechanical Turk (which I will ignore for this post).
[12] Since the SensorTag took a while to be delivered to my home address (thanks to my apt complex rejecting the package), I decided to use an accessory that I already owned for version 1.0 (i.e. headphones) and introduce bluetooth feature in the next version.
[13] There are a lot of wordpress plugins that allow you to add a captcha to your website in just a few minutes.

Thursday, February 02, 2012

Cross-validation on SVM-Light

Before reading this blogpost you should read: the practical guide to libSVM, the Wikipedia entries on SVM and cross-validation, and Bishop's (PRML) chapter on SVM. In this post, I will not explain how to use open source SVM tools (like libsvm or svmlight); I will rather assume that you have used them and want some clever script to automate them. Also note that the below script uses default parameters, so if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to ./svm_learn.

To perform cross-validation on libsvm:
  1. Download and extract it.
  2. You need gnuplot installed on the machine where you want to run libsvm. If you want to run it on a server, do 'ssh -X myuserName@myMachineName'.
  3. Run make.
  4. Go to the tools folder and run 'python easy.py my_training_file my_test_file'.
To perform cross-validation on svmlight:
  1. Download and extract it. There you will find two important binaries: svm_learn and svm_classify
  2. Unfortunately, data gathering using svmlight is not as easy, so I have coded the following script.
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5

echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4 > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"

TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: \([0-9]*.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'
for k in `seq 1 $NUM_CYCLES`
do
 sort -R $DATA_FILE > $TEMP_DATA_FILE
 head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
 tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
 
 echo "------------------------------------------"  >> $RESULT_FILE
 echo "Cross-validation cycle:" $k >> $RESULT_FILE
 
 # first run svm with default parameters
 echo ""  >> $RESULT_FILE
 echo "Polynomial SVM with default parameters" >> $RESULT_FILE
 for i in 1 2 3 4 5 6 7 8 9 10
 do
  echo "order:" $i >> $RESULT_FILE
  ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

 echo ""  >> $RESULT_FILE
 echo "RBF SVM with default parameters" >> $RESULT_FILE
 for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
 do
  echo "gamma:" $g >> $RESULT_FILE
  ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

done

rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE

Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt

Each line of myDataInSVMLightFormat.txt looks like "classLabel(-1 or 1) featureNum:value ....". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06 

After running this script you will get two important files: MyResults.txt and betterMyResults.txt (which is a space separated file with format: 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt in R or Matlab and get some pretty graphs :)
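As mentioned at the top of this post, the script runs svm_learn with default parameters. A minimal sketch of the outer parameter loop could look like the following; the parameter grids and file names (train.dat, model_*.dat) are illustrative, not the ones from my experiments, and the `echo` makes this a dry run:

```shell
# Sketch of an outer loop over svmlight's C and gamma parameters.
# Grids and file names here are hypothetical; drop `echo` to actually
# train one model per parameter combination.
for c in 0.1 1 10; do
  for g in 0.01 0.1 1; do
    echo "./svm_learn -t 2 -c $c -g $g train.dat model_c${c}_g${g}.dat"
  done
done
```

Each resulting model can then be scored with svm_classify exactly as in the script above.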

Wednesday, February 01, 2012

Using VIM for "fast" data transformation

Recently, I had to give my data to a lab-mate. I had done some pre-processing on the 20 Newsgroups dataset and converted it to the following format:
- Each line represented a document.
- Each word was replaced by an integer denoting its index in the vocabulary.txt file.
- Each line was prefixed with a binary class label (1 or -1) based on whether the document was in a "comp" newsgroup or a "sci" newsgroup.
- Each word was listed along with its frequency.
For example, let's assume the word "technology" is the 1002nd word in the vocabulary. Then, if "technology" appeared 10 times in the 12th document (in the folder comp.graphics), the 12th line in my file would be:
1 1002:10 ...

Here are two sample lines from the file:
1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1
1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1
Now, I had to convert this to following format:
fileNum | wordIndex | frequency
For example:
3|3798|1
3|9450|1
3|12429|1
3|13352|1
3|14155|1
I used the following VIM tricks:
1. First, delete the binary class labels (and also blank spaces at the end of each line):

:%s/^-1//g
:%s/^1//g
:%s/ $//g
2. Then comes the most important part: inserting the line number before every word.
To do this, I used the following trick:
- First, replace the spaces with some arbitrary character ('=' in my case)
- Then, replace that character with the line number !!!
:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"
The last command needs some explanation: 
It says: execute a command globally whenever you see the '=' character: :g/=/exec some-command
In our case, the command replaces that character with the line number. The line number of a given line can be found using VIM's internal function line(".").
Hence, the input file is converted into following format:
3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
  4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1
3. Now, some cleanup and trivial replacements:
:%s/:/|/g
:%s/  / /g
:%s/^ //g
4. Finally, replace spaces by a newline character:
:%s/ /!!!!!!/g where !!!!!! stands for pressing Control+V and then pressing Enter 
Control-V is VIM's literal-character escape key. Remember that the escape sequence '\n' won't work in the replacement here ('\r' would also work in place of the literal newline).

And voila you are done !!!
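If you prefer a non-interactive route, the whole transformation above can be done in one pass with awk. A sketch (the two input lines here are shortened versions of the samples above; field 1 is the class label, NR is the line number):

```shell
# One-pass equivalent of the VIM session: for every word:freq pair
# (fields 2..NF), print lineNumber|wordIndex|frequency, skipping the
# leading class label (field 1).
printf '1 3798:1 9450:1\n-1 90:1 1078:1\n' |
awk '{ for (i = 2; i <= NF; i++) { split($i, a, ":"); print NR "|" a[1] "|" a[2] } }'
# Output:
# 1|3798|1
# 1|9450|1
# 2|90|1
# 2|1078|1
```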

Tuesday, January 17, 2012

Iris - siri like software

Recently I have started a pet project Iris1 while doing background study for my recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.


Portable:
Since it is written in Java, it can run on most OSes, including Mac, Windows and Linux. I have only tested it on Mac OS X Lion, so I welcome any beta testers on other OSes. For now, some features, like playing iTunes, opening email, etc., are only available on Mac. I plan to add these features on other OSes as well.

Configurable:
Most of the features, such as the use of VOICE or TEXT as the feedback mechanism, as well as the feedback messages, are configurable through the configuration file "user_scripts/iris.config". You can also modify any of the predefined AppleScripts (e.g. playSong.scpt) in that folder, which are called by Iris on the corresponding voice command. Also, for advanced users, there are ten function commands that can be used to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run the following command:
 
java -Xmx200m -jar bin/Iris.jar
The above command asks Java to run the program Iris.jar (in bin folder) and also specifies that it should not use more than 200 MB of memory.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if you have no internet connection. It also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators who could help me with the following features:
  1. User-friendliness: Improving voice commands
  2. More accuracy: Expanding the dictionary to include more words and adding different 'user accents' for a given word.
  3. Expanding the grammar to include arbitrary spoken words. 
  4. AppleScripts for accessing contacts and recognizing contact names, to assist users with various functionality.
  5. Adding a GUI rather than the command line. Since all the responses go through OutputUtility.java, this is going to be easy.
  6. New features: Support for RSS, top news, weather, popular city names, dictionary/thesaurus, web/Wikipedia search, dictation (to texts/messages/events/todo items/alarms/reminders), jokes, motivational quotes, more personalization.
If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
1 Iris is the name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also an anagram of the word 'Siri' (Apple's popular software for the iPhone 4S).

Monday, January 02, 2012

C++ workflow

Recently I had a memory bug that took me two days to figure out. Valgrind gave no error message (well, to be honest, it did give a warning that got lost in my log messages). gdb pointed to a completely random place, which led me to believe that STL's stringstream was the culprit. Since there was no way I could definitively say it was not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of the code with a less elegant solution using arrays. Now gdb started pointing to STL's vector.clear() method, and believe it or not, I was storing std::string in that vector, not some user-defined class …. I had it … there was no way in the world I was going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector<std::string>). But not knowing the bug in a 'single threaded', 'command-line based' C++ program, with access to gdb and pretty printing, was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and a hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me: due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is that this is one of the few memory bugs that valgrind does not report in its final assessment; my best guess is that I was writing over the vector's header location (and the vector only complains about that during its next operation, which in my case was clear() ... which for all intents and purposes is random).

Embarrassed by this stupid mistake that wasted two of my days, I was humbled into revisiting my C++ workflow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your makefile that does static analysis (make my_proj).
- Test it on local machines using 'make my_test'
- Run valgrind if necessary and check against below given valgrind checklist
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

  1. Never ever start coding without an SVN (or cvs or git, etc.). It will save you a lot of headaches if your machine crashes or if you want to revert to a previous version.
    If the project has multiple coders/testers/collaborators, it helps to have some kind of project-management tool like trac or jira. Also, if you are going to release your code to the public, you might want to consider project hosting sites like Google Code or SourceForge.
    Though this might seem obvious, don't host the svn server on a personal machine, or at least make sure to take regular backups if you do.
  2. Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc.) and a user-friendly README file. (Advanced coders can consider further build automation.) Here is a sample Makefile from my current project:
     
    CC = g++
    DEBUGFLAGS = -g -Wall -Wextra
    MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
    LIBS = -lgsl -lgslcblas -lm -lstdc++
    
    HEADER = Helper.h MathHelper.h LDA.h Configuration.h
    SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
    OUTPUT = spoken_web
    
    spoken_web:     $(SRC) $(HEADER)
            $(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)
    
    doc:    $(HEADER) spokenweb_doxygen_configuration
            doxygen spokenweb_doxygen_configuration 
    
    r_data: GenerateData.R
            R --slave < GenerateData.R
    
    style: $(SRC) $(HEADER)
            astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)
    
    clean:
            rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt
    
  3. Use a document generator like doxygen. It forces you to write much cleaner documentation and, more importantly, a cleaner interface. Trust me, it will help if you ever have to reuse your code in the future.
    make doc
    
  4. Use a code beautifier like astyle. It not only shields you against an inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading'. (Don't you clean gym equipment after a workout? Well, why not do the same with your code?) Here are sample style rules that I love:
     
    --style=java
    --indent=spaces=2
    --indent-classes
    --indent-switches
    --indent-preprocessor
    --indent-col1-comments
    --min-conditional-indent=0
    --max-instatement-indent=80
    --unpad-paren
    --break-closing-brackets
    --add-brackets
    --keep-one-line-blocks
    --keep-one-line-statements
    --convert-tabs
    --align-pointer=type
    --align-reference=type
    --lineend=linux
    --suffix=none
    
    If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
    make style
    
  5. Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
    • g++'s flags (See DEBUGFLAGS in my makefile)
    • Static analysis tools like cppcheck (For most cases, I am satisfied with the above g++ flags)
    • Runtime memory analysis tools like valgrind. (Please see the Valgrind checklist below, and also my previous blogpost).
    • gdb: For stepping through the code for more comprehensive debugging
    There are several paid tools that do a much better job of analyzing your code than those mentioned above.
  6. I like to follow standard directory structure:
    ./           Makefile and configure scripts
    ./src       cc , h files (Use subdirectories and namespaces if necessary)
    ./bin      build directory
    ./data    data required for testing/running the code
    ./test      test drivers
    ./doc     documentation generated by doxygen
  7. Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have a good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for Bayesian programs; it has a very ugly interface and requires some amount of book-keeping. So, I have coded a wrapper class that gives an R-like interface, which makes my code extremely readable and less error prone, and I don't need to worry about any book-keeping or other numerical issues while calling it.
     
    /// \defgroup StatDistribution R-like API for family of probability distributions
    /// Naming convention:
    /// I have followed the naming conventions of R programming language. 
    /// Please see its documentation for further details
    ///
    /// Here is a brief intro:
    /// Every distribution that R handles has four functions. There is a root name, 
    /// for example, the root name for the normal distribution is norm. 
    /// This root is prefixed by one of the letters
    ///   - d for "density", the density function
    ///   - r for "random", a random variable having the specified distribution
    /// Example: For the normal distribution, these functions are dnorm, and rnorm.
    ///
    /// Important note: Since most bayesian calculation is done in log space, (almost) 
    /// every density function is capable of outputting a 'numerically robust' logarithmic value. 
    /// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
    ///
    /// Supported continuous distributions:
    /// - Uniform (unif - r,d)
    /// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
    /// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
    /// - Beta (beta - r,d)
    /// - Inverse Gamma (invgamma - r,d)
    ///
    /// Supported discrete distributions:
    /// - Discrete Uniform (dunif - r,d)
    /// - Dirichlet (dirichlet - r,d)
    /// - Multinomial (multinom - r,d)
    /// - Poisson (pois - r,d)
    /// @{
    
    /// @name Continuous Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max)
    double runif(double min, double max);
    double dunif(double x, double min, double max, bool returnLog = false);
    ///@}
    
    /// @name Discrete Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max) --> i.e. [min, max-1]
    int rdunif(int min, int max);
    double ddunif(int x, int min, int max, bool returnLog = false);
    ///@}
    
    /// @name Univariate Normal Distribution
    ///@{
    double rnorm(double mean, double sd);
    double dnorm(double x, double mean, double sd, bool returnLog = false);
    ///@}
    
    /// @name Multinomial distribution
    /// Both prob and retVal should be of length sizeOfProb
    /// size, say N, specifying the total number of objects that are put into 
    /// K (or sizeOfProb) boxes in the typical multinomial experiment.
    /// If the array prob is not normalized then its entries will be treated as 
    /// weights and normalized appropriately. Furthermore, this function also allows 
    /// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
    /// Eg: double prob[3] = {0.1, 0.2, 0.7}; double retVal[3]; rmultinom(1, 3, prob, retVal, false);
    ///@{
    void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
    double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
    ///@}
    
    So to sample N(10,2), instead of having code like:
     
    gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
    double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
    gsl_rng_free(rng_obj);
    
    I use the following code (which, if you have coded in R, is more readable):
     
    StatisticsHelper myStatHelper;
    double sampledVal = myStatHelper.rnorm(10, 2);

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:
  1. The heap summary (only returns memory leaks, not memory corruption errors) Eg:
     
    ==19691== HEAP SUMMARY:
    ==19691==     in use at exit: 0 bytes in 0 blocks
    ==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
    ==19691== 
    ==19691== All heap blocks were freed -- no leaks are possible
    
    If the number of allocs and the number of frees differ in the output, then try
    valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

    Steps to avoid memory leaks:
    - If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
    - Delete every object created by new (especially in case of exceptions).
  2. Intermediate warning/error messages irrespective of the heap summary. Eg:
     
    ==15633== Invalid write of size 4
    ==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
    ==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
    ==15633==    by 0x402E26: main (main.cpp:34)
    
    Note, sometimes one error can propagate and throw error messages at different places, so don't panic.

    The intermediate error messages can detect:
    • Illegal read / Illegal write errors (like the message 'Invalid write of size 4' above): It can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with the '--read-var-info=yes' flag.
      Common memory corruption bugs like these are:
      - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
      - Using an unallocated object
       
      my_class* obj_ptr;
      obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
      
      - Using a freed object
      Note: Memcheck (valgrind's default tool) does not perform bounds checking for stack or global arrays, so the absence of this error/warning message does not mean that there are no memory corruption errors.
    • Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
    • Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
    • Illegal free/deletions: (Invalid free())
      - Calling free() (or delete) on the object that is already freed (or deleted).
      - Deleting object that is created on stack.
    • Mismatched operators: (Mismatched free() / delete / delete [])
      - Calling free() on the object that is created by new operator.
      - Calling delete on the object that is created by malloc().
      - Calling delete[] on the object that is created by the new operator.
    • Overlapping source and destination blocks: (Source and destination overlap in ...)
      - Look out for functions like memcpy, strcpy, strncpy, strcat and strncat.
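Putting the flags from this checklist together, a typical all-in-one invocation looks like the following (shown as a dry run with echo, and with a hypothetical program name ./my_program; drop the echo to actually run it):

```shell
# Consolidated memcheck invocation combining the flags discussed above.
# `echo` makes this a dry run; remove it to execute against a real binary.
echo valgrind --tool=memcheck --leak-check=yes --show-reachable=yes \
     --track-origins=yes --read-var-info=yes ./my_program
```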

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

Thursday, July 23, 2009

Mistakes you can make in your first job

1. Not taking time to study the fundamentals.
Eg: If you are expected to write C++ programs, first learn C++. Take time to understand classes, inheritance, composition, etc. By understand, I mean you not only need to know the syntax, but also to imagine the scenarios in which they are useful (and in which they are not). If you don't know STL or smart pointers, believe me, you would be re-inventing some of their code inefficiently :) and your code might be prone to bugs.

2. Google search before you ask your team lead or team members. This protects you from asking stupid questions or, worse, committing stupid mistakes.

3. Under-estimating the importance of the "requirement gathering" and "design" phases, aka starting to code immediately. A good programmer always takes time to understand the problem and create a big picture of the project in their head.

4. Trying too hard to present what you are not. It is extremely difficult to live a lie. Remember honesty should be coupled with extreme desire to improve yourself and also hard-work.

5. Blaming others to improve your chances of staying in the job. Believe me nobody likes a snitch. If you commit a mistake, be brave and accept the responsibility. Be proactive and ask the mentor for what you can do to improve your chances instead.

6. Disregarding "Murphy's law" and not clearing your expectations with team members.
If you think there is the slightest possibility of a team member committing a certain mistake or making a certain wrong assumption, clarify it.

7. Delving into details without getting the big picture (kind of re-statement of point 3). But this is so common that I need to mention it again.

8. While using new software library, not differentiating the interface part and core problem part of the project.

9. Ignoring the coding standards of the company. If your company does not have one, try to follow the coding standards given in the book C++ Coding Standards: 101 Rules, Guidelines, and Best Practices.

10. Reinventing the wheel in your design, aka not knowing what design patterns are. If you don't know what design patterns are, read Head First Design Patterns. If you have already taken a course in design patterns, then revisiting the Gang of Four book won't hurt.

11. Not using version control. Chances are that you will break working code when you try to add new functionality. If you are using version control, remember to add messages like "Working version 1.4", etc., and learn how to revert to a given version. If you are not allowed to use version control, make copies whenever you implement and test new functionality.

12. Not discussing your design with others, aka school-children syndrome. Think of it as sharing knowledge rather than trade secrets; by explaining or helping others, you improve your own knowledge base.

13. Not respecting chain of command. Always give credits to your mentors, even if you think they have not helped you in it. It is even worse, if you complain against your mentors to their managers. Remember habit 4 (Think Win-win) of 7 habits of highly effective people.

Two books you must read and re-read before joining the job:
1. The 7 Habits of Highly Effective People by Stephen Covey. (Summary)
2. Getting Things Done by David Allen. (Summary)

Sunday, June 21, 2009

Shuffle functions

A very frequently asked question is "Write a program to shuffle a deck of cards". There are many approaches to solving this problem.

Almost all of them treat the deck of cards as a list and write a program to shuffle the list.

One approach is to emulate the human shuffling:
Hindu Shuffle; Riffle shuffle

Other techniques are discussed in more detail on Wikipedia.

Here is an interesting but not optimal way (O(n log n)) to shuffle a list. (Note that a random comparator is not a valid ordering, so sort-based shuffles like these are not uniformly random and can even break some sort implementations.)
In OCaml:
let shuffle l1 =
  let sortComparator x y =
    match Random.int 3 with
    | 0 -> 0
    | 1 -> -1
    | 2 -> 1
    | _ -> failwith "Error in shuffle"
  in
  List.sort sortComparator l1
;;

C++ Code:

#include <cstdlib>
#include <vector>
#include <algorithm>

// Note: a random comparator violates sort's strict-weak-ordering
// requirement; this mirrors the OCaml version above, with the same caveat.
template <class T>
bool sortComparator(T x, T y) {
  return std::rand() % 2 == 0;
}

template <class T>
std::vector<T> shuffle(std::vector<T> cards) {
  std::sort(cards.begin(), cards.end(), sortComparator<T>);
  return cards;
}
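As an aside, if your "deck" is simply lines in a file (as with the sort -R trick in the SVM post above), coreutils' shuf gives an unbiased O(n) shuffle directly. A small sketch; piping back through sort only demonstrates that the output is a permutation of the input:

```shell
# shuf performs an unbiased Fisher-Yates shuffle of its input lines.
# Sorting the shuffled output (LC_ALL=C for a stable order) recovers the
# original deck, confirming the result is a permutation of the input.
printf '2C\n3D\nAH\nKS\n' | shuf | LC_ALL=C sort
# Output:
# 2C
# 3D
# AH
# KS
```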

Friday, March 06, 2009

How to access the machine in CISE labs @ UF

Here is CISE's official webpage for accessing CISE public machines: http://www.cise.ufl.edu/help/access/index.shtml

Here is quick summary:
To access the linux or solaris CISE machines, you can 
SSH using "PuTTY" to thunder.cise.ufl.edu (or sand, rain, bay, shine, storm, ...)
or SCP using "WinSCP"

To access Windows machine, you can remote desktop (or rdesktop) to new-mercury.cise.ufl.edu

If you have an office in the CISE labs, you can access your machine using the following steps:
Note: You will need a CISE account and the password to the local machine. :)

Replace 10.227.170.147 below with your machine's IP address.
 
To be done once: install a VNC server and openssh on the lab machine, and a VNC client on your home machine

Start the VNC server on the machine (in the CISE lab), either locally using the command vnc4server :n (where n is some number, e.g. :2), or on the remote machine via these steps:
1. Open PuTTY
2. SSH to rain
3. From rain, SSH to 10.227.170.147
4. Run vnc4server :n (where n is some number, e.g. :2)

Then, connect from your home machine:
1. Open PuTTY
2. Go to Connections > SSH > Tunnels
Source port: 5901
Destination: 10.227.170.147:5901
3. Connect to rain.cise.ufl.edu and leave the session as it is
4. Open the VNC client and connect to 127.0.0.1:1
5. Enter the password in the small dialog box
6. Run gnome-session (if you get a terminal, you can skip this step by configuring the VNC server settings)
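For reference, the PuTTY tunnel in steps 1-3 above is equivalent to a single OpenSSH command from your home machine (the hostname and IP are the ones used in this post; shown as a dry run with echo, drop the echo to open the tunnel):

```shell
# -L forwards local port 5901 to the office machine's VNC port via rain.
# Dry run: remove `echo` to actually open the tunnel.
echo ssh -L 5901:10.227.170.147:5901 myuserName@rain.cise.ufl.edu
```

With the tunnel up, point your VNC client at 127.0.0.1:1 as in step 4.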

Sunday, September 14, 2008

GatorMemo version 3.0 is ready !!!

After I decided to do a PhD, I always wanted to improve my study process (especially my notes management system). After several discussions with my internship manager Raman Iyer, and my friends Raja Nadar and Kishore Malani, I decided to go ahead with GatorMemo. It is built on the principles of SuperMemo (spaced learning) and mindmaps.

It started out as a Windows Forms application, which resulted in version 2.0. I felt the user interface was too complicated for a normal user, so I decided to build it again from scratch using WPF instead of WinForms. Functionally, there is no difference between the two versions (except that the WPF version does not have the ability to play audio/video files; I might add it as soon as I get time to work on it).

Suggestions:
1. Read a book, research paper, or go through lectures/podcasts/videos, etc ...
2. Then, create a mindmap after completely understanding the topic. I usually refer to 3 or more books and then organize my thoughts into a mindmap. I strongly suggest putting a huge amount of effort into creating a good mindmap.
3. Now, create a card for each node. The front side of the card can be a question about the node, or simply the node name, and the back side can be the answer or a description of the topic that the node represents.
4. I recommend adding mnemonics and a bibliography to the mindmap and cards. Mnemonics will help you recollect the subject matter, and the bibliography will be extremely useful if you are planning to write a research paper.
5. Periodically review the cards using Autolearn under the Action tab (this software uses the SuperMemo algorithm developed by Piotr Wozniak).
6. I also suggest using lateral thinking, especially if you are planning to write research papers.

Here is the link to the version 3.0 setup (I encourage you to use this version):
http://niketan2.googlepages.com/GatorMemoSetup.msi

You can import the following file (see the Action tab) to see how the software works :)
http://niketan2.googlepages.com/OSMindMap.zip

It is still in beta, so expect bugs / exceptions :) ... (In fact, I am so busy with my studies/assignments that I did not have time for any kind of corner-case testing.) Please mail me if you find bugs and help me improve the software.

If you are interested in improving the software, go ahead and download the code and modify it.

Monday, April 07, 2008

Random C++ tricks - GDB, Valgrind and Makefile

1. Perfect programming is hard - even the best programmers make mistakes.
2. A line of code that you haven't tested is a line of code with a bug.

Cardinal rule for debugging: If you are using gcc or g++, compile your program with -g option.
(Btw, to link a library with gcc, you use -l option. Eg: to link math library, use:
gcc filename -o outputfile -lm)

Segmentation Faults:

Any access outside the memory area assigned to the program will cause a segmentation fault.

Common causes of segmentation faults:
1. Dereferencing
a. NULL
b. an uninitialized pointer
c. a pointer that has been freed (or deleted, in C++) or that has gone out of scope (in the case of arrays declared in functions)
2. writing off the end of an array
3. Calling a recursive function that uses all of the stack space (On some systems, this will cause a "stack overflow" report)

Attacking segfaults:

1.a Dereferencing NULL

$gdb myPgm core
The core file contains all the information needed by GDB to reconstruct the state of execution when the invalid operation caused a segmentation fault.

It will o/p:
Some copyright info
Core was generated by `myPgm'.
Program terminated with signal 11, Segmentation fault.
Some information about loading symbols
#0 0x0804838c in foo() () at myPgm.cpp:4
4 *x = 3;
This tells us that the execution stopped inside the function called foo() on line 4, which happened to be the assignment of the number 3 to the location pointed to by x.
(gdb) list
1 void foo()
2 {
3 char *x = 0;
4 *x = 3;
5 }
6
7 int main()
8 {
9 foo();
10 return 0;
(gdb) print x
$1 = 0x0

A slightly more complicated example: execution crashes inside a system call or library function (perhaps because we passed an uninitialized pointer):
#0 0x40194f93 in strcat () from /lib/tls/libc.so.6
(gdb) bt
#0 0x40194f93 in strcat () from /lib/tls/libc.so.6
#1 0x080483c9 in foo() () at t.cpp:6
#2 0x080483e3 in main () at t.cpp:11
Now we are in strcat's stack frame, but we want to see the variable that was passed to strcat:
(gdb) up
#1 0x080483c9 in foo() () at t.cpp:6
6 strcat(x, "end");
(gdb) print x
$1 = 0x0
A common mistake is to not check the return from malloc to make sure that the system isn't out of memory. Another common mistake is to assume that a function that calls malloc doesn't return NULL even though it returns the result of malloc. Note that in C++, when you call new, it will throw an exception, bad_alloc, if sufficient memory cannot be allocated. Your code should be prepared to handle this situation cleanly, and if you choose to catch the exception and return NULL inside a function that ordinarily returns a new'ed pointer, this advice still holds.

1.b Dereferencing an uninitialized pointer
Figuring out whether or not a pointer has been initialized is a bit harder than figuring out whether a pointer is NULL. The best way to avoid using an uninitialized pointer is to set your pointers to NULL when you declare them (or immediately initialize them).

You might need to figure out if 0x4025e800 is valid memory. One way you can get a sense of this in GDB is by printing out the addresses stored in other pointers you've allocated. If they're fairly close together, you've probably correctly allocated memory. Of course, there's no guarantee that this rule of thumb will hold on all systems.

In some cases, your debugger can tell you that an address is invalid based on the value stored in the pointer.
(gdb) print x
$1 = 0x1e <out of bounds>
(gdb) print *x
Cannot access memory at address 0x1e
1.c Dereferencing Freed Memory
This is another tricky bug to find because you're working with memory addresses that look valid. The best way to handle such a situation is again preventative: set your pointer to point to NULL as soon as you've freed it.

Another form of this bug is the problem of dealing with memory that has gone out of scope.
char *return_buffer()
{
char x[10];
strncpy(x, "a string", sizeof(x));
return x;
}
This is a really tricky bug to find because once again the memory address will look valid when you print it out in GDB.

If that pointer is causing you trouble, check the function and look for whether the pointer is pointing to a local variable in the function. Note that it is perfectly fine to return a pointer to memory allocated in the function using new or malloc, but not to return a pointer to a statically declared array (e.g., char x[10]).

2. Writing off the end of an array
Valgrind will help you catch this bug on dynamically allocated arrays; see point 2 in the Valgrind section below.

Trick:
You may notice that some of your variable values change periodically and unexpectedly. This is a tough bug to crack; one option is to set up your debugger to watch a variable for changes and run your program until the variable's value changes. Your debugger will break on that instruction, and you can poke around to figure out whether that behavior is expected.
(gdb) watch [variable name]
Hardware watchpoint 1: [variable name]
(gdb) continue
...
Hardware watchpoint 1: [variable name]

Old value = [value1]
New value = [value2]
This approach can get tricky when you're dealing with a lot of dynamically allocated memory and it's not entirely clear what you should watch.

3. Stack overflow:
To diagnose a stack overflow in GDB, typically you just need to do a backtrace. If you find a single function call piling up an awfully large number of times, this is a good indication of a stack overflow.
(gdb) backtrace
#0 foo() () at t.cpp:5
#1 0x08048404 in foo() () at t.cpp:5
#2 0x08048404 in foo() () at t.cpp:5
#3 0x08048404 in foo() () at t.cpp:5
[...]
#20 0x08048404 in foo() () at t.cpp:5
#21 0x08048404 in foo() () at t.cpp:5
#22 0x08048404 in foo() () at t.cpp:5
---Type <return> to continue, or q <return> to quit---
Typically, you need to analyze your recursive function to make sure that all the base cases (the cases in which the function should not call itself) are covered correctly.

Other cool tricks with gdb:
We discussed bt, p (print), list, up

To set a breakpoint, type break <line number>
<line number> can be <sourcefile>:<line number>

Use n (or next) to progress through the loop one line at a time, r to run the program and c to continue.

There is also another set of commands, although they are not very commonly used; these commands actually modify the program as it is running. The call command, for example, will call a particular function; the set variable command will set a variable in the program to a particular value, for example, set variable i = 0. Return will make the current function return to the caller.

Using valgrind:

1. Finding memory leaks:
Compile your program with the -g option.
$valgrind --tool=memcheck program_name
If the number of allocs and the number of frees differ in the output, then try

$valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

O/p is somewhat like:
==pid== 100 bytes in 1 blocks are definitely lost in loss record 1 of 1
==pid== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==pid== by 0x804840F: main (myPgm.c:5)
This means that there is a memory leak at line 5 in myPgm.c

2. Finding invalid pointers:
Warning of form:
'Invalid read of size X' or 'Invalid write of size X'
where X is the number of bytes we try to read or write.

Eg: if you allocate an array with malloc or new and then try to access a location past the end of the array:
#include <stdlib.h>

int main()
{
char *x = malloc(10);
x[10] = 'a'; /* valid indices are 0..9 */
return 0;
}

$valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name
This will give o/p:
==pid== Invalid write of size 1
==pid== at 0x804841E: main (myPgm2.c:6)
==pid== Address 0x1BA3607A is 0 bytes after a block of size 10 alloc'd
==pid== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==pid== by 0x804840F: main (myPgm2.c:5)
3. Detecting the use of uninitialized variables:
Warning: 'uninitialised' value(s)

Eg:
#include <stdio.h>

void foo(int x)
{
if(x < 10)
{
printf("x is less than 10\n");
}
}

int main()
{
int y;
foo(y); /* y is used uninitialized */
return 0;
}

o/p:
==pid== Conditional jump or move depends on uninitialised value(s)
==pid== at 0x8048366: foo (myPgm3.c:5)
==pid== by 0x8048394: main (myPgm3.c:14)
4. Other improper use of memory:

4.a: If you call free twice on the same pointer value, Valgrind will detect this for you; you'll get an error:
Invalid free()
This is also called a "double free" (and is exploitable as a "double free attack"):
free(x);
/* code */
free(x);
The easiest way to avoid it is simply to set your pointer to NULL once you've freed it (calling free on a NULL pointer is defined to do nothing):
free(x); x = NULL;
/* code */
free(x);

4.b: free should be matched with corresponding malloc. Similarly, delete with new and delete[] with new[].
If you mismatch them, you will get the following error:
Mismatched free() / delete / delete []
What won't Valgrind find?
Valgrind doesn't perform bounds checking on static arrays (allocated on the stack).
Eg:
int main()
{
char x[10];
x[11] = 'a';
}
Valgrind won't alert you!

Also, if you don't test for buffer overflows by using long input strings, Valgrind won't tell you that your code is capable of writing over memory that it shouldn't be touching.

------------------------

Some common bugs in C/C++:
1. Uninitialized variables
2. Using a single equal sign to check equality:
while(x='Y') { ....}
3. Extra Semicolons:
int x;
for(x=0; x<100; x++);
Do something with x;
4. Overstepping array boundaries (especially if it is a static array):
int array[10];
//...
for(int x=1; x<=10; x++) do something with array[x];

-------------------------
Make file: Simple eg (note: in a real makefile, each command line under a target must begin with a tab character):
CC = gcc
FILES = in_one.c in_two.c
OUT_EXE = out_executable

build: $(FILES)
	$(CC) -o $(OUT_EXE) $(FILES)

clean:
	rm -f *.o core

rebuild: clean build

Advanced eg:
1. $@ is the name of the target.
client: client.c
$(CC) client.c -o $@

server: server.c
$(CC) server.c -o $@


List of dependents:
$? - dependencies more recent than the target (i.e., those that have changed since the last time make was invoked for the given target)
$^ - all dependencies, regardless of whether they are more recent than the target, with duplicate names removed
$+ - like $^, but it keeps duplicates
$< - the first dependency

Wildcard Matching in Targets (percent sign, %)
When a % appears in the dependencies list, it replaces the same string of text throughout the command in makefile target. If you wish to use the matched text in the target itself, use the special variable $*.
Eg:
%.c:
gcc -o $* $*.c
Implicit Targets
.c:
$(CC) -o $@ $@.c
This rule says that for any target that corresponds to a .c file, make should compile it using the name of the implicit target as the output, and the name plus the .c extension as the file to compile.

Replacing Text
To specify that OBJ is equivalent to SRC, except with the .c extension replaced with a .o extension:
OBJ = $(SRC:.c=.o)

Visit CProgramming.com for more details

Sunday, January 13, 2008

Microsoft Interview

I had an interview for an internship at Microsoft. I was interviewed by SQL Server Analysis Services and SQL Server Reporting Services (I received offers from both :) ).

I was briefed by my recruitment coordinator at building 19. I had my interviews in building 34 and 35.

My interview questions were pretty easy. Here are some of them:
1. Write a program to create a balanced binary tree from a sorted array of integers.
>> Find the middle element and make it the root. Recursively call the function on the left and right subarrays, assigning the results to root->left and root->right respectively.

2. Write a program to find the angle between a clock's minute and hour hands.
>> assert(h >= 0 && m >=0)
return ( abs( 0.5*((h%12)*60 + m%60) - 6*(m%60)));

3. Consider there are 2 types of characters:
1byte character - has value less than 128
2 byte character - whose first byte has value >= 128
Given a random pointer in a byte array find the previous character

I was asked to solve a simpler version of the problem first, i.e., I was given a pointer to the beginning of the array.
>> Scan from the beginning of the array up to the given pointer, keeping track of where the previous character starts. If the current byte's value is < 128, it is a 1-byte character: move ahead 1 byte.
If the value is >= 128, it is a 2-byte character: move ahead 2 bytes.

Let's assume I was not given the begin pointer:
>> Assuming an infinite Turing tape, there is at least one input on which such a backward scan can never decide the answer: consider a tape filled entirely with 128s. :)

4. Given a string, reverse first n characters in each word of the string.
Eg: if string is "abcd ef"
and n =2 => o/p: "badc fe"
and if n =3 => o/p: "cbad ef" (if fewer than n characters remain, do not reverse them)

5. Some probability-based questions: 3 types of balls in a bag with equal probability, and some "at least"/"at most" style questions. I am sure almost everyone has solved these types of questions in their undergrad.

6. Problem with Concurrent Reader, Exclusive Writer (CREW):
A writer may wait forever.
How to solve it?
Assign reader_priority = writer_priority = normal.
When the number of readers exceeds a threshold, decrement reader_priority.
When the writer gets to write and reader_priority < normal, increment reader_priority.

While reader_priority < normal, a newly arriving reader will wait if there is a writer waiting in the queue.

My suggestions:
Be confident. Revise Data Structures and Algorithms. Write few programs for practice.
Also talk to interviewer, clarify assumptions and state your algorithm before starting to write on the white board. Always check for boundary conditions (this is where most candidates get filtered).

Wednesday, September 19, 2007

Mixing DOS and Unix

I wrote a Java program using Notepad on Windows, and when I opened it in the vi editor on a Unix machine, I got a ^M character at the end of each line.

Eg:
import java.io.*;^M
^M
/**^M
*^M
*This class is used to manage the o/p stream of the application.^M
* ^M
*@author Niketan Pansare^M
*@version 1.0 ^M
* ^M
*/^M
public class OutputManager^M
{^M
OutputStream outputStream; ^M
^M
/**^M
* This constructor uses standard i/o as default stream.^M
*/^M


and so on ...

You can remove all the extra ^M characters in the vi editor as follows:
1. Go in command mode (Press Esc - If you already are in command mode, you will hear a beep)
2. Then type
:%s/^M$//g
Don't copy and paste the line above. To enter the ^M, press CTRL+V followed by CTRL+M (i.e., ^V ^M).

What does the above command mean?
For substitution you use following command
:[range]s/[pattern]/[string]/[options]

's' here means substitute a pattern with a string.

range can be
{number} an absolute line number
. the current line
$ the last line in the file
% equal to 1,$ (the entire file)

You can also use # instead of / as the separator.

Technically, a pattern is defined as follows (adapted from the Vim help on search patterns):

Patterns may contain special characters, depending on the setting of the 'magic' option.
1. A pattern is one or more branches, separated by "\|". It matches anything
that matches one of the branches. Example: "foo\|beep" matches "foo" and
"beep".

2. A branch is one or more pieces, concatenated. It matches a match for the
first, followed by a match for the second, etc. Example: "foo[0-9]beep",
first match "foo", then a digit and then "beep".

3. A piece is an atom, possibly followed by a quantifier (magic form shown first, nomagic form second):
* \* matches 0 or more of the preceding atom
\+ \+ matches 1 or more of the preceding atom {not in Vi}
\= \= matches 0 or 1 of the preceding atom {not in Vi}

Examples:
.* .\* matches anything, including the empty string
^.\+$ ^.\+$ matches any non-empty line
foo\= foo\= matches "fo" and "foo"


4. An atom can be:
- One of these five (magic form shown; prefix with a backslash when 'nomagic'):
^ at the beginning of a pattern, matches the start of a line
$ at the end of a pattern or in front of "\|", matches the end of a line
. matches any single character
\< matches the beginning of a word
\> matches the end of a word
- A character class {not in Vi}:
\i matches any identifier character (see the 'isident' option)
\I like "\i", but excluding digits
\k matches any keyword character (see the 'iskeyword' option)
\K like "\k", but excluding digits
\f matches any file name character (see the 'isfname' option)
\F like "\f", but excluding digits
\p matches any printable character (see the 'isprint' option)
\P like "\p", but excluding digits
- An escaped control character:
\e matches <Esc>
\t matches <Tab>
\r matches <CR>
\b matches <BS>
- Other atoms:
~ matches the last given substitute string
\(\) a pattern enclosed by escaped parentheses (e.g., "\(^a\)") matches that pattern
x a single character with no special meaning matches itself
\x a backslash followed by a single character with no special meaning matches that single character
[] a range: a sequence of characters enclosed in "[]" (or "\[]" when 'nomagic'). It matches any single character from the sequence. If the sequence begins with "^", it matches any single character NOT in the sequence. If two characters in the sequence are separated by '-', this is shorthand for the full list of ASCII characters between them; e.g., "[0-9]" matches any decimal digit. To include a literal "]" in the sequence, make it the first character (following a possible "^"), e.g., "[]xyz]" or "[^]xyz]". To include a literal '-', make it the first or last character.

If the 'ignorecase' option is on, the case of letters is ignored.

It is impossible to have a pattern that contains a line break.

Examples:
^beep(  Probably the start of the C function "beep".
[a-zA-Z]$  Any alphabetic character at the end of a line.
\<\I\i*  or  \(^\|[^a-zA-Z0-9_]\)[a-zA-Z_]\+[a-zA-Z0-9_]*  A C identifier (will stop in front of it).
\(\.$\|\. \)  A period followed by end-of-line or a space. Note that "\(\. \|\.$\)" does not do the same, because '$' is not end-of-line in front of '\)'. This was done to remain Vi-compatible.
[.!?][])"']*\($\|[ ]\)  A search pattern that finds the end of a sentence, with almost the same definition as the ")" command.

Technical detail: <Nul> characters in the file are stored as <NL> in memory. In the display they are shown as "^@". The translation is done when reading and writing files. To match a <Nul> with a search pattern you can just enter CTRL-@ or "CTRL-V 000". This is probably just what you expect. Internally the character is replaced with a <NL> in the search pattern. What is unusual is that typing CTRL-V CTRL-J also inserts a <NL>, and thus also searches for a <Nul> in the file. {Vi cannot handle <Nul> characters in the file at all}


Seems complex? Let us stick to our example for the time being.

$ means end of line.
Therefore, ^M$ matches a ^M character at the end of a line.
It is replaced by nothing (since in our example, the replacement string is empty).

g at the end means "global", i.e., replace every match on the line, not just the first.