Niketan Pansare's blog

Installing Apache SystemML on the IBM Cloud

2018-04-13T15:57:00.004-04:00

To use the latest released version of Apache SystemML on the IBM Cloud, please install it using the following commands in the jupyter shell:

1. Install the latest version of SystemML (for example: 1.2.0):

!pip install systemml

2. Please execute the following commands in the notebook after installation:

!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0-extra.jar ~/user-libs/spark2/systemml-1.2.0-extra.jar
!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0.jar ~/user-libs/spark2/systemml-1.2.0.jar

3. Then restart the kernel.

4. Check if the correct version was installed:

!pip show systemml

5. (Important) Finally, check if the correct version is being picked up:

from systemml import MLContext
ml = MLContext(spark)
ml.version()

Notes from the Micro-MBA course (Part 1)

2018-01-16T17:51:00.001-05:00

In my previous blogposts 1 2 3, I summarized my notes from the entrepreneurship course offered by the Rice Center for Engineering Leadership. This blogpost extends them to larger organizations, based on my notes from the Micro-MBA class at IBM Almaden Research Center. As a disclaimer: this blogpost should be viewed as my personal, biased opinions based on a limited understanding of the subject, rather than any official position of IBM or the instructor. Also, I will skip many topics for brevity.

The Micro-MBA course can be divided into the following three sub-topics:

Economics
Finance
Accounting

Economics

The fundamental assumption of economics is of scarcity: People have unlimited wants but limited resources, namely:

Labor (human time and work)
Natural resources (raw materials)
Capital (man-made materials needed for the production of other goods, such as buildings and machinery).

Hence, we have to choose how to use the limited resources to best satisfy our wants and needs. An economist studies the choices made in answering the following three fundamental questions:

WHAT (goods and services) to produce ?
HOW to produce them ?
WHO will receive the goods and services produced ?

WHO will receive the goods and services produced ?

The answer to the third question depends on the economic policies of the country. In a country with a centrally planned economy, such as Cuba and North Korea (and the Soviet Union from 1917 to 1991), the government decides how the goods and services produced are allocated. On the other hand, in a market economy, the individuals/households/firms who are most willing and able to buy the goods and services receive them (i.e. based on the principle of supply and demand). Note that most countries today follow a mixed economy, in which most economic decisions result from the interactions of buyers and sellers in the market, but the government plays a significant role in the allocation of resources [1].

	Centrally planned economy	Market economy
Productive efficiency (produced at lowest cost)		Better
Allocative efficiency (reflects consumer preferences)		Better
Equity (fair distribution of goods and services)	Better
Freedom	Better for freedom from uncertainty	Better for freedom of choice
Main proponent	Karl Marx, who posited in Das Kapital that market competition exploits labor and ultimately fosters monopolies.	Adam Smith, who claimed that competition and the self-interested pursuit of profit cause firms and individuals to behave in ways that are ultimately socially beneficial.

WHAT (goods and services) to produce and HOW to produce them ?

As described above, in a market economy, the self-interested pursuit of profit motivates people to decide what goods and services to produce. The profit is high if the consumer is willing to buy the goods and services for a higher price than the cost of producing them. Every economic decision that a consumer makes takes account of utility (the amount of benefit to the consumer) and comes with an opportunity cost [3]. Opportunity cost is the benefit, profit, or value of something that must be given up in order to have something else.

Opportunity cost

Let's understand opportunity cost by examining the problem of whether I should have done a PhD after my Masters. After my Masters, I had two options:

Accept the job offer for a Software Development Engineer (SDE) position in the Microsoft SQL Server Data Mining team. The salary of this position is $ x per year.
Do a 5-year PhD program at Rice University. Rice University gives PhD students a stipend of $ y per year. After the PhD, I got an offer from IBM Research with a salary of $ x + z per year.

Assuming the compensation at Rice, Microsoft and IBM remains constant, the opportunity cost associated with doing a PhD, n years after the Masters (n ≥ 5), is $ x*n - (x + z)*(n - 5) - 5y. The below figure (generated by crude approximation) shows that it makes sense to do a PhD if n > 10.5, where the opportunity cost becomes 0.

The above example ignores the rate of increase of compensation of a Masters vs PhD student and also the intangible costs such as experience at Microsoft, potential cutting-edge role at IBM, guidance under Chris, etc.
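
The opportunity-cost formula above can also be solved directly for the break-even point: setting x*n - (x + z)*(n - 5) - 5y to zero gives n = 5*(x + z - y)/z. Here is a minimal Python sketch of that calculation. The function names and the salary figures are hypothetical and chosen only for illustration; they are not the numbers behind the figure.

```python
def opportunity_cost(n, x, y, z, phd_years=5):
    """Earnings given up by choosing the PhD path, n years after the Masters.

    Implements x*n - (x + z)*(n - phd_years) - phd_years*y (valid for n >= phd_years).
    """
    if n < phd_years:
        raise ValueError("the formula only applies once the PhD is finished")
    job_path = x * n                                      # salary x for all n years
    phd_path = phd_years * y + (x + z) * (n - phd_years)  # stipend, then salary x + z
    return job_path - phd_path


def break_even_years(x, y, z, phd_years=5):
    """Solve opportunity_cost(n) == 0 for n, i.e. n = phd_years * (x + z - y) / z."""
    if z <= 0:
        raise ValueError("without a salary premium z > 0 the PhD path never catches up")
    return phd_years * (x + z - y) / z


# Hypothetical figures, for illustration only.
x, y, z = 100_000, 25_000, 60_000
n_star = break_even_years(x, y, z)
print(n_star)                             # 11.25
print(opportunity_cost(n_star, x, y, z))  # 0.0
```

Note that the break-even point depends only on the ratio (x + z - y)/z: the larger the salary premium z relative to the income given up during the PhD, the sooner the PhD pays off.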

Specialization and Voluntary Trade

In the book The Wealth of Nations, Adam Smith highlights two key points:

Workers produce more when they occupy specialized roles, so businesses can offer higher quality products at lower prices.
Trade benefits both buyer and seller (i.e. lower opportunity cost for both of them). Note that this is true only if both the buyer and the seller engage in the trade voluntarily and both have complete information. Counter-examples to voluntary exchange are slavery, human trafficking and fraud, which is where government regulations are necessary.

To summarize, a corporation should attempt to:
Step 1: specialize such that it is the lowest cost provider
Step 2: apply opportunity cost in decision making, risk assessment (risk vs return) and pricing analysis (for example: if supply is high, the price is $1 less than that of the second cheapest seller, and if supply is low, the price is $1 more than the offer of the second highest bidder/buyer).
Step 3: trade where the two parties' opportunity costs differ, so that both are better off (i.e. the producer sells to make a profit and the consumer buys because it would be more expensive to make the product themselves).

Monetary system

Except in countries like Germany, South Korea, Switzerland, Norway, etc., the government spends more than its revenues. The difference is called deficit spending, budget deficit or simply deficit.

To pay for the deficit, the Treasury borrows money from the banks by issuing them treasury bonds. Treasury bonds pay interest and carry a promise by the government to repay the principal, and hence are considered among the safest investments. However, in recent times, countries like Ukraine, Argentina, Greece, Zimbabwe and Russia have either defaulted or had to restructure their debts (see Wikipedia for the exhaustive list). The big three rating agencies - Standard & Poor's, Moody's and Fitch - rate countries on their ability to pay interest on, refinance and repay their debts.

Then, through Open Market Operations (or OMO), the banks get to sell the treasury bonds (or mortgage-backed securities) to the Federal Reserve (or Fed for short) at a profit. An interesting point to note here: the banks give the Fed the treasury bonds and the Fed adds "credit" to the bank reserves. As the country's central bank, the Fed has the unique power to create credit out of thin air without having money to back the credit up ... often referred to as "printing money".

Other than buying treasury bonds, the Fed has another tool at its disposal: the reserve requirement (or cash reserve ratio or liquidity ratio), which refers to the percentage of deposits that a commercial bank must keep as cash at the Fed or in its vault. If a bank makes too many loans or if there is a sudden demand for withdrawals, then the amount of money in the bank's vault falls below the reserve requirement. In this case, the bank first attempts to borrow money from other banks at the fed funds rate and then the remaining amount from the Fed at the discount rate. An equivalent interbank lending rate to the fed funds rate in the UK is LIBOR - the London Interbank Offered Rate (set by the British Bankers' Association).

In practice, there is no single fed funds rate; instead, it is the weighted average of the interest rates banks charge each other overnight to meet the reserve requirements. The Federal Open Market Committee (FOMC) meets eight times a year to set the target fed funds rate and then, by buying/selling treasury bonds, it attempts to achieve that target. The fed funds rate affects the prime rate (for setting home equity lines of credit, auto loans, personal loans and credit card rates) and the 11th District cost of funds index (for setting adjustable-rate mortgages).

	Jan 2018	Dec 2017	Jan 2017
Prime rate reported by the Wall Street Journal's bank survey	4.5	4.25	3.75
Discount rate	2	1.75	1.25
Fed funds rate	1.5	1.5	0.75
11th District cost of funds index	0.746	0.737	0.603

The Treasury can then use the money it received by selling the bonds to fund the deficit. A bank can use the credit it received from the Fed and the money deposited by its customers, minus the reserve requirement, to make loans. This is called fractional reserve lending and it works as follows. Let's assume that the reserve requirement is 10% and X deposits $1000 in bank A. Bank A keeps $100 in its vault and lends the remaining $900 to Y, who needs it to fund her business. Y has an account in bank B and deposits the $900 in her account, which she intends to withdraw as and when needed. Bank B keeps $90 in its vault and lends the remaining $810 to C, who in turn deposits it in bank W. Now, when X, Y and C check their accounts in their respective banks, they will see that they have $1000, $900 and $810 respectively, that is, the initial $1000 has grown to $2710 in the economy. This system fails if there is a mass panic and X, Y and C all decide to withdraw their money at once (commonly referred to as a bank run).
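
The deposit chain in this example is easy to reproduce in a few lines of Python. The sketch below assumes that every loan is re-deposited in full at the next bank and that each bank holds back exactly the required reserve. The function name and its parameters are hypothetical.

```python
def lending_chain(initial_deposit, reserve_percent, rounds):
    """Follow a deposit through `rounds` banks.

    Each bank keeps `reserve_percent` % of the deposit in its vault and lends
    out the rest, which is then deposited in full at the next bank.
    Returns the list of account balances created along the way.
    """
    balances = []
    amount = initial_deposit
    for _ in range(rounds):
        balances.append(amount)
        amount = amount * (100 - reserve_percent) / 100  # the part that is lent out
    return balances


balances = lending_chain(1000, 10, 3)
print(balances)       # [1000, 900.0, 810.0]
print(sum(balances))  # 2710.0
```

If the chain continued indefinitely, the balances would form a geometric series whose sum is the initial deposit divided by the reserve ratio, i.e. $1000 / 0.10 = $10,000 under these assumptions.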

Before we move ahead, it is a good idea to classify different types of banks:

Supranational banks:

The International Monetary Fund (IMF) ensures the stability of the international economic system and has 187 member countries.
The World Bank provides financial help to developing countries to combat poverty, malnutrition and the lack of communications and infrastructure.

Central banks, such as the Federal Reserve, European Central Bank, Bank of England, Reserve Bank of India (RBI), Swiss National Bank and The People's Bank of China (PBC), set the monetary policy described in the section below.
Commercial banks' business model is to charge a higher interest rate on loans than they pay depositors.
Investment banks provide three functions: a Merger and Acquisition (M&A) facility for corporations to raise debt or equity, Institutional Client Services, which includes Fixed Income, Commodities and Currencies (FICC), and Asset/Wealth Management for high net worth clients.

Government intervention

The government keeps the economy growing (close to the long-run trend rate of 2.5%), limits unemployment and keeps inflation low (inflation target of 2%) through fiscal and monetary policies:

Fiscal policy: is set by the US Congress, the President and the Treasury Secretary, which involves changing government spending and taxation.
Monetary policy: is set by the central bank (i.e. Federal Reserve in US), which involves influencing the demand and supply of money, primarily through the use of interest rates.

Inflation refers to a general increase in the price of goods and services and is caused by an increase in the supply of money. Though it is hard to define/measure precisely, it is commonly measured by the consumer price index (or CPI), which is the weighted average change in the prices of consumer goods and services. Governments love inflation, but only in the right amount (~2%), as it makes debt cheaper (as the dollar becomes cheaper), increases tax revenue (as prices increase) and increases exports. On the other hand, fixed-income retirees and importers (such as oil importers) hate inflation.

A country is in recession when two successive quarters (i.e. six months) show a decrease in real GDP. During a recession, we usually get a fall in the inflation rate. Deflation, however, is when we get a negative inflation rate, i.e. falling prices. Since the Second World War, recessions have not led to deflation, just to a lower inflation rate. A depression is a severe recession.

			Fiscal Policy		Monetary Policy
		Set by:	US Congress, President and Treasury Secretary		Federal Reserve
		Measured by	Tax	Spending	Interest Rates	Treasury Bonds	Reserve requirements
Inflation	Prices ↑	Consumer Price Index (CPI)	↑	↓	↑	Sell	Increase
Recession	Low output and unemployment	GDP/GNP, Unemployment rate	↓	↑	↓	Buy	Decrease

For example: to reduce inflationary pressures,

a fiscal policy will involve higher taxes and lower spending. The advantage of using fiscal policy is that it will help to reduce the budget deficit, but is often difficult to implement due to political reasons.
a monetary policy (in this case termed a tight monetary policy) will attempt to reduce the supply of money by raising the interest rates and the borrowing rates among banks, by selling government bonds (also called "open market operations" or OMO) and by increasing the minimum reserve that commercial banks must keep with the central bank.

Extreme situations:

Buying too many treasury bonds: During an extreme recession, or to avoid deflation, the Fed buys lots of government debt so as to inject more money into the economy and encourage growth. This is called Quantitative Easing and was pioneered by the Bank of Japan in the 1990s. In the 2009 financial crisis, the Fed and the Bank of England purchased over $2 trillion and $520 billion of government debt respectively.
Comprehensive Financial Reform Acts:

After the 1929 stock market crash, the Glass-Steagall Act was enacted, which separated investment banking from commercial/retail banking to prevent bank runs. This act was repealed in 1999 by the Gramm-Leach-Bliley Act, allowing the banks to invest depositors' funds in unregulated derivatives. This led to banks becoming too big to fail and requiring a bailout in 2008-2009 to avoid another depression.
After the 2008-2009 housing crisis, the Dodd-Frank Act was passed, which increased regulation of previously unregulated derivatives trading and created the Consumer Financial Protection Bureau, which took action against predatory student lending practices and payday loan scams. Many of these regulations might be rolled back if the Financial CHOICE Act is enacted.

Stagflation (= Inflation + Recession at the same time) is a condition of slow economic growth and relatively high unemployment accompanied by a rise in prices, or inflation. It occurs due to a supply shock (such as a rapid increase in the price of oil) and due to policies that harm industry while growing the money supply too quickly. Venezuela is an example of stagflation, where if you fix inflation, you create a depression, and if you fix the recession, you create hyperinflation.
When the inflation rate exceeds ~50% per month, it is referred to as hyperinflation. Germany after the Treaty of Versailles (322% inflation rate), Hungary in 1945 (19,000% inflation rate) and Zimbabwe in 2008 (230 million % inflation rate) are some examples of hyperinflation.

Economists track economic growth, unemployment and inflation using the following three measurements:

1. Gross Domestic Product (GDP): is the value of all the finished goods and services (hence product) produced within a country's border (hence domestic) in a specific time period, usually a year.

- Since it is "gross", the depreciation of capital assets is not deducted. If it were deducted, it would become "net".

- Since it is "domestic", it does not take into account the country's earning outside its geographical boundaries, or foreign remittances. Gross National Product (GNP) is GDP + income earned by residents from overseas investments - income earned within the domestic economy by overseas residents.

- Since it is "product", the intermediate goods are not taken into account. For example, wheat sold for final consumption to consumers will be included into GDP, but not the amount of wheat sold to bakeries for the production of bread.

- GDP is commonly used as an indicator of the economic health of a country and can be computed in three principal ways: from total output, from income and from expenditure.

2. Unemployment rate: is the number of unemployed people divided by the number of people in the labor force, times 100. Note that it defines unemployed people as those who are willing and available to work, and who have actively sought work within the past four weeks. Hence, little kids don't count, and neither do people who are unable to work due to their age or disability or who choose not to work. The below figure shows the unemployment rate per country in 2013:

3. Inflation rate: measured by consumer price index (or CPI).
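
The definitions above reduce to simple arithmetic. The following Python sketch spells them out. The function names are hypothetical, and the inflation rate is computed as the percentage change in CPI between two periods, which is the usual convention rather than something stated explicitly above.

```python
def gnp(gdp, income_from_abroad, income_to_overseas_residents):
    """GNP = GDP + income earned by residents overseas - income earned domestically by overseas residents."""
    return gdp + income_from_abroad - income_to_overseas_residents


def unemployment_rate(unemployed, labor_force):
    """Unemployed people as a percentage of the labor force."""
    return 100.0 * unemployed / labor_force


def inflation_rate(cpi_start, cpi_end):
    """Percentage change in the consumer price index between two periods."""
    return 100.0 * (cpi_end - cpi_start) / cpi_start


# Made-up numbers, for illustration only.
print(gnp(1000, 50, 30))          # 1020
print(unemployment_rate(5, 100))  # 5.0
print(inflation_rate(200, 204))   # 2.0
```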

Reference:

1. Economics - Hubbard and O'Brien.
2. https://www.investopedia.com/
3. The Book of Money - Conaghan and Smith.
4. Macroeconomics: Crash Course: https://www.youtube.com/watch?v=d8uTB5XorBw
5. The Monetary System Visually Explained: https://www.youtube.com/watch?v=23DNe0cJhcU
6. https://www.thebalance.com/open-market-operations-3306121
7. https://www.bankrate.com/rates/interest-rates/prime-rate.aspx
8. https://www.thebalance.com/dodd-frank-wall-street-reform-act-3305688

How to debug the unicode error in Py4J

2017-11-30T14:09:00.001-05:00

Many Java-based big data systems that use the Py4J bridge in their Python APIs, such as SystemML, Spark's MLlib and CaffeOnSpark, often throw a cryptic error instead of a detailed Java stack trace:

py4j.protocol.Py4JJavaError: < exception str() failed >

For more details on this issue, please refer to Python's unicode documentation.

Luckily, there is a very simple hack to extract the stack trace in such a situation. Let's assume that SystemML's MLContext is throwing the above Py4J error: ml.execute(script). In this case, simply wrap that code in the following try/except block:
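
A minimal sketch of such a wrapper is given here. It assumes that the caught error exposes the underlying Java throwable as a java_exception attribute (as Py4JJavaError does) and that the array returned by getStackTrace() can be iterated from Python. The helper names java_stack_trace and run_with_java_trace are hypothetical, and the sketch deliberately avoids importing py4j so that it stays self-contained.

```python
def java_stack_trace(error):
    """Build a readable Java stack trace from an error raised through Py4J.

    Assumes `error.java_exception` is the Java Throwable; any other
    error is simply returned as its repr().
    """
    java_exc = getattr(error, "java_exception", None)
    if java_exc is None:
        return repr(error)
    lines = [str(java_exc.toString())]
    for frame in java_exc.getStackTrace():  # one StackTraceElement per frame
        lines.append("\tat " + str(frame.toString()))
    return "\n".join(lines)


def run_with_java_trace(action):
    """Run `action()`; if it fails with a Java-side error, print the Java trace and re-raise."""
    try:
        return action()
    except Exception as error:
        if hasattr(error, "java_exception"):
            print(java_stack_trace(error))
        raise
```

With SystemML, the failing call would then be wrapped as run_with_java_trace(lambda: ml.execute(script)).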

Git notes for SystemML committers

2017-10-16T19:16:00.000-04:00

Replace niketanpansare below with your git username:

1. Fork and then Clone the repository

https://github.com/niketanpansare/systemml.git

2. Update the configuration file: systemml/.git/config

[core]
symlinks = false
repositoryformatversion = 0
filemode = false
logallrefupdates = true
[remote "origin"]
url = https://github.com/niketanpansare/systemml.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[remote "upstream"]
url = https://github.com/apache/systemml.git
fetch = +refs/heads/*:refs/remotes/upstream/*
[remote "apache"]
url = https://gitbox.apache.org/repos/asf/systemml.git
fetch = +refs/heads/*:refs/remotes/apache/*

The above configuration states that: "origin" points to your fork and "upstream" points to the github mirror of the main repository, which is referred to as "apache".

Optional but recommended: Update the ~/.gitconfig file:

[user]
name = Niketan Pansare
email = npansar@us.ibm.com
[alias]
systemml-pr = "!f() { git fetch ${2:-upstream} pull/$1/head:pr-$1 && git checkout pr-$1; }; f"

3. Now, let's say you created a branch 'foo' with 3 commits. You can create a PR via the GitHub website.

4. Once the PR is approved and if you are the committer, you can merge the PR:

git checkout foo
git reset --soft HEAD~3 && git commit
git pull --rebase apache master
git checkout master
git pull apache master
git merge foo
gitk
git log apache/master..
git commit --amend --date="$(date +%s)"
git push apache master

Let's discuss the above commands in more detail:
- git checkout foo: ensures that you are working on the right branch.
- git reset --soft HEAD~3 && git commit: squashes all your commits into one single commit in preparation for the merge.
- git pull --rebase apache master: rebases your branch foo on top of the latest apache master.
- git checkout master: switches to the local master branch.
- git pull apache master: brings the local master branch up to date with apache.
- git merge foo: merges the branch foo into the local master.
- gitk: lets you double-check visually that the local and remote branches are in sync and that the commit does not create weird loops.

- git log apache/master..: double-check if the log makes sense and if you have added "Closes #PR." at the end of the log and correct JIRA number as the header.
- git commit --amend --date="$(date +%s)": if not, allows you to modify your log message.
- git push apache master: go ahead and push your commit to apache :)

5. If you are a committer and want to merge a PR (let's assume PR 100), you need to create a local branch using the following command:

git systemml-pr 100
git checkout pr-100

Also, if you had to squash the commits as described above, please use the below command to amend the author:

git commit --amend --author="FirstName LastName <contributor@blah.com>"

6. If you have modified the docs/ folder, you can update the documentation:

git subtree push --prefix docs apache gh-pages

Before updating the documentation, you can sanity-test it on your fork with:

git subtree push --prefix docs origin gh-pages

If the above command fails, you can delete the gh-pages branch and re-execute the command:

git push origin --delete gh-pages

Advanced commands:
1. Cherry-picking:

git log --pretty=format:"%h - %an, %ar : %s" | grep Niketan
git checkout gpu
git cherry-pick

Notes from other committers:
Deron: https://gist.github.com/deroneriksson/e0d6d0634f3388f0df5e#integrate-pull-request-with-one-commit-into-apache-master
Mike: https://gist.github.com/dusenberrymw/78eb31b101c1b1b236e5

Using Google's TensorFlow for Kaggle competition

2015-11-13T15:26:00.000-05:00

Recently, the Google Brain team released their neural network library 'TensorFlow'. Since Google has a state-of-the-art Deep Learning system, I wanted to explore TensorFlow by trying it out for my first Kaggle submission (Digit Recognition). After spending a few hours getting to know their jargon/APIs, I modified their multilayer convolutional neural network and came 140th (with 98.4% accuracy) ... not too shabby for my first submission :D.

If you are interested in outscoring me, apply cross-validation to the below Python code or maybe consider using ensembles:

import numpy
import pandas
from input_data import *  # provides DataSet, read_data_sets and dense_to_one_hot

class DataSets(object):
 pass

mnist = DataSets()
df = pandas.read_csv('train.csv')

train_images = numpy.multiply(df.drop('label', axis=1).values, 1.0 / 255.0)
train_labels = dense_to_one_hot(df['label'].values)

# Add MNIST data from Yann LeCun's website for better accuracy. We hold out its test set, just to measure accuracy, but could have easily added it :)

mnist2 = read_data_sets("/tmp/data/", one_hot=True)
train_images = numpy.concatenate((train_images, mnist2.train._images), axis=0)
train_labels = numpy.concatenate((train_labels, mnist2.train._labels), axis=0)

VALIDATION_SIZE = 5000
validation_images = train_images[:VALIDATION_SIZE]
validation_labels = train_labels[:VALIDATION_SIZE]
train_images = train_images[VALIDATION_SIZE:]
train_labels = train_labels[VALIDATION_SIZE:]

mnist.train = DataSet([], [], fake_data=True)
mnist.train._images = train_images
mnist.train._labels = train_labels

mnist.validation = DataSet([], [], fake_data=True)
mnist.validation._images = validation_images
mnist.validation._labels = validation_labels

df1 = pandas.read_csv('test.csv')
test_images = numpy.multiply(df1.values, 1.0 / 255.0)
numTest = df1.shape[0]
test_labels = dense_to_one_hot(numpy.repeat([1], numTest))
mnist.test = DataSet([], [], fake_data=True)
mnist.test._images = test_images
mnist.test._labels = test_labels

import tensorflow as tf
sess = tf.InteractiveSession()
x = tf.placeholder("float", [None, 784])   # x is input features
W = tf.Variable(tf.zeros([784,10]))           # weights 
b = tf.Variable(tf.zeros([10]))                    # bias
y_ = tf.placeholder("float", [None,10])   # y' is input labels
#Weight Initialization
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)
  
def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)
  
# Convolution and Pooling
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
  
# First Convolutional Layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Second Convolutional Layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Densely Connected Layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# Dropout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout Layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
tf.initialize_all_variables().run()
for i in range(20000):
 batch = mnist.train.next_batch(50)
 train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print("test accuracy %g" % accuracy.eval(feed_dict={x: mnist2.test.images, y_: mnist2.test.labels, keep_prob: 1.0}))
prediction = tf.argmax(y,1).eval(feed_dict={x: mnist.test.images, keep_prob: 1.0})

# Write one predicted digit per line
with open('prediction.txt', 'w') as f:
    f.write('\n'.join(str(p) for p in prediction))

The above script assumes that you have downloaded train.csv and test.csv from Kaggle's website, input_data.py from TensorFlow's website and installed pandas/TensorFlow.
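
As a starting point for the cross-validation suggestion above, here is a minimal, library-free sketch of how the training examples could be split into k folds. The helper k_fold_indices is hypothetical and is not part of TensorFlow or input_data.py. Each returned index list would be used to slice train_images and train_labels before training one model per fold.

```python
import random


def k_fold_indices(num_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    The example indices are shuffled once and dealt into k folds; every
    example lands in exactly one validation fold.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# With 10 examples and 5 folds, each split has 8 training and 2 validation examples.
for train_idx, val_idx in k_fold_indices(num_examples=10, k=5):
    print(len(train_idx), len(val_idx))
```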

Here are a few observations based on my experience playing with TensorFlow:

TensorFlow does not have an optimizer:

TensorFlow statically maps a high-level expression (for example "matmul") to a predefined low-level operator (for example: matmul_op.h) based on whether you are using the CPU- or GPU-enabled TensorFlow. On the other hand, SystemML compiles a matrix multiplication expression (X %*% y) into one of many matrix-multiplication related physical operators using a sophisticated optimizer that adapts to the underlying data and cluster characteristics.
Other popular open-source neural network libraries are Caffe, Theano and Torch.

TensorFlow is a parallel, but not a distributed system:

It does parallelize its computation (across CPU cores and also across GPUs):
Google has released only the single-node version and kept the distributed version in-house. This means that the open-sourced version does not have a parameter server.

TensorFlow is easy to use, but difficult to debug:

I like TensorFlow's Python API and if the script/data/parameters are all correct, it works absolutely fine :)
But, if something fails, the error messages thrown by TensorFlow are difficult to decipher. This is because the error messages point to a generated physical operator (for example: tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input), not to the line of code in the Python program.

TensorFlow is slow to train and not yet robust enough:

Here are some initial numbers by Alex Smola comparing TensorFlow to other open-source deep learning systems:

Plotting as a useful debugging tool

2015-02-17T12:40:00.003-05:00

While writing the code for Bayesian modeling, you will have to test the distribution (prior, likelihood or posterior). Here are two common scenarios that you might encounter:

You want to test the function that generates random deviates. For example: you have derived a conjugate formula for a parameter of your model and want to test whether it is correct or not.
You want to test a probability density function. For example: likelihood or posterior function that you might want to run rejection sampler on.

In both these cases, you will start by making sure that the properties of the distribution (for example: range, mean, variance) are correct. For example: if the parameter you are sampling is a variance, then you will have “assert(returnedVariance > 0)” in your code. Then, the next obvious test should be visual inspection (trust me, it has helped me catch more bugs than traditional debugging techniques/tools would have). This means you will plot the distribution and see if the output of your code makes sense.
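
Before plotting, the property checks mentioned above can be automated. The following is a minimal Python sketch (the rest of this post uses R) that draws from a sampler and asserts that the sample mean and variance are close to the expected values. The helper check_sampler and its tolerance are hypothetical choices, not a standard API.

```python
import random
import statistics


def check_sampler(sampler, expected_mean, expected_variance, n=10000, tolerance=0.1):
    """Draw n values from `sampler()` and assert that basic properties hold."""
    draws = [sampler() for _ in range(n)]
    sample_mean = statistics.mean(draws)
    sample_variance = statistics.variance(draws)
    assert sample_variance > 0
    assert abs(sample_mean - expected_mean) < tolerance
    assert abs(sample_variance - expected_variance) < tolerance
    return sample_mean, sample_variance


# Standard normal sampler with a fixed seed, so that the check is reproducible.
rng = random.Random(42)
mean, variance = check_sampler(lambda: rng.gauss(0.0, 1.0), 0.0, 1.0)
```

Such checks catch gross errors (a wrong sign, a swapped parameter), but they say little about the shape of the distribution, which is where plotting comes in.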
We will start by simplifying the above two cases by assuming standard normal distribution. So, in the first case, we have access to “rnorm” function and in second case, we have access to “dnorm” function of R.
Case 1: In this case, we first collect random deviates (in “vals”) and then use ggplot to plot them:

library(ggplot2)
vals = rnorm(10000, mean=0, sd=1)
df = data.frame(xVals=vals)
ggplot(df, aes(x=xVals)) + geom_density()

The output of the above R script will look something like this:

Case 2: In this case, we have to assume a bounding box and sample inside that to get “x_vals” and “y_vals” (just like rejection sampling):

library(ggplot2)
x_vals=runif(10000,min=-4,max=4)
y_vals=dnorm(x_vals, mean=0, sd=1)
df = data.frame(xVals=x_vals, yVals=y_vals)
ggplot(df, aes(x=xVals, y=yVals)) + geom_line()

The output of the above R script will look something like this:

Just as a teaser to a post that I will post later, we can use the script somewhat similar to that of case 1 to study the characteristics of a distribution:

How to setup “passwordless ssh” on Amazon EC2 cluster

2015-02-17T12:30:00.003-05:00

Often, for running distributed applications, you may want to set up a new cluster or tweak an existing one (running on Amazon EC2) to support passwordless ssh. For creating a new cluster from scratch, there are a lot of cluster management tools (which are beyond the scope of this blogpost). However, if all you want to do is set up “passwordless ssh” between nodes, then this post might be worth your read.
The script below assumes that you have completed the following three steps:

Step 1. Created an RSA key pair on each of the machines:

cd ~
ssh-keygen -t rsa

Do not enter any passphrase; instead, just press [enter].

Step 2. Suppressed warning flags in ssh-config file:

sudo vim /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null

Step 3. Copied the key pair file “MyKeyPair.pem” to master’s home directory:

scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@ec2-master-public-address.compute-1.amazonaws.com:~

Assuming that above three steps have been completed, run this script on the master to enable passwordless ssh between master-slave and/or slave-slave nodes:



    #!/bin/bash
    # Author: Niketan R. Pansare
     
    # Password-less ssh
    # Make sure you have transferred your key-pair to master
    if [ ! -f ~/.ssh/id_rsa.pub ]; then
      echo "Expects ~/.ssh/id_rsa.pub to be created. Run ssh-keygen -t rsa from home directory"
      exit
    fi
     
    if [ ! -f ~/MyKeyPair.pem ]; then
      echo "For enabling password-less ssh, transfer MyKeyPair.pem to master's home folder (e.g.: scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@master_public_dns:~)"
      exit
    fi
     
    echo "Provide following ip-addresses"
    echo -n -e "${green}Public${endColor} dns address of master:"
    read MASTER_IP
    echo ""
     
    # Assumption here is that you want to create a small cluster ~ 10 nodes
    echo -n -e "${green}Public${endColor} dns addresses of slaves (separated by space):"
    read SLAVE_IPS
    echo "" 
     
    echo -n -e "Do you want to enable password-less ssh between ${green}master-slaves${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH" == "y" ]; then
      # Copy master's public key to itself
      #cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     
      for SLAVE_IP in $SLAVE_IPS
      do
        echo "Checking passwordless ssh between master -> "$SLAVE_IP
        IS_PASSWORD_LESS="n"
        ssh -o PasswordAuthentication=no $SLAVE_IP /bin/true
        # Check $? right after ssh; an assignment in between would reset it to 0
        if [ $? -eq 0 ]; then
          echo "Passwordless ssh has been setup between master -> "$SLAVE_IP
          echo "Now checking passwordless ssh between "$SLAVE_IP" -> master"
     
          ssh $SLAVE_IP 'ssh -o PasswordAuthentication=no' $MASTER_IP '/bin/true'
          if [ $? -eq 0 ]; then
            echo "Passwordless ssh has been setup between "$SLAVE_IP" -> master"
            IS_PASSWORD_LESS="y"
          fi
        fi
     
        if [ "$IS_PASSWORD_LESS" == "n" ]; then
          # ssh-copy-id gave me lot of issues, so will use below commands instead
          echo "Enabling passwordless ssh between master and "$SLAVE_IP
     
          # Copy master's public key to slave
          cat ~/.ssh/id_rsa.pub | ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'mkdir -p ~/.ssh ; cat >> ~/.ssh/authorized_keys' 
          # Copy slave's public key to master
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub' >> ~/.ssh/authorized_keys
          # Copy slave's public key to itself
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
     
        fi    
      done
     
      echo ""
      echo "---------------------------------------------"
      echo "Testing password-less ssh on master -> slave"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP  uname -a
      done
     
      echo ""
      echo "Testing password-less ssh on slave -> master"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP 'ssh ' $MASTER_IP 'uname -a'
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: It's quite possible that the slave does not contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit 1
      fi
    fi
     
    echo -n -e "Do you want to enable password-less ssh between ${green}slave-slave${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH1
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH1" == "y" ]; then
      if [ "$ENABLE_PASSWORDLESS_SSH" == "n" ]; then
        echo -n -e "In this part, the key assumption is that password-less ssh between ${green}master-slave${endColor} is enabled. Do you still want to continue (y/n):"
        read ANS1
        if [ "$ANS1" == "n" ]; then 
          exit
        fi
        echo ""
     
      fi
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          if [ "$SLAVE_IP1" != "$SLAVE_IP2" ]; then
            # Checking assumes passwordless ssh has already been setup between master and slaves
            echo "[Warning:] Skipping checking passwordless ssh between "$SLAVE_IP1" -> "$SLAVE_IP2
            IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
     
            # This will be true because ssh $SLAVE_IP1 is true
            #ssh $SLAVE_IP1 ssh -o PasswordAuthentication=no $SLAVE_IP2 /bin/true
            #if [ $? -eq 0 ]; then
     
     
            if [ "$IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET" == "n" ]; then
              echo "Enabling passwordless ssh between "$SLAVE_IP1" and "$SLAVE_IP2
              # Note you are on master now, which we assume to have 
              ssh -i ~/MyKeyPair.pem $SLAVE_IP1 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/MyKeyPair.pem $SLAVE_IP2 'cat >> ~/.ssh/authorized_keys' 
            else
              echo "Passwordless ssh has been setup between "$SLAVE_IP1" -> "$SLAVE_IP2
            fi
          fi
        done
     
      done
     
      echo "---------------------------------------------"
      echo "Testing password-less ssh on slave <-> slave"
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          # Also, test password-less ssh on the current slave machine
          ssh $SLAVE_IP1 'ssh ' $SLAVE_IP2 'uname -a'
        done
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: It's quite possible that the slave does not contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit 1
      fi
    fi

Here is a sample output obtained by running the above script (src code available via link):

ubuntu@ip-XXX:~$ ./enablePasswordlessSSH.sh
Provide following ip-addresses

Public dns address of master:ec2-XXX-29.compute-1.amazonaws.com

Public dns addresses of slaves (separated by space):ec2-XXX-191.compute-1.amazonaws.com ec2-XXX-240.compute-1.amazonaws.com ec2-XXX-215.compute-1.amazonaws.com ec2-XXX-192.compute-1.amazonaws.com ec2-XXX-197.compute-1.amazonaws.com

Do you want to enable password-less ssh between master-slaves (y/n):y

Checking passwordless ssh between master -> ec2-XXX-YYY.compute-1.amazonaws.com
bunch of warning and possibly few "Permission denied (publickey)." (Ignore this !!!)

---------------------------------------------
Testing password-less ssh on master -> slave
Don't ignore any error here !!!
---------------------------------------------
Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n

Do you want to enable password-less ssh between slave-slave (y/n):y

---------------------------------------------
Testing password-less ssh on slave <-> slave
Don't ignore any error here !!!
---------------------------------------------

Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n

Warning: The above code does not check for passwordless ssh in the slave-slave configuration and sets it up blindly, even if passwordless ssh is already enabled. This might not be a big deal if you call the script a few times, but it is not ideal if the cluster needs to be modified dynamically over and over again. To change this behavior, look at the line: IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
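Because the script appends keys with cat >>, every rerun adds another copy of the same public key to ~/.ssh/authorized_keys. One way to make that step safe to repeat is to append a key only when the exact line is not already there. This is a minimal sketch and not what the script above does; the helper name append_key_once is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): append a public key
# line to an authorized_keys file only if that exact line is not present yet.
append_key_once() {
  key_line="$1"
  auth_file="${2:-$HOME/.ssh/authorized_keys}"
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # -x matches whole lines, -F treats the key as a fixed string, -q prints nothing
  if ! grep -qxF -- "$key_line" "$auth_file"; then
    printf '%s\n' "$key_line" >> "$auth_file"
  fi
}
```

On the master you could call append_key_once "$(cat ~/.ssh/id_rsa.pub)". For the slaves, the same grep test would have to run on the remote side of the ssh command, since the function only exists where it is defined.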

Notes on entrepreneurship (Part 3)

2014-03-08T00:09:00.003-05:00

A few updates since the last time I wrote on this blog:
1. At the end of the entrepreneurship course, the website and an iOS app were implemented and I got an A grade ... yaay !!!

2. Since I was out of the country during the final presentation to the judges (VCs, startup founders, etc.), I created a YouTube video to motivate this project and my teammate explained the remaining logistics of our project [10].

3. Team Vigilant won 3rd place in the HackRice 2014 competition for developing an Android version along with an SVM classifier [3] to predict the threat level based on location and time of day. We used 5 years of crime data from the Houston PD, so yes, this feature only worked for Houston :(

4. I was lucky enough to get selected to participate in Ignite Conference 2014 [4]. To get the ball rolling on the first day, we were asked to build as tall a structure as possible in a given time frame, with a marshmallow on top, using noodle sticks, tape and thread [5]. The next day, we got to visit Khosla Ventures and Square (and a few other startups). On the last two days, we had talks from several guest entrepreneurs, who spoke about their entrepreneurship journeys. Here are a few of the lessons I learnt (directly/indirectly from this conference):
4.a What ideas/projects to work on?
- One of the most common pieces of advice you will receive is "fail fast, fail often and fail cheap". However, I suggest not taking it literally: the emphasis is not on failing but on learning. So, you should not go in with the mindset "if I am planning to fail early anyway, let me start with a dumb idea". In fact, the mindset should be "test early/cheaply; if you do fail, don't get hung up on it, instead learn quickly why you failed; and then adapt". Let me put this in a much broader context, in terms of one of the most important lessons I have learnt in research and entrepreneurship: though most people will judge you by your bank balance and your citations, you (as a researcher or entrepreneur) should try to evaluate yourself by your ability to keep failures small and useful.
- Don't hesitate or hold yourself back if you are passionate about an idea that appears to be a "black swan". But make sure you "know what you know and know what you don't know". Then, evaluate whether the things you don't know will halt the project; if so, find an expert who can help you with them.
- Before you think of quitting your job and starting a venture, make a list of the non-glamorous/routine startup tasks you will have to do. Though many of these might be solvable with capital, if you do intend to go the lean startup way, ask yourself: are you mentally flexible enough to get over your ego and do these tasks if need be?
4.b Building a team and network:
- Build a team that is smarter than you. And make sure you give back and focus on their growth too, else you will soon face a talent retention problem.
- Here are some of the key points for building and keeping great teams: minimize unnecessary bureaucracy [14], promote on merit, treat others with respect and, most importantly, when it comes to improving your startup, encourage everyone to pitch in their ideas rather than handing everyone a list of to-do items [15].
- Hire the best sales team you can find. This is especially true in internet technology, though it has scalability challenges. Why? People would much rather deal with a salesperson than an internet widget when they are paying huge money for a given service. This also means don't wait until your product testing phase to hire a sales team. Rather, start with an integrated sales-development team and use the sales part of the team to validate/refine the hypothesis by communicating with customers during the product development phase. Best case scenario, the founder is that person ... the one who first assumes the role of salesman, talks to clients/customers, figures out requirements, then by himself or by hiring a really good team develops the product, markets it, and then keeps on iterating the whole process again [8].
- Networking is extremely critical. It usually takes time, else it seems inauthentic/insincere. So, don't use the number of LinkedIn connections or visiting cards in your wallet as a measure of networking; instead count the number of people who genuinely want to help you and vice versa. Another way to put it: networking is not about how many people you are acquainted with but rather how many mentors, benefactors and friends you have.
4.c Managing yourself:
- It is extremely important to find ways to manage stress and take care of your health. Also, if your better half and family understand the challenges of startups and support/love you, the stress/difficulties reduce dramatically.
- Especially in the software development phase, use productivity rather than hours worked as the measure.
- Know the difference between persistence and stubbornness. Remember, persistence (without stubbornness) requires learning and pivoting given new information/lessons, rather than doing the same things over and over again expecting different results [6].
- When working in a team, your idea is only as good as your ability to explain it. If your team doesn't understand it, you can be almost sure that your customers won't understand it either. This skill will probably require an entirely new blogpost (which I might write later), but for now here are three things you can do:
a. Read books about presentation/explanation/UI design: The Design of Everyday Things, Don't Make Me Think, The Back of the Napkin, The Art of Explanation, Whiteboard Selling, slide:ology and 100 Things Every Designer Needs to Know.
b. Learn empathy (i.e. relating to team/developers/sales and especially customers).
c. Be specific. If you don't break up tasks into smaller items and clarify the purpose/big picture, people (either your team or an investor) might feel overwhelmed and can sometimes get defensive. One particular thing that every member of your team needs to be crystal clear about is the "minimum viable product".

Let me recap the high-level steps [16] described until now in this blog series:

Step 0: Define your means: "what I know", "what do I have" and "whom do I know".

Step 1: Ideation: See part 1 of this series for high-level tips.
Step 1.1: Being a little more specific about step 1: create/refine a 60-second elevator pitch. The key idea is that if you meet a key investor/evangelist/strategic partner in an elevator and he asks what you are working on, you should be able to summarize your business idea before the elevator reaches his floor and he has to leave. Here are a few suggestions for creating your elevator pitch:
- Sit down and write your elevator pitch without too much thinking. The first iteration will always be awful.
- Iterate, rehearse, iterate again and then rehearse again (and keep doing this) until your elevator pitch becomes second nature to you.
- Talk to your customers and experts in that field and try to incorporate their feedback.
- Be honest and don't promise "world peace". Remember, your words represent your thoughts/character and the investor is investing in you as much as he/she is investing in the idea.
- There is no one way to do this, so learn by example, i.e. do a YouTube/Google search for "elevator pitch" and listen to the results [7].
- Use an analogy to simplify the problem.
- If possible, explain how the problem affects you, your family or someone you know.
- The first and last sentences are extremely important, and connecting them helps to create a more effective delivery.
Here is one iteration of my elevator pitch:

Studies show that in the US alone, one out of four women gets sexually assaulted in her lifetime. This number is much higher when they travel to developing countries like India. Though this problem is highly complex, we believe that a fraction of it can be solved by technology-based deterrents that act as a first line of defense when in danger. We present one such solution: Vigilant. Vigilant is an affordable and hassle-free personal safety solution that connects you to friends, family and authorities during an emergency with just a click of a button. The goal of Vigilant is to make everyone's life safer with the help of smarter software and hardware.

Step 2: Opportunity evaluation:
Step 2.1: Use the high-level checklist described in part 1 of this series.
Step 2.2: Create a business model canvas: Before you read any further, you should definitely watch this 2 min video describing the business model canvas. Also, read the book Business Model Generation to understand why and how to create a business model canvas. If you don't have time to read through the book, here are a few websites that let you create one with step-by-step instructions: zoomstra, launchpad central, turbostart, wordpress plugin, collaboration website. You can also go through the following lectures/tutorials to learn how to create one: Steve Blank's udacity class, Alex's blog.
Step 2.3: Early customer validation/requirement gathering using one or more of the feedback/analytics tools (some are applicable only at later stages): These tools help you quantify/find out what people think they want, how much they think they want to pay and how much they will really pay.
- False buy page or dummy (but fully functional) e-commerce page for buying clickers [1].
- Contact form plugin in wordpress.
- Feedback button in app.
- Site Traffic widget/plugin in wordpress.
- Facebook likes.
- Google forms for suggestions from early adopters: Here are a few suggestions for coming up with good questions (ones that I didn't know earlier): Kevin's blog, Stanford's videos for customer discovery, Lean startup customer development templates, Alex's blog, Steve's suggestions, Kaushik's suggestions.
- Other tools that you can use for market research are Google's customer survey tool, Amazon's mechanical turk, Qualtrics.
- Pamphlet with google url shortener[2] and QR codes across campus.
- Other tools that allow you to do customer validation: validately, foundersuite.

Step 3: Build a great team !!!
- See the suggestions given above in points 4.b and 4.c.
- There are a few websites that help you find co-founders, such as FoundersDating and CoFoundersLab; you can also attend a meetup, a hackathon or a startup weekend.

Step 4 (or 5): Implementation, marketing, validation, refining business model canvas and keep on iterating.
This step requires the above-mentioned feedback/analytics tools as well as marketing/branding tools:
- Check out fiverr for really cool and cheap marketing ideas.
- Make sure you install an SEO plugin for your wordpress site. I am using this plugin, but there are equally good plugins in the market.
- Create a marketing message/keywords based on above feedback tools or google trends or google keyword tool. Then keep refining it with respect to location based on google insights. Similarly, there are several analytics tools that will help you in marketing/branding like Google Analytics, Amazon Analytics, Heap Analytics, AppAnnie, Distimo, etc.
- Advertise on Google, Facebook, Bing or LinkedIn. If you prefer traditional marketing tools, you can look at srds.
- Read Art of explanation and make a user-friendly video about your product. You can use one of the following tools to create the video: Sparkol, Explainify, LooseKeys, Amazon's Elastic Transcoder, Camtasia.
For an exhaustive list of tools, refer to Steve Blank's list, YCombinator's list, Startup Weekend's list, PBWork's list, HBS list.

Step 5 (or 4): Funding your startup:
Step 5.1: Research which type of investor you want to approach: angel investors [12], venture capitalists, crowdfunding or super angels. Also, be perfectly clear about the amount of funding you need and the valuation of your company, as this will determine the amount of equity (or other form of compensation) you will have to give to the investor. There are other parameters to consider as well, like how much involvement you want from the investor. Here is a small figure to understand the startup funding lifecycle [11]:

Step 5.2: Now research the specific investors you want to target and classify them as either "ideal/ambitious" or "can live with/moderate". Start with the investor's website and understand their funding philosophy. Then look into the ventures they have funded and see how they are doing [13].
Step 5.3: Prepare your pitch deck:
- See tips from my previous blogpost about presentation.
- Here is a really good 2-min video about pitching from Guy Kawasaki. Note the 10 slides at 10min 4 seconds.
Step 5.4: Remember there are other things more important than just capital for your startup. Ask for advice and hopefully gain a mentor or a board member. Sometimes they can also refer you to another funding agency.

Since this is often the most confusing aspect of a startup for a techie, let me try my best to explain and emphasize the difference between validation and planning [8, 9]. Usually, planning involves deep thinking so as to perform forecasting, whereas validation starts with brainstorming a hypothesis and then validating it through a rapid customer feedback loop. Planning, which is associated with a waterfall-like model, starts with defining a specific goal and then coming up with strategies and detailed steps to move towards that goal. Validation, on the other hand, is often associated with the agile development/effectuation/lean startup methodology and involves pivoting the goal based on customer/market feedback. Planning assumes that you know precisely what the customer requirements are beforehand; validation doesn't. Since customer requirements are known, planning is suitable for large, established companies where the emphasis is on "execution" of those requirements. On the other hand, since precise customer requirements are unknown in startups, the emphasis is on "searching" for the requirements through validation of hypotheses [8]. This is why many entrepreneurs prefer a short business model canvas (which is a hypothesis generating and validating tool) over more comprehensive business models.

References:
[1] Implementing e-commerce page on the wordpress took me no more than 30 minutes using the plugin WooCommerce. It allows you to track orders, generate reports, and even manage coupons.
[2] Google's url shortener (and many more similar free services) helps you track the traffic through the given link. Here is a useful hack: have multiple shortened urls for your webpage and use different ones for pamphlets in different geographical areas. Then use the analytics to run a much more targeted ad campaign :)
[3] For ML geeks, I must confess that the model suffers from selection bias, as it only contains data for when a crime occurred, not for when a crime didn't occur. Since the hackathon rules give you a single day to finish the project, most of which was spent on data gathering/cleaning, I had no time to work on this :(
[4] One thing that surprised me was that even though this conference was in Silicon Valley, I was the only Computer Science graduate student there, all others were either MBAs or MDs or MD-PhDs or BioTech students.
[5] My team came second and built a structure that was 26 inches tall. The idea is to start with the part of the structure that holds the marshmallow and then keep building the lower structure. Also, it is important to reinforce the structure (building supporting vertical sticks and choosing multiple sticks for lower pillars) so that it does not collapse. Most of the teams that failed started by building as tall a structure as they could without the marshmallow and then tried to put the marshmallow on top, which caused the entire structure to collapse because of the weight. So the take-away point for entrepreneurs from this exercise is to do bottom-up incremental iterations/updates with customer involvement so as to mitigate risk. Here is the pic of our structure:

[6] From Mark Otero, CEO & Co-founder, Klicknation.
[7] Daniel Pink, in his book "To Sell Is Human", suggests the following template: "Once upon a time ___. Every day, ___. One day ___. Because of that ___. Because of that ___. Until finally ___".
[8] Read Steve Blank's blog for more detail.
[9] http://www.effectuation.org/learn/effectuation-101
[10] In the video, the clip where people have duct tape on their mouths is from another YouTube video about raising one's voice against sexual violence, i.e. it was not created by me, so please don't credit me if you want to cite it. I used the clip because it resonates with the motivation of my project :)
[11] http://en.wikipedia.org/wiki/Seed_round
[12] It might also be worthwhile to look into websites like angellist.
[13] Websites like CrunchBase and TechCrunch help here. Another useful website you should definitely visit is Glassdoor, to gauge the talent-retention ratio as well as the willingness of upper management to experiment with new ideas. Why? It can sometimes be a metric to evaluate the founders and, indirectly, the investors. There are a few ranking websites based on different criteria, one of them being the Forbes Midas List.
[14] For example: Rice Computer Science Graduate Student Association (CS GSA) is a flat organization and there are coordinators that have responsibilities/duties. Everyone can pitch in and help others, but when it comes to arbitration/point of contact/failure responsibility, the assigned coordinator is the one in charge. For people outside the organization, that person is the Overall coordinator; think of it as the President of the club, and for the current year, that's me (thank you very much) :)
[15] Sure, some ideas might not align with the vision of the company and may be dropped, but that has to be done in the form of a discussion rather than in an authoritative manner.
[16] Note, these are high-level steps, even though you might find the list of tools overwhelming. For detailed step-by-step instructions regarding startup process, please read one of several startup related books available on amazon or listen to startup lectures on udacity/coursera/youtube or better attend an incubator program near you.

Notes on entrepreneurship (Part 2)

2013-10-12T16:30:00.000-04:00

In the first class, we divided ourselves into random groups of four people (mostly from different backgrounds). The idea was to pitch a startup idea within 10 minutes based on the means of our group members. This exercise was then repeated with a different set of people. One thing I noticed after this exercise was that even though it seems like a nice idea to start with your means and build a startup based on them, people usually revert to the idea that they are absolutely passionate about, irrespective of their means [8].

As an assignment that week, we were supposed to take $5 and, in the span of two hours, make as much money as we could. My group decided to go with a "Grocery delivery service for professors", and each of us was supposed to ask professors in our own department to sign up. I got to speak to only a few professors in the Computer Science department (as it was Friday) and only Swarat was on board with the idea. My teammates were unable to get any professors from their respective departments, so they decided to go with another idea: "Personalized cards and their delivery". We made $10 in tips with the first idea and ~$13 with the second idea. The lesson from this exercise was that sales is hard, but probably the most important part of a startup [9].

In the next class, each of us gave an elevator pitch for a startup idea. Let me lay out my idea using the framework from the previous post:
1. Means:
- What do I know ? Background in software development, research experience in building large-scale systems and knowledge of embedded systems.
- What do I have ? Very low capital ... student salary :(
- Whom do I know ? Software professionals in India and the US (from my bachelors, masters and job experience), Marketing/Advertising professionals in India (my father's advertising firm and my MBA friends), Trustworthy partner in India (my best friend and brother Rohan), Research scientists (my advisor, my colleagues at Rice University [3], my collaborators, contacts from internships and also from my experience as President of Rice Computer Science Graduate Student Association). Other than a couple of exceptions, until now I have been lucky enough to be surrounded by really nice people, which is why I believe this to be my strongest means [4].

2. Ideation:
In the previous blogpost, I vented my frustration over sexual assault cases in India and offered high-level suggestions (over which, I must admit, I had no control). So, I decided to use my means to develop something that might help improve the situation (in whatever little way possible).
Unlike in the US, the commonly accepted safety net in India is not the government/police but the social structure (i.e. your friends and family). Many personal safety solutions like pepper spray, tasers or guns require state licenses in many countries and are even prohibited in a few countries like Canada, China, Bangladesh, Singapore, Belgium, Denmark, Finland, Greece, Hungary, etc. Not to mention that they are expensive and can often escalate the situation. This is why many people follow a self-imposed curfew, that is, either not leaving home after dark or having someone accompany them.
With the advent of smartphones, an obvious solution seems to be "have a personal safety app". Most of these apps have cool features such as sending your GPS location to your friends and family, but they have a major flaw that makes them non-functional: they require you to take out your phone, open the app and press an SOS button in it. Clearly, this is not what an average person would do in times of danger. Here is some of the feedback on such apps [1]:
- I think a "dangerous environment" is the last place I would want an expensive item like an iPhone prominently displayed.
- ... app (might not be) readily available to access, on screen #8 somewhere on the 2nd or third row ...
- Pulling out your iPhone in the face of an attacker? That's one sure way to escalate the situation. And he now knows you have a nice phone too.
Reality check: Not only am I passionate about this problem, but there also does not exist a solution that works satisfactorily (i.e. there are real "pain points") ... to me, the existing ones are just cosmetic apps, the kind you buy as a keepsake but that do not serve any purpose to society.

Before discussing the next point, let me take a small detour and tell you what metric I use to determine the success of this startup [2]: at the end of the course (or maybe a few months past that), I am able to make an app that works seamlessly in real-life situations and which I can recommend to my loved ones without any hesitation.

3. Opportunity evaluation:
Like many computer programmers, I have a habit of developing software that does exactly what I want but not what my audience needs. To ensure that this didn't happen, I sent out a survey asking people what they would like during times of emergency. The demographics of the participants were as follows:
- 53% males and 47% females
- 82% of participants were between 20-30 years old and 14% between 30-40 years old.
- 59% from US, 16% from India and 8% from China
- 24% had Bachelors degree, 47% had Masters degree, 27% were PhD students.
- 32% earned between $10K-50K, 28% earned between $50K-100K, 16% earned above $100K and 20% were not currently earning.

Here is the summary of responses:

The above figures show that the largest group of respondents (46%) wanted an external device like a bluetooth clicker that the victim can press when he/she is in danger. This was a good indicator that I should go ahead and spend some time building such an app.

The next step was to understand what features a user would want/need in times of emergency:

Since some of these features require backend services (for sending emails/SMS, managing account/app) which need capital (I am not rich enough to maintain this kind of service on my own), I asked how much people were willing to pay for this kind of service. It was clear that most people preferred a buy-once model:

Now, the checklist of opportunity evaluation:
a. Unique value proposition of my idea: "Affordable" and "hassle-free" way to connect to your friends, family and authorities in times of emergency ... with just a click of button.
b. Is it defensible: Nope, anyone with strong programming experience can replicate the features of my app. In fact, in the long run, that is exactly what I want ... a lot of good apps and competition that eventually help reduce the number of sexual assaults and make the world a little safer.
c. Is it profitable/sustainable: I really don't know :( ... It could be a product like dropbox, which people never thought they would want ... but once they got it, they can't imagine their life without it ... or it could be a total flop. The only way to be completely sure is by implementing it :)
d. Clearly defined customer: Young women traveling abroad or working late, senior citizens and frequent travelers.
e. Is it feasible and scalable: Yes, I used my experience in building large-scale cost-effective systems to build the backend. Also, my experience in C/C++ programming, knowledge of design patterns and user-friendly Apple documentation helped me: (a) learn the basics of Objective C in less than a week and (b) build the version 1.0 of the iOS app in less than a month.

Finally, I must reiterate the core principle of opportunity evaluation for a startup: It is not possible to know a priori whether an idea will turn out to be a good business or not. So, I decided to stick to the "affordable loss" principle and develop the app at as low a cost as I could. A few of the hacks I used to ensure "affordable loss" are as follows:
- Moving most of the computing to clients rather than server (so as not to buy overly expensive servers).
- Using pay-as-you-use services wherever I felt absolutely necessary (for example: Amazon web-services).
- Using GIMP to develop my own logo (which I had to learn btw :P) rather than hiring a designer [6].
- Using wordpress for the startup website rather than spending days perfecting the CSS to make it mobile-compatible or hiring a web-developer.
- Focusing on a minimum viable product (using Texas Instruments' SensorTag) rather than prematurely buying tons of bluetooth clickers from China [12].
- Using in-app purchases, rather than setting up a credit-card system on my website, to pay for features for which other services charge me (for example: SMS/Email) [10].
- Not running after patents early on in the venture [5].
- Buying a ready-made icon set rather than designing the icons myself [7].
- Choosing a hosting plan that has no hidden fees and that supports your choice of backend services.
- Using gmail as the support email (rather than one provided by hosting services) and adding a feedback button in the app as well as a contact form on the website (with some kind of captcha [13]). One tip I have for new developers is to try to minimize the number of clicks/typing in the app for sending feedback, for example: pre-fill the "to-" address as well as the "subject line".
- Utilizing the membership benefits of Apple/Google development program, i.e. off-loading testing [11], advertising/cross-promotion, expert feedback, reliable delivery of your software and version management.

References / footnotes:
[1] http://reviews.cnet.com/8301-19512_7-57452393-233/iwitness-app-claims-to-be-ultimate-deterrent-to-crime/
[2] Rather than use metrics like expected profit/revenue after first year or public offering or something on similar lines.
[3] I have already got 4 other PhD students from Rice University on board to develop the Android/Windows version of the app. This is in line with another principle of Effectual Entrepreneurship: Form partnerships.
[4] The intent of this statement is not flattery, just an observation. Here is why I think so: though I never deliberately tried to enforce it, my circle of influence consists of three types of people: those who are genuinely nice, those who are smarter than me, and non-self-destructive people. Of course, the categories are not mutually exclusive and in fact people who are smarter are usually genuinely nice (probably because they only focus on self-upliftment, not on pulling others down). Here is a layman example most people can relate to: even in a course with relative grading, a smart person knows that he/she is a student of a global class and is not threatened by his/her classmates' progress ... which is probably why you will rarely find such a student shying away from group discussion (so as to gain advantage) or deliberately spreading false information to sabotage others' grades. Maybe things are not so black-and-white in other fields, which is one of the reasons why I absolutely love research and development.
[5] A patent attorney who attended our class thought one of the features of our app is patentable, which might be true. But there are 2 problems with going that way: (a) the real cost of a patent is not in getting it, but in defending it, (b) it will limit the features that other programmers who are smarter than me can introduce in their personal safety apps (which goes against my success criteria for this particular problem). To be completely honest, there are ways around the low-capital issue for those startups who really want to get a patent: (1) file a provisional patent yourself for under $200 (just read about how to define the scope of your patent and also about court dates), (2) ask your parent organization or angel investor to file a patent for you in exchange for royalties or a stake in the startup (for example: Rice University's OTT office), (3) contact a patent troll to defend your patent.
[6] For people with a little more budget, there are websites like http://99designs.com/ and https://www.elance.com/ that allow you to hire freelance designers/developers. A similar website for building mock prototypes of your products to show to investors is http://www.protomold.com/.
[7] There is always a tradeoff between time and money, you just have to figure out what is your exchange rate for time ;) ... for example: if someone is providing you service that will save you 1 hour, how much are you willing to pay for that service.
[8] Whether building your startup "based on your existing means" is better than "based on your passion and then expanding your means as you go" or vice versa, I really don't know. There are obvious advantages for both and also obvious disadvantages when pushing the respective principle to extreme.
[9] Even though you might think an idea is pretty good (and will benefit the customer), people are not willing to pay as much as you think ... probably because it's either suspicion that you are trying to dupe them or incorrect valuation of the product/market from your end or something else.
[10] Though it might seem simple to just plug in an existing credit-card library and set up a php page with a mysql backend, things get a little complicated when you start thinking about transactional semantics and the fact that people can buy new devices or delete the app and similar situations.
[11] In a traditional company, a developer would be evaluated based on stupid metrics, such as bugs assigned, solved, etc. Though on paper they seem apt, they can have harmful side-effects for a startup, such as spending too much time perfecting a feature without any customer validation/feedback. So, instead of spending significant resources on testing, submit your app to Apple as and when you add a new feature and indirectly ask them to test it :) ... Another way to test cheaply is by using crowd-sourcing websites such as Amazon's Mechanical Turk (which I will ignore for this post).
[12] Since the SensorTag took a while to be delivered to my home address (thanks to my apt complex rejecting the package), I decided to use an accessory that I already owned for version 1.0 (i.e. headphones) and introduce bluetooth feature in the next version.
[13] There are a lot of wordpress plugins that allow you to add a captcha to your website in just a few minutes.

Notes on entrepreneurship (Part 1)

2013-10-01T15:57:00.000-04:00

This semester, I am taking ENGI 540, an entrepreneurship class offered by Rice Center for Engineering Leadership that focuses on lean startup and effectuation principles. It follows a "reverse classroom" approach where we are supposed to go through preparatory readings and watch online lectures before every class. In the class, we discuss what we accomplished every week and also interact with guest entrepreneurs, venture capitalists and IP lawyers. I thought it might be nice to blog about my experience and share a few insights I got from taking this class.

In the first class, and after reading through the first few chapters of Effectual Entrepreneurship, I found the following concepts very intriguing:
1. Defining one's "means", which my professor defines as "who you are, what you know, what you have and who you know". The core idea in a startup is to start with your means and create opportunities; sometimes expanding one's means through partnership.

2. Ideation: One interesting habit that a budding entrepreneur can practice is to keep an idea journal, where you force yourself to come up with 1 new idea every week (or if you are ambitious, every day). There are various tools available for idea generation like mind mapping, lateral thinking, brainstorming, ... (will leave this for some other blogpost). Here are some of the tips for generating business ideas:
(2.a) Look into areas of change: Eg: opportunities in the healthcare domain (because of ObamaCare) and the energy field (because of the increase in oil/gas prices & environmental impact). You can also analyze consumption habits based on demographics [3], especially in markets that are rapidly changing, i.e. BRIC countries (Brazil, Russia, India and China).
(2.b) Pain points: Is there a problem that is bothering you that is not solved by any existing solutions ?
(2.c) Cross-domain: Can you apply your expertise in one field into other ? For example: if you have expertise in end-user/consumer marketing, you can find an opportunity in a field which only focuses on corporate services (by delivering those services to end-user).

3. Opportunity evaluation: It is not possible to know a priori whether an idea will turn out to be a good business or not. Most successful entrepreneurs and investors claim there is only one way to see a good business opportunity: Go ahead, implement it creatively with very low levels of investment and either find real customers who are willing to buy the product or service at a reasonable price or locate partners who are willing to commit real resources to the venture early on, or ideally both [1]. Still, there is a high-level checklist that you can use to check if your idea is good or not:
(3.a) Does your idea have unique value proposition; if yes, what is it ?
(3.b) Is your idea profitable and sustainable (i.e. profitable over long period of time) ?
(3.c) Is your idea defensible ? People protect their idea either by patents/copyrights/trade-secrets, etc. Sometimes early entry into the market and strong user-base also makes an idea defensible.
(3.d) Does it have a clearly defined customer ?
(3.e) Other questions: Is it feasible ? Is it scalable ?

4. Causal reasoning states: to the extent we can predict the future, we can control it; whereas effectual reasoning states: to the extent we can control the future, we don't need to predict it. So, instead of creating an elaborate predictive business model before beginning a startup (which can be frightening), it might be a nice idea to follow the effectual strategy: focus on what is controllable about the future and only commit what you can afford to lose (i.e. the "affordable loss" principle [2]). For extremely risk-averse people who might prefer inaction, I would like to cite Sarasvathy's quote: "20 years from now, you will be more disappointed with things you didn't do than things you did do".

Reference:
[1] Effectual Entrepreneurship - Sarasvathy et al.
[2] A good application of affordable loss principle for iPhone development is given in this youtube video.
[3] Procter & Gamble changed the delivery mechanism of shampoo in India ... small packets instead of large bottles.

Religion and Science

2013-05-13T16:49:00.000-04:00

When I first decided to write a blogpost on this topic, the first thought that struck me was "What can I write about religion and science that isn't already articulated by others ?". This was quickly replaced by another thought "If it is so, why do the majority of people have twisted views about them ?". The answer was pretty simple: In public venues or in mainstream media, people with extreme views are the ones who are most vocal, whereas moderates often seem unconcerned, callous or worse, condescending.

Since moderates are often characterized as being tolerant (which is a good thing), let's try to understand what it really means to be tolerant. Is it the same as being indifferent ? If you are tolerant, should you be vocal about it or is it sufficient to say "you believe what you want to believe, but don't bother me" [12] ? Should it be a choice or a requirement ? Would you be tolerant if you chose not to tolerate the intolerant (often called terrorists, extremists, fascists, nazis, etc.) [1] ? Wait, it gets even more complex: Who defines the term "intolerant" ? For Islamic extremists, it's the Western world stopping them from enforcing their interpretation of sharia law. For the Western world, it's cowardly and inhuman attacks on civilians by these terrorists. For some, it's bible thumpers trying to dictate how they should live their life; for others, it's atheists (or maybe liberals), who are ruining their family or moral values. The list goes on and on and is equally applicable in my beloved homeland India and gracious guest country USA (and I suspect every diverse nation in the world). Unfortunately, unlike Karl Popper I don't have a globally applicable answer to these questions [10]. A local answer is to be tolerant in terms of freedom of speech, but hold people responsible for their actions in terms of secular humanitarian laws [2]. Bottom-line: in a civil society, the simplest working principle seems to be that we educate ourselves (enough to be intellectually capable of deciding right from wrong), embrace the diversity, be tolerant of opposing views and let the law judge the people who we feel are intolerant. It might also help if we abandon our cloak of complacency, engage in serious and meaningful debates/discussions and elect representatives who share our thoughts.
Before I come back to the original topic, let me explain the dangers of being intolerant: If you choose to accept so-called "little intolerance", you will soon find yourself governed by "little intolerant" elected officials and, as history has shown us, they will soon be replaced by "little more intolerant" elected officials (who would win the elections by saying the former aren't intolerant enough) and the cycle will continue.

Since there are different definitions of the same term [3, 4], for the sake of this blogpost, let me explain my interpretation of these terms.
1. You are a theist if you believe that God created the world and takes an active interest in the world. For example: if you are a theist, in times of trouble you believe that God will intervene to provide you with guidance or in some way affect the outcome. How far and how often God intervenes depends on the degree of your theist beliefs.
2. Almost all theists also follow one of many organized religions, which are based on revelations given by prophets or reincarnated Gods. It is important to note a few things about organized religions:

- There is a clear distinction between prophets and ordinary people, the former being first-class citizens who are the only people that have access to these revelations.
- Since revelations are the only mechanism for finding or validating "religious truths", it makes scientific inquiry impossible. This is because as of this day, there is no clear way to distinguish between the revelations of first-class citizens and hallucinations/delusions of ordinary people. This is also true for spirituality. So, religion has a concept of faith which requires you to believe something because the first-class citizens said so [5].
- These revelations can be categorized into two parts: information about God (and hence the universe) and information about how to conduct ourselves in the world. Often, the former (for simplicity, let's call it knowledge) is used as the main reason to stick to the latter (let's call it moral guiding principles or human values) in our day-to-day life. By definition, knowledge is considered to be absolute and universally-applicable [6] whereas the moral guiding principles are treated as relative [7]. The consensus among many intellectuals is that the scientific method is the best available tool for exploring and validating knowledge. For moral guiding principles, people often use either religion or philosophy or spirituality or law or introspection or advice from friends and family [8].
- Most revelations are described in holy books and are often accompanied by a back story. Many anti-theists criticize the revelations because of the validity of the back story rather than the applicability of the revelations themselves. Some theists believe them to be an exact historical account whereas other theists believe them to be embellished so as to get the message through to the general public [9]. In these back stories, there are often mentions of so-called miracles and many use the argument that the specification of these miracles is a proof of validity of the revelations and hence God [11]. However, non-reproducibility and often embellished centuries-old second-hand accounts make these miracles highly improbable, if not impossible, and hence they are not accepted as facts by the scientific community.
- Even though it is often suggested that all religions say the same thing [12], many organized religions are very clear on one particular requirement: "To adhere to religion X, you must ignore teachings of all other religions and consider them to be pagans". So, ipso facto all religions don't say the same thing; it is what we choose to believe to promote tolerance. I, on the other hand, think we should celebrate pluralism and not be afraid of the fact that different religions say different things. The only reason we ever need to be tolerant is coexistence and survival of the human race.
- Many religions are based on the concept of two worlds: the physical world (where we live) and the transcendent world (which is termed as heaven, paradise, etc). Some religions that have the concept of rebirth and karma (eg: Buddhism and Hinduism) also have the concept of enlightenment (nirvana or moksha), which is often referred to as a state of being, rather than a physical location (like heaven). The hidden assumption in the two-world philosophy and most religions is that human beings and earth have the undivided attention of God and hence have a special and central place in this universe. Science makes no such assumption and in fact many scientists believe the human race and earth to be a product of a wonderful accident.

3. Science on the other hand is based on the "scientific method", which the Oxford dictionary defines as "a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses". Since there is already an enormous amount of literature on this, I won't discuss it in more detail. However, I do have a few pieces of advice for young scientists:

- Make sure you understand how to differentiate between science and pseudoscience.
- Don't be afraid of uncertainty and not knowing everything. It's OK to say "I don't know"; you will be surprised how often that phrase is used by renowned scientists.
- Avoid the urge to be self-centric and to accept something to be fact because you feel it "deep inside" [12]. Also, don't make up stories that have no scientific basis to explain things you don't understand.
- There is no such thing as faith in science. It is replaced by reasoning and experimentation. For example, when you read a research paper or a book, always exercise critical reasoning on it to make sure that the results are plausible. If you think something is fishy, conduct an experiment to validate its result and hypothesis [5]. This is also true when a professor explains to you the laws of motion or electricity or something else and claims it as science.
- For reasoning and validation, stick to Occam's razor, i.e. don't just jump to the easiest/quickest answer, instead put in the effort to find the answer which requires the fewest assumptions. For experimentation, at a minimum, understand the correlation-causation fallacy and also statistical testing. One particular book that you can start with is How to Solve it by Polya.

Even though both religion and science started because of human curiosity to explain the universe [13], they employ completely different methods/practices: revelations and the scientific method, respectively. To make sense of the debate between them, it is necessary to recognize that these methods are inherently incompatible. So, unless all the parties agree on a common method for finding and validating facts/truths, I think the whole argument will keep on spiraling. So, if a religious person chooses to use the scientific method to convince me of the validity of the revelations, I welcome it; if not, I prefer not to engage in the discussion at all.

Just like religious missionaries, let me take a moment to state why you should use scientific inquiry for finding facts and solving problems ;) Look around at what scientific inquiry has enabled us to discover and invent: medicines, cars, internet, computers, phones, planes, trains, cameras, music players, guitars, paints, bricks, electricity, bulbs, etc. Whether you are a student or a Nobel prize winner, a poor person or a millionaire, a prisoner or the president, your experimental findings and ideas will be judged purely on their merit and nothing else ... not on your status or gender or age or religion or race or caste or nationality. You are allowed to ... no, you are encouraged to come up with new ideas, ask questions, doubt and investigate widely held beliefs. In fact, I urge you to doubt everything in this blogpost, discuss it with your teacher/professor/colleagues/priest/friend/family, read books about both religion and science and come up with your own opinions :)

Let me summarize this blogpost in a sentence: Whether religion or spirituality is an inevitable and necessary part of human life, I don't know; but I am absolutely sure that they don't have a place in the science classroom or textbook.

References and footnotes:
[1] The paradox of tolerance: http://jefflees.wordpress.com/2010/01/22/the-paradox-of-tolerance/
[2] The legal argument about tolerance gets more complicated if majority decides to collude to deliberately oppress the few, which has happened far too often in the past. However, we won't get into that and let's rely on inherent human goodness and assume that there are constitutional provisions to disallow that.
[3] http://en.wikipedia.org/wiki/God
[4] The Cambridge companion to atheism
[5] Scientists are expected to practice same level of skepticism to the theories of even the most accomplished scientists (be it Newton, Einstein, Darwin, Turing or Feynman), i.e. science does not have authorities, just heroes.
[6] Even though I called knowledge absolute and universally-applicable, that was a little misleading. Read research journals to find just how many previously accepted theories are replaced by new theories in light of new evidence, which essentially means knowledge and science are ever-evolving. Anyways, I urge you to read about epistemology for understanding the nature and scope of knowledge.
[7] Religious fundamentalists believe "how to conduct yourself" is absolute (i.e. strict adherence to revelations).
[8] Recently, I encountered a person who was against organized religion, but claimed that spirituality is better tool than science for both exploring knowledge and for moral guiding principles. He didn't provide me with any evidence or rationale to support his hypothesis, so I disagree with him on that. It is also important to point out here that Sam Harris suggests that science can also determine human values.
[9] Irrespective of which is true, one thing I submit to is that they probably were the best collections of ideas of their generations. However, in this age of globalization and democracy, I don't think they are applicable.
[10] At the international level, with so many dictatorships, monarchy, theocratic governments, territorial disputes and dogmatic/partisan politicians around, it is difficult to come up with one principle that will work every where. Non-intervention (or non-alignment) expects every country to make decision before you stop a genocide (example: EU and Bosnia). Respecting every nation's sovereignty means pulling out your troops from unstable governments (like Iraq and Afghanistan) or countries surrounded by superior military powers (like Kuwait, South Korea, Taiwan or Japan). Think non-violence and Tibet. So, this blogpost is only applicable to citizens of secular nations with access to freedom of speech.
[11] If you believe in miracles, aren't miracles of science (pun intended) equally mind-blowing ? For example: flying objects, going to the heavens, creating new life, curing the sick, an all-knowing oracle, walking on water, being invisible, etc.
[12] Penn Jillette: Why tolerance is condescending ?
[13] Of course, many religious people will disagree with this opinion and will point out that religion exists because God wanted it to exist, not because humans were fishing for answers.

For those people who followed the entire line of argument, I must admit that I deliberately skipped the underlying question about the existence of God in non-religious/non-theist, atheist and agnostic settings.

Hadoop MapReduce - System's view

2013-03-27T19:49:00.001-04:00

Here are videos that I created to explain Hadoop's MapReduce framework from a system's perspective. As a disclaimer, these videos are at a level of abstraction that I liked, which means they might not be suitable for a general Computer Science audience. If however, you have some experience with Hadoop, you should be fine. I have skipped over a lot of Hadoop's features for simplicity and succinctness, like HDFS, Hadoop I/O, setting up/administering a Hadoop cluster and writing Hadoop programs (and various high-level languages/tools that use Hadoop).

1. Job Submission at the client side

2. Job submission at the JobTracker side

3. Task Scheduling

4. Task Creation

5. Map execution

6. Reduce execution

Slides: PDF, Keynote
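
For readers who want something to poke at before watching, here is a minimal sketch of the map, shuffle and reduce phases as a word count. It is plain in-memory Python and not Hadoop's API: the names word_count_mapper, sum_reducer and run_job are made up for illustration, and it deliberately ignores job submission, the JobTracker, scheduling and task creation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, int]


def word_count_mapper(line: str) -> Iterator[KeyValue]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key: str, values: List[int]) -> KeyValue:
    """Reduce step: collapse all values seen for one key into a single count."""
    return key, sum(values)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[KeyValue]],
    reducer: Callable[[str, List[int]], KeyValue],
) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially in a single process."""
    # Map: apply the mapper to every input record.
    intermediate: List[KeyValue] = []
    for line in lines:
        intermediate.extend(mapper(line))

    # Shuffle: group the intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: call the reducer once per key, in sorted key order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["Hadoop map reduce", "map reduce map"], word_count_mapper, sum_reducer))
```

In a real cluster the three steps in run_job are spread over many machines; here they run one after another in a single process, which is only meant to make the data flow visible.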

Video tutorial on reinforcement learning

2013-03-14T15:49:00.001-04:00

This video tutorial is based on a talk I gave at my Comp 650 class and hence assumes that the audience knows a little about Markov Decision Processes and Partially Observable Markov Decision Processes. If you are unfamiliar with them, I recommend that you at least read through their wikipedia pages before going through this tutorial. Also, I suggest that you view this video on youtube, because the embedded video given below is cropped.

If you have any suggestions, comments or questions, please feel free to email me :)

PDF slides

My views on recent protest in New Delhi

2012-12-25T16:47:00.000-05:00

According to the 2011 census in India¹, there are 586469174 women in India, i.e. 48.46% of India's population and 8.38% of the world's total population (which is roughly the total number of English and Spanish speaking people in the world).

Out of these only 65.46% are literate² (as compared to 82.14% literacy rate in males), with the definition of literacy being very narrow.

Crimes against women³ include rape (20737 cases in the year 2007), dowry deaths (8093 cases), molestation (38734 cases), sexual harassment (10950 cases) and cruelty by husbands/relatives (75930 cases). To add to that, there is human trafficking (for sexual slavery or domestic servitude), female infanticide, discrimination, acid attacks, lack of health care and forced marriage. The number of rape cases has increased by 80% in 7 years from 2000 and is still increasing. In the year 2011, 26000 rapes were reported⁴ throughout India; in fact it is the fastest growing crime in India. According to the Thomson Reuters Foundation⁵, India is the fourth most dangerous country in the world for women, behind Afghanistan, Congo and Pakistan.

According to many, there are three main reasons behind this:
1. Social taboo:
We Indians (including me) are hypocrites when it comes to dealing with rape survivors⁶. They are ostracized to protect the family's so-called honor and hence most of the rape cases aren't reported or are dropped. Also, in the event of such crimes, most of us prefer not to intervene and help the victim (be it eve-teasing, dowry, rape, etc) and her family. This indifference in our attitude provides additional incentive to the rapist to commit such a heinous crime.

2. Severity of punishment:
An unenforced law means nothing, however stringent it is. So, I don't think severity of punishment acts as a deterrent to rapists. In fact, considering how our law enforcement agencies (both judicial and policing) are, the rapists might not be wrong to think they won't be apprehended at all. With such an alarming increase in rape cases, one might think it's reasonable to support increasing severity in special circumstances⁷ and especially for repeat offenders. But, the point to note is, in India, rape is punishable by up to life imprisonment⁹, which I think is severe enough. The main problem is that it doesn't get enforced or it takes way too long.

3. Lack of speedy and fair trials with severe punishment to rape shield violators:
There already are provisions for speedy trials¹⁰ and rape shields¹¹, but they don't get enforced due to lack of resources and infrastructure. First, there are 26.3 million pending cases⁸. Then there aren't enough women officers (I believe there should be an option for the victim to talk directly to women officers in every police chowky, for reporting rape, molestation, etc). Lastly, even though India has one of the strongest rape shield laws, Chaudhry¹¹ says they are routinely violated or reduced to a technicality in practice. It is imperative to punish any authority that violates the rape shield.

Let me reiterate my point: an unenforced law will solve nothing. I see that the protestors (whom, btw, I salute for their efforts, barring the ones that are doing it for political gains and media attention/ratings) are focusing on the wrong issue: asking for a death penalty law for rape.

Why ? Researchers have shown again and again¹² that certainty, and not severity, of punishment proves to be the true deterrent for crime. In fact, increasing severity can turn out to be counterproductive (i.e. may lead to more false negatives).

Sure, increasing severity might help us, as a society, deal with the shock, but it will do nothing to help improve the safety of women. The law will be passed, authorities will keep on violating the rape shield, there will still be little or no resources to handle the cases quickly and we shall go back to living our lives, forgetting this incident ever happened and at the same time congratulating ourselves for a job well done. If you do truly care about the situation, let's forget the quick fix and do a more difficult thing: Take a step back and dispassionately think how we can solve this problem once and for all. Here is what I could think of:
- Punishment or/and suspension of any authority breaking the rape shield (or discouraging victims from registering complaints).
- Increase in forensic and judiciary budgets to expedite trials.
- More women cops and psychological experts to help deal with the trauma. Also, create a special department for dealing with these crimes or train existing police officers to deal with such scenarios.
- Incentive programs for improving the female literacy rate.
- Standing up against crimes against women when we see them, while still keeping our paternal instinct in check so that every woman gets to enjoy her liberty.

It took me half a day to write this blogpost and I humbly admit I am missing a lot of points. But, I am sure a more experienced panel can come up with a better solution if they are not sidelined by politics or their own self-interest.

References:
1. Census of India: http://www.censusindia.gov.in/
2. http://en.wikipedia.org/wiki/2011_census_of_India
3. National Crime Records Bureau: http://ncrb.nic.in/cii2007/cii-2007/1953-2007.pdf
4. http://en.wikipedia.org/wiki/Rape_in_India
5. http://www.trust.org/documents/womens-rights/resources/2011WomenPollResults.pdf
6. http://www.ndtv.com/video/player/news/determined-not-to-hide-her-identity-rape-survivor-bravely-tells-her-story/259248
7. http://en.wikipedia.org/wiki/2012_Delhi_gang_rape_case
8. http://www.hindustantimes.com/News-Feed/india/Nearly-30-million-cases-pending-in-courts/Article1-224578.aspx
9. http://indiankanoon.org/doc/1279834/
10. http://defensewiki.ibj.org/index.php/Right_to_a_Speedy_Trial#India
11. http://www.firstpost.com/living/gurgaon-gang-rape-let-me-tell-you-all-about-this-girl-242185.html
12. http://www.sentencingproject.org/doc/deterrence%20briefing%20.pdf

Education: whether to skip school or not ?

2012-08-12T12:51:00.000-04:00

A couple of days back, one of my friends (Deepak) shared a post, which led to a somewhat heated debate. Before I go any further, I must admit I have not read the blog writer's (Nikhil's) book and, judging from the title "One Size Does Not Fit All: A Student’s Assessment of School", he seems to be a radical thinker (as suggested by Deepak). Also, the above-mentioned post might have assumed that the reader has already read his book, and hence the post itself could be out of context for me.

Personally, I felt the post was just a collage of motivational/self-help pieces, without a direct and measurable theme/message. People don't need cheer-leaders that say "here are pictures/quotes of Einstein, Jobs, Ford ... yaay ... now, go and become like them"; they need a coach ... a teacher, who gives them a clearly defined way or approach to achieve it ... stating how much effort one needs to put into it and what the pros and cons of following that approach are. For example, the 10,000 hour rule of Gladwell's Outliers or Allen's GTD. Sure, there are gray areas, situations where people only need motivational speeches and not solutions (for example: Notes to myself) ... and maybe the above post was intended for exactly that.

Just a side-note (and not directed at the author of the post): I think writers of self-help books (or any non-fiction book for that matter) should aspire to journalistic integrity ... which means checking facts before you make a statement, validating your propositions and yes, being direct and clear. If you don't have time to research something (for example: if you are writing a time-sensitive news article), or are making some assumptions (say, about the audience or the context of the article), add a disclaimer and make a reasonable statement based on Occam's razor. And most importantly, be very critical of the implications of what you write. Sure, you are allowed to be wrong (in a few if not all scenarios; for example, if you track scientific research, every once in a while an idea proposed a few years back is discarded). In fact, there is a mathematical argument that you cannot prove anything to absolute certainty. But it should at least pass a sniff test of critical thinking: if you make your case to a rational person without any prejudice on that particular subject, is he/she going to dismiss it outright? An example of a totally ridiculous argument that does not pass this sniff test is Zakir Naik on NDTV about Osama bin Laden, which might not be a violation of the first amendment (i.e. freedom of speech), but clearly can have damaging consequences, especially if the audience is ill-informed and trusts the speaker.

Now coming back to the debate (given in the below image ... the first line that is cut off is "Sorry Deepak, I don't see any tangible idea in his blog ... may be i am missing something ... sure, ") with my friend, the sentence in the blog that ticked me off was "Drop the textbooks. Get out and do it":

After this debate/conversation, I felt like I was just a critic, bashing the post and nitpicking on a single possibly unintended sentence, rather than contributing to the subject. Hence, I decided to write this post to give my views on education.

I am doing a PhD, so there is going to be a bias in support of education. So I ask you to consider the points discussed below with respect to your situation, taking into account my personal bias.

I honestly believe formal education is the easiest and fastest way to success (though many students don't feel that's true due to the amount of competition they see around them). You can either work really hard during four years of graduation or struggle for the remaining forty years of your career to compensate for those four years. There are exceptions of course, for example, athletes who prefer/need to hone their skills rather than study (like Sachin Tendulkar, Usain Bolt), entrepreneurs for whom the opportunity cost is higher than the value of education (like Bill Gates, Steve Jobs, Richard Branson) and people who are forced to discontinue their formal education due to certain circumstances (Nobel laureate Doris Lessing). A lot of books and posts by these entrepreneurs advocate trying to be one of these exceptions, but I suggest caution. They are called "exceptions" for a reason; count the number of people (either in your locality or in the world) who left school versus the number of people who attended school, and then use your measure of success to classify them as successful or unsuccessful ... you will find your answer :)

So, why do some smart people (like the authors of the books mentioned in the above paragraph) suggest leaving school? Well, you need to read between the lines. If you have a fairly concrete idea of your product/service (along with a somewhat reasonable plan) and believe that waiting to finish your education will cost you the competitive edge, then it might be a good idea to take a sabbatical and be an entrepreneur. Also, read advice and books on startups, and understand the novelty, amount of hard work and dedication required to build a successful startup before you make your decision to leave school.

Regarding Sean Parker's advice "skip school, google your education", it doesn't work. The government spends a lot of money on the educational system (such as deciding curriculum, giving grants, hiring/educating teachers, etc.) and it would be a bad idea not to capitalize on it. I would modify his advice: "continue your school, but also google your education". There are a lot of websites that give free courses/talks (like coursera, MIT, TedTalks etc.); utilize them. Read wikipedia, books (textbooks and other non-course books), research papers, newspapers and blogs, then exercise critical thinking and combine them with the advice given by your teacher to formulate your own opinions/views ... the world really needs more informed people. Also, there are a lot of classes/camps that you can attend/audit to supplement your education. For example: I completed a PG Diploma course in Embedded Systems from ECIL (a course offered only to graduates) during my second year of engineering at VJTI. Two things I learnt during this experience are (1) even though there are pre-requisites, educators are willing to overlook them if you show enough enthusiasm and determination; (2) if you choose to break the rules, be ready to put in twice the amount of effort your peers are putting in. During my four years of engineering, every semester, I always had a parallel course or non-course project that I was working on. This helped me immensely after my graduation. My point is you don't have to succumb to conformity (there are ways to supplement your formal education), and if you really plan to leave formal education, make sure you are doing it for the right reason (and not because you are lazy or because you feel leaving studies will somehow magically make your life easier, and especially not because someone cites you a quote from a successful person).

Here is a high-level intuition of why formal education seems tedious and boring: society wants to filter out people who cannot survive the dip. (The dip is a period of long struggle that is necessary to achieve the goal and be differentiated from the general public.) For example, if PhDs or medicine were not as difficult as they are, there would be an abundance of doctors and PhDs, and the elite status they enjoy now would be gone; so society wants to ensure that these so-called elites really have to struggle before they can be labelled into this category. Sure, there are other reasons too: some degrees (PhDs, MDs) require knowing a lot of stuff, and hence the long graduation period. You can't become a doctor by just knowing how to apply a bandage or by memorizing prescriptions for common diseases. For the sake of argument, say a person could; would you want your family member treated by such a person?

That leads to the question of what the rational reasons to quit your studies are (other than the case discussed above): if you think (or can forecast) that after surviving the dip you won't like what's at the end of it, then it might be a good idea to choose a different path. For example: during the first year of medical school, if you don't like the life of a doctor, it might be a good idea to drop out of medical school and do something that you are really passionate about. But it is clearly a bad idea to drop out just because you feel medical school is hard or there is too much competition. Similarly, there are also rational reasons to quit a PhD, but all of them should be accompanied by careful thinking and introspection. The book also covers other scenarios such as the dead-end, where you have no hope of moving ahead and are stuck in a routine ... but I don't think education fits into that scenario.

Note, if you believe you have dyslexia, bipolar disorder or any such health issues (either physical or mental), then you need to completely ignore my advice and consult your family and/or therapist, as they will give you better suggestions than this blogpost.

Improving speech recognition using clustering-based adaptive model assignment

2012-07-16T16:17:00.000-04:00

I came up with this idea while working on Voca. Since I don't intend to include this as part of my PhD thesis, write a research paper (since I don't have time to perform experimental evaluation) or make money out of it by patenting it (thanks to extremely slow Rice OTT process), I thought it might at least make a good blog entry.

In this blogpost, I will discuss a novel approach to improve speech recognition that in turn will help improve the accuracy of the Voca desktop app (if and when I decide to work on it). This approach can be applied to any speech-to-text software that is to be distributed to the masses (e.g. Siri).

1. Background

1.1 Introduction
Automatic speech recognition (ASR) involves predicting the most likely sequence of words W for an input audio data A. To do so, the speech recognizers use the following optimization function[2]:
\begin{align*} \hat{W} &= \operatorname*{arg\,max}_{W} P(W \mid A) & \\ &= \operatorname*{arg\,max}_{W} \dfrac{P(W)\, P(A \mid W)}{P(A)} & \dots \text{ Bayes' theorem} \\ &= \operatorname*{arg\,max}_{W} P(W)\, P(A \mid W) & \dots \text{ the denominator } P(A) \text{ does not depend on } W \end{align*}

The term P(W) is the probability of occurrence of a given sequence of words and is defined by the language model. At a high level, the language model captures the inherent restrictions imposed by the English language. For example: a language model might imply that the sentence “it is raining heavily in houston” is a more likely sentence than “raining it is in heavily houston”.

The term P(A|W) refers to the probability of the audio data given the sequence of words and is captured by the acoustic model, which at a high level is a probabilistic mapping from acoustic sounds to words. The acoustic model is usually found by training a hidden Markov model on a small subset of users. The users that train the acoustic model are referred to as trainers in subsequent sections.

To summarize, the speech recognizer searches for the sequence of words, that maximizes the above function using the acoustic and the language model. The output of this search procedure is a tuple < words W, confidence >, where confidence refers to the value of the above function for W. For example: <“it is raining heavily in houston”, 0.9 >
Thus, the accuracy of a speech recognizer depends heavily on these two models. Since only the acoustic model varies with the user, in our further discussion we will only refer to the acoustic model and assume the use of a standardized language model (e.g. HUB4).
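
To make the search concrete, here is a minimal sketch in Python of picking the word sequence that maximizes P(W) P(A|W). The two lookup tables and their probabilities are made-up toy values (assumptions for illustration, not output of a real language or acoustic model), and a real recognizer searches a far larger space than a two-item list.

```python
def decode(candidates, language_model, acoustic_likelihood):
    """Pick the word sequence W that maximizes P(W) * P(A | W).

    language_model[W] stands in for P(W); acoustic_likelihood[W] stands in
    for P(A | W) for one fixed audio input A. Both are toy lookup tables.
    """
    def score(words):
        return language_model[words] * acoustic_likelihood[words]

    best_words = max(candidates, key=score)
    return best_words, score(best_words)


candidates = [
    "it is raining heavily in houston",
    "raining it is in heavily houston",
]
# Hypothetical probabilities, chosen only to illustrate the trade-off.
language_model = {candidates[0]: 0.02, candidates[1]: 0.0001}
acoustic_likelihood = {candidates[0]: 0.3, candidates[1]: 0.4}

best_words, confidence = decode(candidates, language_model, acoustic_likelihood)
print(best_words, confidence)
```

Even though the second candidate has the slightly higher acoustic likelihood here, the language model term dominates and the grammatical sentence wins.
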
1.2 Roles

There are three types of people involved in speech recognition process:

Developers write the code for the speech recognition software.
Trainers record their voice and give the audio files to the developers. Developers then use these audio files to train the acoustic model.
Users are the general audience who use the speech recognition software. The number of trainers is usually much smaller than the number of users.

1.3 Steps in speech recognition process

The steps in a typical speech recognition API/software are as follows:
1. Before releasing the API/software:

(a) Developers encourage trainers to record their voice.
(b) Developers use the audio files to train a universal model for all the users. The developers may also use information provided by experts to improve language model.
(c) Developers bundle together the model and the search algorithm and release the API/software.

2. After releasing the API/software:
(a) The search algorithm in the API/software tries to find what the user is saying by comparing the user’s acoustic signal with its universal model using various statistical methods.

1.4 Challenges

However, speech recognition is inherently a hard problem due to issues like:

Differences in speaking style due to accent, regional dialect, sex, age, etc. (e.g. the acoustic signature of a young Chinese girl saying some words will probably differ from that of an older German person speaking those exact same words).
Lack of natural pauses to detect word boundaries.
Computational issues like search space, homophones (words that sound the same), echo effect due to surroundings, background noise.

A lot of research effort has been put into tackling the issues in points 2 and 3. To tackle issue 1, the most common solution is to train on as many people as possible to create a somewhat universal acoustic model that captures these differences (as mentioned in step 1b of section 1.3). This is because most people do not want to spend hours training the acoustic model and would prefer an out-of-the-box solution. The problem with this approach is that it always creates a biased acoustic model that works well for some people, but is highly inaccurate for many.

2. Summary

Instead of using one universal model that works for everyone, we propose the following approach:

Train k different models in step 1b of section 1.3 (where k is a reasonably small number, for example: 100) such that each of these models tries to capture the speaking style of a particular group of people. Then, recommend the model that best suits the user depending on his/her acoustic similarity with each of the above groups.

2.1 Why this idea will work in practice ?

It is a well known fact that the accuracy of the speech recognizer is higher if the trainer and the user are the same person.
As pointed out earlier, most users do not want to spend hours training the model.
Some people are more acoustically similar than others. For example: a model trained on a young native American student will suit a similar student well, but might not suit an old Indian lady. Hence, if the trainer and the user speak very similarly, the model will have higher accuracy than a universal model.
We would like to point out that users can be clustered into groups by the way they speak and that the number of these groups is finite. This is basically a soft assignment; for example, I would fall in the category American accent with probability 0.2, but in the category Indian accent with probability 0.7. This knowledge can thus be incorporated in the recognition process.

2.2 Related Works

An area that is close to our invention is speaker-adaptive modeling[1], where the universal model is adapted to the user based on training on speaker-dependent features. This means the parameters of the acoustic model are split into two parts:

Speaker independent parameters
Speaker dependent parameters

This division requires the user to supply a relatively smaller training set (which need not be done during initial setup). However, it still has the following difficulties:

Training is a significantly compute-intensive task and might not be a viable option for smart phones, netbooks or even laptops. Offloading the training onto a server will involve more network usage (which in turn will increase the cost of speech recognition for the user and/or the developer).
Training might require expert-intervention.
It still is not an out-of-the-box solution as an off-line solution will require some training to make it speaker-adaptive.

For n users (where n ≫ k), speaker-adaptive systems will in effect have n models, whereas our clustering-based model assignment approach will have k models. Since the k models are trained on the server, model training will also benefit from expert intervention without any additional cost to the user. It is important to note that model assignment does not require any expert intervention and is not the same as speaker-adaptive training.

3. Detail discussion

3.1 Key conditions

1. Every user can be represented as a mixture model over the acoustic models. Example: let’s say that user U1 is sampled with mixture proportion {0.9, 0.1} over models m1 and m2 respectively. This means that if we run recognition on the audio file generated by user U1 on both the models, it is likely that model m1 will produce output that is 9 times more accurate than that of model m2.
2. When we create a model, the job of the expert is to make sure that:

(a) Each model is representative of a distinct accent.
(b) Every user (on average) is expected to have high affinity to only 1 model.

This means that if we have two sets of mixture proportions, {0.002, 0.99, 0.003} and {0.33, 0.33, 0.33}, for a random user, the former configuration is preferable.
3. Corollaries based on above points:
(a) A model is simply a cluster of users.
(b) A user can belong to multiple clusters.
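
One way to quantify "high affinity to only 1 model" is the entropy of a user's mixture proportions: the more peaked the proportions, the lower the entropy. The sketch below is only an illustration; using entropy as the yardstick is my assumption here, not part of the method described above.

```python
import math


def mixture_entropy(proportions):
    """Shannon entropy (in nats) of mixture proportions, after normalizing them."""
    total = sum(proportions)
    return -sum((p / total) * math.log(p / total) for p in proportions if p > 0)


peaked = [0.002, 0.99, 0.003]  # user clearly belongs to one model
flat = [0.33, 0.33, 0.33]      # user is equally (un)related to every model

print(mixture_entropy(peaked), mixture_entropy(flat))
```

The peaked configuration has the lower entropy, which matches the preference stated in condition 2.
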

3.2 Modified steps for speech recognition

We propose a modified methodology that will improve the accuracy of the speech recognition:
1. Before releasing the API/software:

(a) Developers encourage trainers to record their voice.
(b) Using the model described in section 3.3, we get k models (for example: k = 100).
(c) Choose one of these k models as a default model. Also, set a value for the parameter n (for example: n = 3).
(d) for i = 1 to n do:
- Select a new phrase p_i such that all k models have at least one audio file that has phrase p_i or one very similar to the phrase p_i. Hint: during the training process, ask every trainer to speak the phrase p_i. Use edit-distance and some tolerance delta (= 5) to find similar phrases.
(e) Developers bundle together the default model and the search algorithm and release the API/software.

2. After releasing the API/software:

(a) During installation (or during periodic re-tuning), for i = 1 to n do:
1. Ask user to speak the phrase p_i and send the audio file to the server.
2. For each model j, do:
  1. Do speech-to-text transcription for the given audio file.
  2. Let p′_i,j be the output phrase of the transcription.
  3. Let ε_i,j be a measure of the error in the transcription of phrase p_i on model j. For example: ε_i,j = Edit-Distance(p′_i,j, p_i). Note that, depending on the training data, an expert might choose to use a different metric (before releasing the API), such as acoustic similarity between a random audio file in the model and the audio file supplied by the user.
(b) Return to the user the model that has the minimum error over all the phrases.
(c) The user uses the model that is sent by the server rather than a universal model.
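
The server-side assignment in step 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the edit distance is computed over characters, "minimum error for all the phrases" is read as the smallest summed error, and the `transcriptions` dictionary stands in for the output of running each model's recognizer (no real speech-to-text call is made; the model names are hypothetical).

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def assign_model(phrases, transcriptions):
    """Return (best model name, total error per model).

    transcriptions[name][i] is what model `name` produced for phrases[i].
    """
    total_error = {
        name: sum(edit_distance(out, ref) for out, ref in zip(outputs, phrases))
        for name, outputs in transcriptions.items()
    }
    return min(total_error, key=total_error.get), total_error


phrases = ["it is raining", "open my email"]
transcriptions = {
    "m1": ["it is raining", "open my email"],
    "m2": ["it is training", "open by email"],
}
best_model, errors = assign_model(phrases, transcriptions)
print(best_model, errors)
```

Summing the errors is only one choice; an expert could equally well pick the model with the smallest worst-case error.
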

3.3. Model for clustering the trainers

Step 1: Create two sets:

Set 1 (Set of models): Models that have been found to contain similar users. Initial value = {}.
Set 2 (Set of trainers not yet clustered): Put all the trainers in Set 2

Step 2: Train an acoustic model M using audio files of trainers in set 2.

Step 3: Run recognition for the audio files of the trainers in set 2 ⇒ You get ε_i (i.e. Edit-distance between output phrase of recognizer and the true phrase). This means some of the audio files are used as held-out. It might also be interesting to have no held-out and see the error when the training and test data are the same.

Step 4: Find the maximal subset of the trainers whose ε_i values match for all the phrases ⇒ Train a model using these trainers and add it to Set 1 (after a sanity test, i.e. this model should be able to predict a random set of the audio files that it was trained on).

Step 5: Put the remaining trainers in Set 2 and go to Step 2.
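
The loop structure of Steps 1 to 5 can be sketched as below. This is one possible reading, not a definitive implementation: `train` and `error` are hypothetical callables standing in for acoustic-model training and recognition error, "trainers whose errors match" is simplified to "trainers whose error is under a threshold", and the toy data represents each trainer by a single made-up number.

```python
def cluster_trainers(trainers, train, error, threshold, max_models=100):
    """Greedy clustering of trainers into at most max_models models."""
    models = []                 # Set 1: models found so far
    remaining = list(trainers)  # Set 2: trainers not yet clustered
    while remaining and len(models) < max_models:
        model = train(remaining)                           # Step 2
        errors = {t: error(model, t) for t in remaining}   # Step 3
        group = [t for t in remaining if errors[t] <= threshold]  # Step 4
        if not group:
            # Nobody fits well: peel off the best-fitting trainer so the loop ends.
            group = [min(remaining, key=errors.get)]
        models.append(train(group))
        remaining = [t for t in remaining if t not in group]  # Step 5
    return models


# Toy stand-ins: a trainer is a name, its "voice" is a single number.
voices = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0, "e": 11.0}


def toy_train(names):
    return sum(voices[n] for n in names) / len(names)


def toy_error(model, name):
    return abs(voices[name] - model)


models = cluster_trainers(voices, toy_train, toy_error, threshold=4.5)
print(models)
```

With these toy values the first pass groups a, b and c, and the second pass groups d and e, giving two models.
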

4. References

[1] X. Huang and K.F. Lee. On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. Speech and Audio Processing, IEEE Transactions on, 1(2):150 –157, apr 1993.
[2] Geoffrey Zweig and Michael Picheny. Advances in large vocabulary continuous speech recognition. volume 60 of Advances in Computers, pages 249 – 291. Elsevier, 2004.

Cross-validation on SVM-Light

2012-02-02T19:34:00.000-05:00

Before reading this blogpost you should read: Practical guide to libSVM, Wikipedia entry on SVM, Wikipedia entry on cross-validation, and PRML - Bishop's chapter on SVM. In this post, I will not explain how to use open source SVM tools (like libsvm or svmlight), but will rather assume that you have used them and want some clever script to automate them. Also note that the script below uses default values for all parameters other than the polynomial degree and the RBF gamma that it sweeps. So if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to the program ./svm_learn.

To perform cross-validation on libsvm:

Download and extract it
You need gnuplot installed on the machine on which you want to run libsvm. If you want to run it on a server, do 'ssh -X myuserName@myMachineName'
make
Go to Tools folder and run 'python easy.py my_training_file my_test_file'

To perform cross-validation on svmlight:

Download and extract it. There you will find two important binaries: svm_learn and svm_classify
Unfortunately, gathering the results with svmlight is not easy, so I have coded the following script.

#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5

echo "Running SVM-Light via cross validation on $1 by using $2 training items and $3 test items (Total number of cross-validation cycles: $4)" > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"

TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: \([0-9]*\.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'
for k in `seq 1 $NUM_CYCLES`
do
 sort -R $DATA_FILE > $TEMP_DATA_FILE
 head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
 tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
 
 echo "------------------------------------------"  >> $RESULT_FILE
 echo "Cross-validation cycle:" $k >> $RESULT_FILE
 
 # first run svm with default parameters
 echo ""  >> $RESULT_FILE
 echo "Polynomial SVM with default parameters" >> $RESULT_FILE
 for i in 1 2 3 4 5 6 7 8 9 10
 do
  echo "order:" $i >> $RESULT_FILE
  ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

 echo ""  >> $RESULT_FILE
 echo "RBF SVM with default parameters" >> $RESULT_FILE
 for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
 do
  echo "gamma:" $g >> $RESULT_FILE
  ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

done

rm -f $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TRAIN_FILE $TEST_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE

Here is how you run this script:

./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt

myDataInSVMLightFormat.txt looks like "binaryClassifier(-1 or 1) featureNum:value ....". Example:

1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06

After running this script you will get two important files: MyResults.txt and betterMyResults.txt (which is a space separated file with format: 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt in R or Matlab and get some pretty graphs :)
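
If you prefer Python over R or Matlab for a first look, the sketch below averages the accuracy per kernel and parameter value across cross-validation cycles. It assumes only the four-column format described above; the sample lines are made up, and any line that does not fit the format is skipped.

```python
from collections import defaultdict


def mean_accuracy(lines):
    """Map (kernel, parameter) to the mean accuracy over all cycles."""
    totals = defaultdict(lambda: [0.0, 0])  # (kernel, param) -> [sum, count]
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # not a 'cycle kernel param accuracy' line
        _cycle, kernel, param, accuracy = parts
        try:
            value = float(accuracy)
        except ValueError:
            continue
        entry = totals[(kernel, param)]
        entry[0] += value
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample; in practice use: lines = open("betterMyResults.txt")
sample = [
    "1 poly 2 90.0",
    "2 poly 2 80.0",
    "1 rbf 0.1 70.0",
    "some stray line that is not a result",
]
summary = mean_accuracy(sample)
print(summary)
```
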

Using VIM for "fast" data transformation

2012-02-01T18:10:00.002-05:00

Recently, I had to give my data to a lab-mate. I had done some pre-processing on the 20 Newsgroups dataset and converted it to the following format:
- Each line represented a document.
- Words were replaced by an integer that denoted their index in the vocabulary.txt file.
- Each line was prefixed with a binary classifier (1 or -1) based on whether the document was in a "comp" newsgroup or a "sci" newsgroup.
- The words were represented along with their frequency.
For example, let's assume for now that the word "technology" is the 1002nd word in the vocabulary. Then, if the word "technology" appeared 10 times in the 12th document (in the folder comp.graphics), the 12th line in my file will be:
1 1002:10 ...

Here are two sample lines from the file:

1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1
1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1

Now, I had to convert this to the following format:
fileNum | wordIndex | frequency
For example:

3|3798|1
3|9450|1
3|12429|1
3|13352|1
3|14155|1

I used the following VIM tricks:
1. First delete the binary classifiers (and also blank spaces at the end of the line):

:%s/^-1//g
:%s/^1//g
:%s/ $//g

2. Then comes the most important part: insert the line number before every word.
To do this, I used the following trick:
- First replace spaces by some arbitrary character, in my case '='
- Then, replace that character by the line number !!!

:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"

The last command needs some explanation:

It says: execute a command globally on every line that contains the '=' character: :g/=/exec some-command

In our case the command replaces that character by the line number. The line number of a given line can be found using VIM's internal function: line(".")

Hence, the input file is converted into the following format:

3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
  4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1

3. Now, do some garbage cleaning and trivial replacement:

:%s/:/|/g
:%s/  / /g
:%s/^ //g

4. Finally, replace spaces by a newline character:

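:%s/ /!!!!!!/g where !!!!!! stands for pressing Control+V and then Enter

Control-V is the special-character escape key. Remember that '\n' won't work in the replacement here (Vim inserts a NUL character for it); typing '\r' in the replacement is an alternative that does insert a line break.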

And voila you are done !!!
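
For comparison, the same transformation can be sketched in a few lines of Python. This is just an illustrative equivalent of the VIM steps above, assuming every line starts with the class label (1 or -1) followed by wordIndex:frequency pairs; the function name is made up.

```python
def to_triples(lines):
    """Convert 'label idx:freq idx:freq ...' lines to 'fileNum|wordIndex|frequency'."""
    triples = []
    for line_number, line in enumerate(lines, start=1):
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the class label, which is dropped
            word_index, frequency = token.split(":")
            triples.append(f"{line_number}|{word_index}|{frequency}")
    return triples


sample = ["1 3798:1 9450:1", "-1 90:1 2101:2"]
result = to_triples(sample)
print("\n".join(result))
```
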

Iris - siri like software

2012-01-17T18:24:00.000-05:00

Recently I have started a pet project Iris¹ while doing background study for my recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.

Portable:
Since it is written in Java, it can run on most operating systems, including Mac, Windows and Linux. I have only tested it on Mac OS X Lion, so I welcome any beta testers using other operating systems. For now, some of the features like playing iTunes, opening email, etc. are only available on Mac. I plan to add these features on other operating systems as well.

Configurable:
Most of the features, such as the use of VOICE or TEXT as the feedback mechanism, as well as the feedback messages, are configurable through the configuration file "user_scripts/iris.config". You can also modify any of the predefined Apple scripts (e.g. playSong.scpt) in that folder, which will be called by Iris on the corresponding voice command. Also, for advanced users, there are ten function commands that they can use to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run the following command:

 
java -Xmx200m -jar bin/Iris.jar

The above command asks Java to run the program Iris.jar (in bin folder) and also specifies that it should not use more than 200 MB of memory.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if you have no internet connection. This also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators who would help me with the following features:

User-friendly: Improving voice commands
More accuracy: Expanding dictionary to include more words and also to add different 'user accents' for a given word.
Expanding grammar to include arbitrary spoken words.
Apple scripts for accessing contacts and recognizing contact names to assist users with various functionality.
Adding a GUI rather than the command line. Since all the responses go through OutputUtility.java, this is going to be easy.
New features: Support for RSS, Top News, Weather, Popular city names, Dictionary/Thesaurus, Web/Wikipedia search, Dictation (to text/messages/events/todo item/alarm/reminders), Jokes, Motivational quotes, More Personalization.

If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
¹ Iris is the name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also an anagram of the word 'Siri' (Apple's popular software for the iPhone 4S).

C++ workflow

2012-01-02T00:02:00.001-05:00

Recently I had a memory bug that took me 2 days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to a completely random place that led me to believe STL's stringstream was the culprit. Since there was no way I could definitively say it was not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of the code using a less elegant solution based on arrays. Now gdb started pointing to STL's vector.clear() method and, believe it or not, I was storing std::string in it, not some user-defined class …. I had had it … there was no way in the world I was going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector<std::string>). But not being able to find the bug in a 'single threaded' 'command-line based' C++ code with access to gdb and pretty printing was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and a hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me: due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is that this is one of the few memory bugs that valgrind does not report in its final assessment, and my best guess is that I was writing over the vector's header location (and the vector only complains about that during its next operation, which in my case was clear() ... and which for all intents and purposes is random).

Embarrassed by this stupid mistake that wasted two days, I felt humbled enough to revisit my C++ workflow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (make sure you configure it to conform to your style rules).
- Build with the DEBUGFLAGS set in your Makefile so that the compiler performs basic static analysis (make my_proj).
- Test it on local machines using 'make my_test'.
- Run valgrind if necessary and check the output against the Valgrind checklist given below.
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

Never ever start coding without a version control system (SVN, CVS, git, etc.). It will save you a lot of headaches if your machine crashes or if you want to revert to a previous version.
If the project has multiple coders/testers/collaborators, it helps to have a project-management tool like trac or jira. Also, if you are going to release your code to the public, you might want to consider project hosting sites like Google Code or SourceForge.
Though this might seem obvious, don't host the svn server on a personal machine, or at least make sure to take regular backups if you do.

Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc.) and a user-friendly README file. (Advanced coders can consider automake.) Here is a sample Makefile from my current project:

 
CC = g++
DEBUGFLAGS = -g -Wall -Wextra
MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
LIBS = -lgsl -lgslcblas -lm -lstdc++

HEADER = Helper.h MathHelper.h LDA.h Configuration.h
SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
OUTPUT = spoken_web

spoken_web:     $(SRC) $(HEADER)
        $(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)

doc:    $(HEADER) spokenweb_doxygen_configuration
        doxygen spokenweb_doxygen_configuration 

r_data: GenerateData.R
        R --slave < GenerateData.R

style: $(SRC) $(HEADER)
        astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)

clean:
        rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt

Use a document generator like doxygen. It forces you to write much cleaner documentation and, more importantly, a cleaner interface. Trust me, it will help if you ever have to reuse your code in the future.
```
make doc
```
Use a code beautifier like astyle. It not only shields you against an inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading' (you clean the gym equipment after a workout, so why not do the same with your code?). Here are some sample style rules that I love:
```
 
--style=java
--indent=spaces=2
--indent-classes
--indent-switches
--indent-preprocessor
--indent-col1-comments
--min-conditional-indent=0
--max-instatement-indent=80
--unpad-paren
--break-closing-brackets
--add-brackets
--keep-one-line-blocks
--keep-one-line-statements
--convert-tabs
--align-pointer=type
--align-reference=type
--lineend=linux
--suffix=none
```
If your company follows other style guidelines or if you don't want to change the existing formatting, remove '--suffix=none' from the style rules so that astyle keeps the original unformatted files (*.orig), which you can restore once you are done.
```
make style
```
Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
- g++'s flags (See DEBUGFLAGS in my makefile)
- Static analysis tools like cppcheck (for most cases, I am satisfied with the above g++ flags)
- Runtime memory analyzer tools like valgrind (please see the Valgrind checklist given below, and also my previous blogpost).
- gdb: for stepping through the code for more comprehensive debugging
There are several paid tools that do a much better job of analyzing your code than those mentioned above.
I like to follow a standard directory structure:
./ Makefile and configure scripts
./src .cc and .h files (use subdirectories and namespaces if necessary)
./bin build directory
./data data required for testing/running the code
./test test drivers
./doc documentation generated by doxygen

Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have a good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for Bayesian programs; it has a very ugly interface and requires some amount of book-keeping. So, I have coded a wrapper class that gives an R-like interface, which makes my code extremely readable and less error prone, and I don't need to worry about book-keeping or other numerical issues while calling it.

 
/// \defgroup StatDistribution R-like API for family of probability distributions
/// Naming convention:
/// I have followed the naming conventions of R programming language. 
/// Please see its documentation for further details
///
/// Here is a brief intro:
/// Every distribution that R handles has four functions (prefixes d, p, q and r).
/// There is a root name, for example, the root name for the normal distribution is norm.
/// This API provides the two functions obtained by prefixing the root with
///   - d for "density", the density function
///   - r for "random", a random variable having the specified distribution
/// Example: For the normal distribution, these functions are dnorm and rnorm.
///
/// Important note: Since most Bayesian calculation is done in log space, (almost)
/// every density function is capable of outputting a 'numerically robust' logarithmic value.
/// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
///
/// Supported continuous distributions:
/// - Uniform (unif - r,d)
/// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
/// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
/// - Beta (beta - r,d)
/// - Inverse Gamma (invgamma - r,d)
/// - Dirichlet (dirichlet - r,d)
///
/// Supported discrete distributions:
/// - Discrete Uniform (dunif - r,d)
/// - Multinomial (multinom - r,d)
/// - Poisson (pois - r,d)
/// @{

/// @name Continuous Uniform Distribution
///@{
/// Returns uniform variable [min, max)
double runif(double min, double max);
double dunif(double x, double min, double max, bool returnLog = false);
///@}

/// @name Discrete Uniform Distribution
///@{
/// Returns uniform variable [min, max) --> i.e. [min, max-1]
int rdunif(int min, int max);
double ddunif(int x, int min, int max, bool returnLog = false);
///@}

/// @name Univariate Normal Distribution
///@{
double rnorm(double mean, double sd);
double dnorm(double x, double mean, double sd, bool returnLog = false);
///@}

/// @name Multinomial distribution
/// Both prob and retVal should be of length sizeOfProb
/// size, say N, specifying the total number of objects that are put into 
/// K (or sizeOfProb) boxes in the typical multinomial experiment.
/// If the array prob is not normalized then its entries will be treated as 
/// weights and normalized appropriately. Furthermore, this function also allows 
/// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
/// Eg: double prob[3] = {0.1, 0.2, 0.7}; unsigned int retVal[3]; rmultinom(1, 3, prob, retVal, false);
///@{
void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
///@}

So to sample N(10,2), instead of having code like:

 
gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
gsl_rng_free(rng_obj);

I can use the following code (which, if you have coded in R, is more readable):

 
StatisticsHelper myStatHelper;
double sampledVal = myStatHelper.rnorm(10, 2);
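
The advice in the header comments about asking for the log density directly, instead of taking the log of the density, can be illustrated with a small sketch. It is written in Python for brevity; the name `dnorm` and the `return_log` flag only mirror the wrapper's interface and are an assumed re-implementation, not the wrapper's actual code.

```python
import math

def dnorm(x, mean, sd, return_log=False):
    """Normal density; the log density is computed directly, never via log(exp(...))."""
    z = (x - mean) / sd
    log_density = -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
    return log_density if return_log else math.exp(log_density)

# Far out in the tail the plain density underflows to 0.0, so its log is undefined,
# while the log-space value stays finite.
x = 100.0
plain = dnorm(x, 0.0, 1.0)                      # underflows to 0.0
robust = dnorm(x, 0.0, 1.0, return_log=True)    # roughly -5000.92
```

Calling `math.log(plain)` here would raise a `ValueError`, which is the kind of numerical issue the `returnLog` argument is meant to avoid.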

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:

The heap summary (only returns memory leaks, not memory corruption errors) Eg:
```
 
==19691== HEAP SUMMARY:
==19691==     in use at exit: 0 bytes in 0 blocks
==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
==19691== 
==19691== All heap blocks were freed -- no leaks are possible
```
If the number of allocs and the number of frees differ in the output, then try
valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

Steps to avoid memory leaks:
- If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
- Delete every object created by new (especially in case of exceptions).
Intermediate warning/error messages irrespective of the heap summary. Eg:
```
 
==15633== Invalid write of size 4
==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
==15633==    by 0x402E26: main (main.cpp:34)
```
Note that sometimes one error can propagate and trigger error messages at different places, so don't panic.

The intermediate error messages can detect:
- Illegal read / illegal write errors (like the message 'Invalid write of size 4' above): This can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with the '--read-var-info=yes' flag.
  Common memory corruption bugs like these are:
  - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
  - Using an unallocated object
```
 
my_class* obj_ptr;
obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
```
  - Using a freed object
  Note: Memcheck (a valgrind tool) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there are no memory corruption errors.
- Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
- Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
- Illegal free/deletions: (Invalid free())
  - Calling free() (or delete) on the object that is already freed (or deleted).
  - Deleting object that is created on stack.
- Mismatched operators: (Mismatched free() / delete / delete [])
  - Calling free() on the object that is created by new operator.
  - Calling delete on the object that is created by malloc().
  - Calling delete[] on the object that is created by the new operator.
- Overlapping source and destination blocks: (Source and destination overlap in ...)
  - Look out for functions like memcpy, strcpy, strncpy, strcat and strncat.

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

How to analyze algorithms ?

2011-10-11T14:59:00.001-04:00

Important assumptions:

This process of analysis should be independent of programming language and machine architecture. So you can't just write your algorithm in C++ and compare its running time against that of some other algorithm written in Java.
Also, since an algorithm can take an arbitrary data structure as input (eg: graph, array, binary object, tree, ...), there has to be some standard way to specify it. For the sake of our analysis, we use the input size (n) as the measure. For eg: length of an array, number of vertices or edges of a graph, number of nodes of a tree, etc.

A detailed explanation of these two points is given in Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein (CLR).

Setup:

Input of size n ==> Algorithm that takes T(n) steps ==> some output

The number of steps in the algorithm often depends on factors other than the input size. For eg: the number of steps for quicksort on the input [1, 2, 3, 4, 5] can differ from that on the input [5, 4, 3, 2, 1], even though both have the same input size (and in this case the same elements).
Hence, we will try to find 3 functions (upper bound, lower bound and Theta bound) that can capture the effects of these factors: $$ g_1(n), g_2(n) \text{ and } g_3(n) \text{ such that } $$ $$ T(n) \in O(g_1(n)), T(n) \in \Omega(g_2(n)) \text{ and } T(n) \in \Theta(g_3(n))$$ Many authors also write above equations as: $$ T(n) = O(g_1(n)), T(n) = \Omega(g_2(n)) \text{ and } T(n) = \Theta(g_3(n))$$

Big-O notations:

Now, let's understand the definition of the above notations: \begin{align} \text{To prove that } & g_1(n) \text{ is an upper bound of } T(n) \text{, i.e. } T(n) \in O(g_1(n)), \\ \text{ find two constants } & c > 0, n_0 \text{ such that } \\ & \forall n \geq n_0, T(n) \leq c g_1(n) \end{align} For the lower bound, change the ≤ sign to ≥, and for the theta bound find both a lower and an upper bound: $$ T(n) \in \Theta(g_3(n)) \text{ iff } T(n) \in O(g_3(n)) \text{ and } T(n) \in \Omega(g_3(n))$$
To understand these in more detail, I suggest you solve problems from CLR chapter 3. Here are some examples you can start with: \begin{align} 2n^2 \in O(n^3) & \text{ with } c=1 \text{ and } n_0 = 2 \\ \sqrt{n} \in \Omega(\lg n) & \text{ with } c=1 \text{ and } n_0 = 16 \\ \frac{n^2}{2} - 2n \in \Theta(n^2) & \text{ with } c_1=\frac{1}{4}, c_2=\frac{1}{2} \text{ and } n_0 = 8 \end{align} (Note: very large functions such as n^n are trivially upper bounds, and very small ones such as a constant are trivially lower bounds, for most running times. So, it is necessary to supply as tight a bound as possible for any given algorithm. Mathematically, small o and small omega denote upper and lower bounds that are not asymptotically tight, but we will ignore them for the sake of this discussion)
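
The constants in these three examples can be spot-checked numerically. The sketch below only tests a finite range of n, so it is a sanity check rather than a proof; the helper name `holds_from` and the cutoff `n_max` are made up for this illustration.

```python
import math

def holds_from(n0, predicate, n_max=10_000):
    """True if predicate(n) holds for every integer n in [n0, n_max]."""
    return all(predicate(n) for n in range(n0, n_max + 1))

# 2n^2 in O(n^3) with c = 1 and n0 = 2
upper_ok = holds_from(2, lambda n: 2 * n**2 <= 1 * n**3)

# sqrt(n) in Omega(lg n) with c = 1 and n0 = 16
lower_ok = holds_from(16, lambda n: math.sqrt(n) >= 1 * math.log2(n))

# n^2/2 - 2n in Theta(n^2) with c1 = 1/4, c2 = 1/2 and n0 = 8
theta_ok = holds_from(8, lambda n: n**2 / 4 <= n**2 / 2 - 2 * n <= n**2 / 2)
```

Starting any of the checks one step earlier (n0 = 1, 15 and 7 respectively) makes it fail, which shows that the chosen n0 values are the smallest that work with these constants.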

Common measures for algorithms:

The above three notations are often loosely associated with the terms 'worst case', 'best case' and 'average case' complexity, but those terms have their own, more precise definitions: \begin{align} \text{Worst case: }& T(n) = \max \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Best case: }& T(n) = \min \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Average case: }& T(n) = \sum_{\text{input } x \text{ of size } n} Prob(x) T(x) \end{align} This means that to find the average case complexity of an algorithm, you need the probability of each input of size n and use it to compute the expectation of T(x).
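
As a small illustration of these three definitions, the sketch below counts comparisons made by a sequential search. The input distribution (the target is equally likely to be at each of the n positions) is an assumption chosen for the example, and the function name `comparisons` is made up.

```python
from fractions import Fraction

def comparisons(items, target):
    """Number of element comparisons a sequential search makes before it stops."""
    for count, item in enumerate(items, start=1):
        if item == target:
            return count
    return len(items)

n = 8
items = list(range(n))
# One instance per possible target position, each with probability 1/n (assumed).
costs = [comparisons(items, target) for target in items]

worst = max(costs)                              # max over instances of size n
best = min(costs)                               # min over instances of size n
average = sum(Fraction(c, n) for c in costs)    # sum of Prob(x) * T(x)
```

Under this uniform assumption the average works out to (n + 1) / 2; a different input distribution would give a different average while leaving the worst and best cases unchanged.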

Pseudo code to Big-O:

Now, let's address the key question: given pseudo code, how do we find its lower bound (or upper bound or ...)? Here is one possible way to find those bounds:

First, express the algorithm in recursive form
Determine appropriate metric for input size 'n'
Then, find recurrence relation for the algorithm
Solve the recurrence relation using either Substitution method, recursion tree method or the master theorem

For example, the recurrence relations for popular algorithms are: \begin{align} \text{Binary search: } & T(n) & = T(\frac{n}{2}) + O(1) \\ \text{Sequential search: } & T(n) & = T(n-1) + O(1) \\ \text{Tree traversal: } & T(n) & = 2T(\frac{n}{2}) + O(1) \\ \text{Selection sort: } & T(n) & = T(n-1) + O(n) \\ \text{Merge sort: } & T(n) & = 2T(\frac{n}{2}) + O(n) \end{align} Try writing the pseudo code for any of the above algorithms and verify its recurrence relation.
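
One way to verify such a recurrence is to instrument the code and compare the measured count with what the recurrence predicts. The sketch below does this for binary search, counting recursive calls as the unit of work (an assumption); the helper `worst_case_calls` is made up for the illustration.

```python
def binary_search(sorted_items, target, lo=0, hi=None, calls=1):
    """Recursive binary search on [lo, hi); returns (index or -1, calls made)."""
    if hi is None:
        hi = len(sorted_items)
    if lo >= hi:
        return -1, calls
    mid = (lo + hi) // 2
    if sorted_items[mid] == target:
        return mid, calls
    if sorted_items[mid] < target:
        return binary_search(sorted_items, target, mid + 1, hi, calls + 1)
    return binary_search(sorted_items, target, lo, mid, calls + 1)

def worst_case_calls(n):
    """Calls made when the target is larger than every element of a size-n array."""
    return binary_search(list(range(n)), n)[1]

# For n = 2^k - 1 the search halves the range each time: T(n) = T((n - 1) / 2) + 1
# with T(0) = 1, which solves to lg(n + 1) + 1, i.e. logarithmic growth.
observed = {n: worst_case_calls(n) for n in (1, 3, 7, 15, 31, 63)}
predicted = {n: (n + 1).bit_length() for n in observed}
```

The two dictionaries agree, which matches the T(n) = T(n/2) + O(1) entry in the list above.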

Master theorem:

The master theorem can only solve recurrence relations of the form T(n) = a T(n/b) + f(n), with constants a ≥ 1 and b > 1. Here is my modus operandi for the master theorem: \begin{align} \text{If } f(n)=\Theta(n^{\log_{b}a}\lg^{k}n) \text{ for some } k \geq 0, & \text{ then } T(n) = \Theta(n^{\log_{b}a} \lg^{k+1}n) \\ \text{Else if } n^{\log_{b}a} \text{ is polynomially greater than } f(n), & \text{ then } T(n) = \Theta(n^{\log_{b}a}) \\ \text{Else if } f(n) \text{ is polynomially greater than } n^{\log_{b}a} \text{ (and } a f(n/b) \leq c f(n) \text{ for some } c < 1 \text{)}, & \text{ then } T(n) = \Theta(f(n)) \\ \text{Else use the substitution method}& \end{align}
Here are solutions to two important recurrence relations that cannot be solved using the master theorem (for constants b > 0, k ≥ 0 and c > 1): \begin{align} \text{Recurrence relation} & & \text{Complexity} \\ T(n) = T(n-1) + b n^k & \Rightarrow & O(n^{k+1})\\ T(n) = cT(n-1) + b n^k & \Rightarrow & O(c^n) \end{align}
For recurrence relations of other forms, refer to CLR for the substitution and recursion tree methods.
(In short, for the substitution method, you guess the solution and use induction to find the constants and to prove that the guessed solution is correct.
The recursion tree method is often used to come up with the guess for the substitution method.)
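
The master theorem lends itself to a small helper. The sketch below assumes the driving function has the restricted form f(n) = Θ(n^d lg^k n) with d ≥ 0 and k ≥ 0, for which the regularity condition of the third case always holds; the function name and the (p, q) return convention, meaning Θ(n^p lg^q n), are choices made for this example.

```python
import math

def master_theorem(a, b, d, k=0):
    """Solve T(n) = a T(n/b) + Theta(n^d lg^k n); returns (p, q) for Theta(n^p lg^q n)."""
    crit = math.log(a) / math.log(b)   # critical exponent log_b(a)
    if math.isclose(d, crit, abs_tol=1e-12):
        return (d, k + 1)              # f matches n^(log_b a) up to log factors
    if d < crit:
        return (crit, 0)               # n^(log_b a) is polynomially greater
    return (d, k)                      # f(n) is polynomially greater

merge_sort = master_theorem(2, 2, 1)       # T(n) = 2T(n/2) + O(n)
binary_search = master_theorem(1, 2, 0)    # T(n) = T(n/2) + O(1)
tree_traversal = master_theorem(2, 2, 0)   # T(n) = 2T(n/2) + O(1)
```

Here `merge_sort` comes out as (1, 1), i.e. Θ(n lg n), `binary_search` as (0, 1), i.e. Θ(lg n), and `tree_traversal` as (1, 0), i.e. Θ(n).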

Some useful approximations:

For non-negative real c, d (and bases greater than 1 in the exponential terms, i.e. c > 1 in the third sum and b > 1 in the last one):
\begin{align} \text{Polynomial growth rate:}& \sum_{i=1}^n i^c &= \Theta(n^{c+1}) \\ \text{Logarithmic growth rate:}& \sum_{i=1}^n \dfrac{1}{i} &= \Theta(\log(n)) \\ \text{Exponential growth rate:}& \sum_{i=1}^n c^i &= \Theta(c^n) \\ \text{n-Log(n) growth rate:}& \sum_{i=1}^n \log(i)^c &= \Theta(n \log(n)^c) \\ \text{Combination of above:}& \sum_{i=1}^n \log(i)^c i^{d} &= \Theta(n^{d+1} \log(n)^c) \\ &\sum_{i=1}^n \log(i)^c i^d b^{i} &= \Theta(n^{d} \log(n)^c b^{n}) \\ \text{Stirling's approximation:}& n! &= \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n \left(1 + \Theta\left(\frac{1}{n}\right)\right) \end{align}

Presentation on Online Aggregation for Large MapReduce Jobs

2011-09-29T23:26:00.003-04:00

This post has two purposes:
1. Give information about my Comp 600 presentation at Rice University
2. Enumerate the lessons learnt while preparing and giving the presentation.

(If the above videos do not work, click on: Dailymotion or Vimeo)

Also, if you have time, go through my slides or better still read my paper :)
- Slides
- Paper

FAQs during the presentation at VLDB:

1. What if the blocks are equally sized and spread out uniformly ?
- We used exactly that configuration. The input blocks of the Wikipedia logs are approximately the same size and HDFS does a good job of spreading the data uniformly. Still, we found biased results in the beginning.

2. Does your system replace a general database ?
- Nope. As of now, it only performs single-table aggregation with group-by. For a general online database, see the DBO paper by our group.

3. Will you release your code ?
- Yes, we intend to release our code as open-source. For the current version, see http://code.google.com/p/hyracks-online-aggregation/

4. Who do you see as potential users of the system ?
- Any company that has massive data stored in HDFS (logically as a single table) and wants to perform aggregation in an online fashion or only up to a certain accuracy. Note that the table containing the data need not be pre-randomized.

5. What does a user of the system need to know ?
- Just how to write and run a hadoop job ... All he/she has to do is write a hadoop mapper and set the appropriate parameters in the job configuration. For details, see the appendix of our paper.

6. You didn't explain too much statistics in the talk ... I want to know more
- Yes, you are absolutely right. In fact, that was intentional :) ... If I have to give a one-line summary of the statistics, here it is: sample from a multivariate normal over value, processing time and scheduling time (with a correlation factor between each pair of them), taking into account the 'lower bound' information on the processing time of the blocks that are currently being processed ... rather than simply sampling from the distribution over the data.

7. You don't talk about performance ... Isn't randomized scheduling policy bad for performance ?
- We didn't talk about performance, mainly because we did not provide any comparative performance study in our paper. Note that our policy is not the same as a strict randomized policy. We expect the master to maintain a randomized queue and start the timer when a worker asks for a block. If for some reason (say locality) the worker does not want that block, or the master has some internal (locality-based) scheduling algorithm, it can leave the timer ON and go for the next block in the randomized queue. This is necessary machinery for the sampling theory. This will yield performance similar to locality-based scheduling. However, in our current implementation, we have not used this mechanism. We intend to do the performance study in our future work.

Presentation tips:

Clearly my presentation has a lot of room for improvement, for example: I need to talk slower, use a more uniform slide layout, interact more with the audience, etc. However, I have learnt a great deal while preparing and giving this presentation. I will summarize a few points that I (and probably a lot of graduate students) need to be careful about. For a more detailed analysis of these points, I suggest you read Presentation Zen, Slide:ology and When the Scientist Presents.

Step 1: Planning your presentation
1. Create a relaxing environment
- Go to a library, coffee place, garden (place where no one will disturb you)
- Take time to relax and forget about daily chores.
- Don't bring your laptop. In this step, you will create rough sketch of the presentation on paper (not slideware)
- Lot of the time, changing locations helps you to be more creative.

2. Make sure you understand the big picture of your presentation.
Can you give a 30-45 second answer to the following questions:
- What is the single most important "key" point that I want my audience to remember?
- Why should that point matter to the audience?

Also, make sure that your presentation title has keywords that highlight the big picture, as many people (if not all) will come to your talk after reading just the title (and abstract) or your name. This does not mean the title has to be long; in fact, shorter and thought-provoking titles are preferred.

3. Now, create a sketch of the presentation. This involves enumerating the key ideas you will be presenting and also organising/ordering them. You can use any or all of the following tools:
- Mindmap on whiteboard or paper
- Post-it (use sharpie to make sure you don't misuse post-it)
- Brainstorm the list of ideas + Grouping them
- Sketching or drawing
Nancy Duarte says the most effective presentations involve telling a story that the audience can connect and relate to, rather than just presenting the information/data. Also, the best professors are good story-tellers. They don't just go through the material in the book, but put their own personality, character and experience into the material in the form of a narrative, which is illuminating, engaging and memorable.

The most useful point I have learnt from Presentation Zen is that the presentation consists of three parts:
- Slides the audience will see (which you will create in Step 2)

- Notes only you will see (which you have created in this Step)

- Handouts that audience can take away (so that they don't have to jot down key definition or formula). This in worst case could simply be a copy of your paper.

This essentially means that the slides are not a teleprompter that you read aloud throughout your presentation.

Step 2: Design the slides
To succeed as a presenter, you need to think like a designer (from Slide:ology).

1. First few slides are very very important !!!
- I like to use them to tell the audience the big picture of the presentation.
- They also need to catch the audience's attention, as we all know the attention span of an audience decreases exponentially after the first few minutes. Don't spend the first few minutes reading through the 'Outline' slide.
- Use the following principle: Tell them what you are going to present, present it, then tell them again what you presented :)

2. Slides should be as simple as possible.
- Don't use fancy, distracting background images. I prefer sticking to simple white background slide.
- Good design has plenty of empty spaces.
- Avoid typical bullet point presentations. They are boring and they do more harm than good.
- Don't use 3D figures when 2D figures can suffice
- Never ever clutter the slides with useless data, words, figures, images, etc

3. 1 message per slide ... not more !!
- Do you understand that message ?
- Does your slide reflect that message ? You don't have to write down the message into your slide
- Can I make this slide simpler and still make sure that the message is conveyed? In most cases, this is done by replacing a bunch of text with an appropriate image or animation.
- A lot of the time, the entire message can be conveyed just by a quote from a reliable or important person.

4. Try to find a good figure or animation to convey a complex idea, algorithm or system. Here are some tips on creating animation:
- Using the same slide background, create many slides (by copying). Change only 1 detail per slide. When you move from one slide to the next, it gives the effect of animation, something like a flip-book. However, this process creates a lot of slides (which is problematic in many cases). So, if you want to fit everything into 1 slide, you can create a flash object, gif file or avi file and embed it into the slide. Most people use Adobe tools to create professional looking flash files, but here is a poor man's solution:
- Use the flip-book style slide transition technique and a desktop recorder (eg: recordmydesktop) to create a movie. It usually creates the movie in ogv file format.
- Also, a lot of the time you need to create an animation using a statistical distribution. In that case, use the animation library of R and create an ogv movie using the desktop recorder. (This library also allows you to store the animation as a gif, but I am having some problems with it.)

# R code for animation
library(animation)
ani.options(interval = 1, nmax = 20)
sd1 <- 41
for (i in 1:ani.options("nmax")) {
    x=seq(900, 1100,length=200)
    y=dnorm(x,mean=1000,sd=sd1)
    plot(x,y,type="l",lwd=2,col="red")
    ani.pause() ## pause for a while ('interval')
    sd1 <- sd1 - 2
}

a. Convert OGV to AVI:
mencoder input.ogv -ovc xvid -oac mp3lame -xvidencopts pass=1 -speed 0.5 -o output.avi
b. Convert OGV to GIF (or swf):
ffmpeg -i input.ogv -loop_output 0 -pix_fmt rgb24 -r 10 -s 320x240 output.gif
"-loop_output 0" means loop forever.

5. Is the content tailored to meet the audience's expectations:
- Employers/future collaborators will try to evaluate you rather than the topic.
- Students (undergrads/first year grads/ ..) or people from a different area want to understand the problem and the intuition behind the solution. Hence, they are more interested in the motivation section.
- Co-authors will look for recognition.
- Sponsors will look for acknowledgement.
- People who work in the same area will try to see how your system/solution compares to theirs (i.e. related work), or may be there to look for a new problem (i.e. your future work).

6. Other design tips:
- Use contrast (using font type, font color, font size, shape, color, images) to accentuate the difference. (If it is different, make it very different - Garr Reynolds). It helps the audience identify the message of the slide quickly.
- Repetition: Use same image (or anchor) to represent the same concept throughout the presentation.
- Alignment: Use grids, invisible lines. Audience needs to know the order in which to process the information.
- Proximity
- Hierarchy, Unity
- Use high quality graphics. There are websites that create and sell them like www.istockphoto.com; or even give them for free like http://www.flickr.com/creativecommons, etc.
- Nothing distracts the audience more than a cluttered slide.
- Make sure your font size is greater than 24 point (so that audience can read it from the projector)
- Use comics ... they not only convey the information efficiently, but also make the presentation playful. Though many of my colleagues disagree with this, I believe most audiences want to have fun even if it is a research presentation .. so lighten the mood, be creative and "artsy" with your slides :)

"The more strikingly visual your presentation is, the more people will remember it. And more importantly they will remember you" - Paul Arden.
Here are some slides that are well designed:
http://www.slideshare.net/jbrenman
http://www.slideshare.net/chrislandry
http://www.slideshare.net/gkawasaki
http://www.slideshare.net/garr

Step 3: Rehearse and re-rehearse
- Use both notes and slides.
- Rehearse until you can remember and are comfortable with them. Imagine the worst case scenario that you lost your laptop, slides and all your notes ... and you have to give your presentation !!! ... If you can do that, you are ready :)
- Here you also do the job of editing, if you feel certain slides are just fillers and can be removed ... be brutal and delete them.
- Listen to http://www.ted.com/talks for motivation on how to present.

Step 4: Delivery
- Go to the presentation room 15 min before the scheduled time and make sure your slides are compatible, projector works, ... Also, bring a bottle of water, just in case.
- Leave the lights on and face the audience. People are there to see you, not to read through the slides. Use a wireless presenter so that you don't need to hide behind the podium/monitor.
- Slides are not a teleprompter ... so don't keep staring at the monitor ... connect with the audience.

Remember these two quotes:
- Every presentation has potential to be great; every presentation is high stakes; and every audience deserves the absolute best - Nancy Duarte.
- You are presenting yourself rather than the topic. So, negative impact of poor presentation far exceeds failure of getting your point across ... Jean-Luc Lebrun

Thinking in Probability

2011-08-15T01:18:00.002-04:00

This blogpost is based on a lecture I gave at Jacob Sir's classes yesterday. The intent of my lecture was to urge students to ask questions and read "stuff" beyond the textbook. To get the students interested, I started with 2 examples, which explained dependent and independent events and also the need to stick to mathematical rigor rather than intuition.

The first example was the classic Monty Hall problem, which surprisingly none of my students had heard about. This was fortunate because it meant everyone would have to think about it on their own, rather than provide a prepared answer (thought through by someone else). So, here is the problem:

There are three doors D1, D2 and D3; behind two of them there is a goat and behind the other is a car. The objective of the game is to win the car. So, you have to guess which door to choose ... Does it make a difference whether you choose D1, D2 or D3 ? ... The consensus was NO, because Prob(win)=0.33 for each of the doors. To make the problem interesting, the game show host (who knows which door has the car and which doors have goats) opens one of the remaining doors that has a goat. He now asks whether, based on this information, you would like to switch your earlier choice. For example, say you chose D1 earlier and the game show host opens D2 and shows that it has a goat. Now you can either stick to D1 or switch to D3. The real question we are interested in is:

Does switching doors make more sense, or staying with the same door? Or does it not matter whether you choose D1 or D3?

Interesting but incorrect answers:

1. It does not matter whether you choose D1 or D3, because Prob(win) for each door is now 0.5

2. Staying with the door makes more sense, since we have increased the Prob(win) from 0.33 to 0.5

3. We really don't have plausible reason to switch, hence stay with the same door; especially since the game show host may want to trick us.

The correct answer is that switching doubles Prob(win), hence it makes more sense.

Consider the case where you don't switch:

Prob(win) = Prob(D1 = car) = 0.33 (The probability does not change because of an event outside its scope)

Now consider the case where you switch:

Prob(win) = Prob(D1 = goat) = 0.66

The other way to understand the problem is by enumerating

D1 : C G G

D2 : G C G

D3 : G G C

Swap: G C C => Prob(car) = 0.66

Stay: C G G => Prob(car) = 0.33
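
The enumeration above can also be done mechanically. The sketch below computes the exact win probability for both strategies, assuming the car position and the first pick are each uniform over the three doors; the function name `win_probability` is made up for the example.

```python
from fractions import Fraction
from itertools import product

def win_probability(switch):
    """Exact Prob(win) for the stay (switch=False) or switch (switch=True) strategy."""
    doors = (0, 1, 2)
    total = Fraction(0)
    for car, pick in product(doors, doors):
        # The host opens a goat door other than the pick. When he has two such doors
        # (pick == car), either choice leads to the same outcome, so take the first.
        opened = next(d for d in doors if d != pick and d != car)
        final = next(d for d in doors if d != pick and d != opened) if switch else pick
        if final == car:
            total += Fraction(1, 9)   # each (car, pick) pair has probability 1/9
    return total

stay = win_probability(switch=False)    # Fraction(1, 3)
swap = win_probability(switch=True)     # Fraction(2, 3)
```

Using exact fractions avoids the sampling noise of a random simulation, so the 1/3 versus 2/3 split shows up directly.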

For people who chose the third incorrect answer, I asked them not to use information not provided in the problem, and to show as much restraint as possible in using intuition over logic. Deductive thinking says go from Step 2 to Step 3 only if there is a valid theorem, axiom, or logical reason (not intuition).

Let's move to the second problem:

Say I have a fair coin. I toss the coin four times and I get {Head, Head, Head, Head}. What will you bet on for the fifth toss?

Some students initially based their answer on the gambler's fallacy, stating that Tail is more likely since they have seen four consecutive heads. I then asked them to sit in a circle, discuss with each other, and then answer my question again. Through this exercise, I wanted them to learn the idea of collaborating to get an answer. And voilà ... they came up with the correct answer ... It does not matter whether we bet on head or tail, since the tosses are INDEPENDENT and hence each outcome is equally likely => Prob(T5 | H1, H2, H3, H4) = Prob(T5) = 0.5
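
The independence argument can also be checked by brute force. This is a small sketch, assuming a fair coin so that all 32 sequences of five tosses are equally likely; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_tail_after_four_heads():
    """Return Prob(T5 | H1, H2, H3, H4) by counting equally likely sequences."""
    # All 2**5 sequences of five fair tosses are equally likely.
    sequences = list(product("HT", repeat=5))
    # Condition on the first four tosses being heads.
    four_heads = [s for s in sequences if s[:4] == ("H", "H", "H", "H")]
    # Among those, count the sequences whose fifth toss is a tail.
    tail_fifth = [s for s in four_heads if s[4] == "T"]
    return Fraction(len(tail_fifth), len(four_heads))

print(prob_tail_after_four_heads())
```

Only two sequences start with four heads, one ending in a head and one in a tail, which is why the earlier tosses carry no information about the fifth.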

Finally, I gave them two different explanations of probability:

1. Deterministic view for probability (explained in my earlier blogpost)

2. Measure theoretic approach

After explaining the deterministic view, I emphasized the importance of probability in solving real-life problems. For example, in a coin tossing experiment, you are able to make some rational predictions even though you don't know the (hidden) parameters like the initial angle of the coin, its weight, density and shape, its velocity, air resistance, the characteristics of the surface where it lands, etc ... Similarly, you can build simpler probabilistic models for complex scenarios like weather forecasting, stock prediction, traffic congestion, ...

Reference:

1. http://niketanblog.blogspot.com/2009/11/god-does-not-play-dice.html

2. http://en.wikipedia.org/wiki/Monty_Hall_problem

3. http://en.wikipedia.org/wiki/Gambler's_fallacy

4. http://en.wikipedia.org/wiki/Measure_(mathematics)

Nice post about Sachin Tendulkar after India-SA match

2011-03-15T17:31:00.003-04:00

Remember when you failed an examination. How many people recall that, your class, friends, relatives? You failed to make it to the IITs or IIMs. Who remembers? How many times have you had the feeling of being the best in your class, school, university, state….., you failed to get a visa stamped this quarter…, you missed a promotion this year…, how did it feel when your dad told you in your early twenties that you are good for nothing…..and now your boss tells you the same...

You keep introspecting and go into a shell when people most of whom don’t matter a dime in your life criticize you, back bite you, make fun of you. You are left sad and shattered and you cry when your own kin scoffs at you. You say I am feeling low today. It takes a lot from us to come out of these everyday situations and move on. A lot??? really?

Now here’s a man standing on the third man boundary in the last over of a world cup match. The bowler just has to bowl sensibly to win this game. What the man at the boundary sees is 4 rank bad balls bowled without any sense of focus, planning or regret. India loses, yet again in those circumstances when he has done just about everything right.
He does not cry. Does not show any emotion. Just keeps his head down and leaves the field. He has seen these failures for 22 years now. And not just his class, relatives, friends but the whole world has seen these failures. We are too immature to even imagine what goes on in that mind and heart of his. That’s why I would never want to be Sachin.

True, he has single handedly lifted the moods of this entire nation umpteen number of times. He has been an inspiration to rise above our mediocrity. Nobody who has ever lifted the willow even comes close to this man’s genius. His dedication and mental strength are unparalleled. This is especially for those people who would have made fun of him again last night when India lost. They are people who are mediocre in their own lives. Who just scoff at others to create cheap fun. Who have lived in a small hole throughout their lives and thought they have seen the oceans.

Think about the man himself. He is 37 years of age. He has been playing almost non stop for 22 years. The way he was running and diving around the field last night would have put 22 year olds to shame. The way he played the best opening quickies in the world was breathtaking. He just keeps getting better, which is by the way humanly impossible. It’s not for nothing that people call him GOD.
But still I don’t want to be in those shoes. We struggle in keeping our monotonous lives straight, lives which affect a limited number of people. Imagine what would be the magnitude of the inner struggle for him, pain both mental and physical, tears that have frozen with time, knees and ankles and every other joint in the body that is either bandaged or needs to be attended to every night, eyes that don’t sleep before a big game, bats that have scored 99 international tons and still see expectations from a billion people.

And he just converts those expectations into reality. We watch in awe, feel privileged.
Well I think it’s time that his team realizes that enough is enough. They have an obligation, not towards their country alone but towards Sachin. They need to win this one for him. Stay assured that he himself will still deliver and leave no stone unturned to make sure India wins this cup.
This is not just a game, and he is not just a sportsman. It’s much more than this. Words fail here.....

-- Anonymous

(This post is not written by Harsha Bhogle even though the source page said it was)

Setting up and running Hadoop 0.20.2

2011-02-24T16:37:00.001-05:00

Step 1: Download and extract hadoop code
wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
# Extract the files and go into that directory
tar -xzf hadoop-0.20.2.tar.gz
cd hadoop-0.20.2

Step 2: Configure Hadoop
vim ~/.bash_profile
export JAVA_HOME=/usr/local/java/vms/java
export HADOOP_HOME=/home/oa/hadoop-asterix/hadoop-0.20.2

vim ${HADOOP_HOME}/conf/masters
asterix-master

vim ${HADOOP_HOME}/conf/slaves
asterix-001
asterix-002
asterix-003

vim ${HADOOP_HOME}/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://10.122.198.195:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

vim ${HADOOP_HOME}/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>10.122.198.195:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

vim ${HADOOP_HOME}/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hdfs/name_dir/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hdfs/data_dir/</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:54325</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:54326</value>
</property>
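
In the actual files under ${HADOOP_HOME}/conf, these property fragments sit inside a single <configuration> root element. If you prefer to generate the files rather than edit them by hand, the following is a minimal sketch using only the Python standard library; the helper name render_site_xml is hypothetical and the values are the ones from this post.

```python
import xml.etree.ElementTree as ET

def render_site_xml(properties):
    """Render a dict of name -> value as the text of a Hadoop *-site.xml file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")

hdfs_site = render_site_xml({
    "dfs.replication": 2,
    "dfs.name.dir": "/mnt/hdfs/name_dir/",
    "dfs.data.dir": "/mnt/hdfs/data_dir/",
})
print(hdfs_site)
```

The output is a single line without descriptions; that is enough for Hadoop to read, but keep the hand-written files if you want the description text.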

vim ${HADOOP_HOME}/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/java/vms/java

Make sure the name and data directories are properly set up using the following script. Running it on the slaves also confirms that you can do passwordless ssh from the master to the slave machines.
vim ${HADOOP_HOME}/commands.sh
# -p creates missing parent directories and does not fail if they already exist
sudo mkdir -p /mnt/hdfs/name_dir /mnt/hdfs/data_dir
sudo chmod -R 777 /mnt/hdfs

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Then, run the above commands.sh on every slave
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done
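
The same per-slave loop can be sketched in Python, which makes it easy to inspect the commands before running them. This is only an illustration: the helper names are hypothetical, the sample host list and path are the ones used earlier in this post, and skipping "#" comments in the slaves file is an assumption. Nothing is executed; the block only prints the commands.

```python
import shlex

def parse_slaves(text):
    """Return the host names listed in the contents of a conf/slaves file."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            hosts.append(line)
    return hosts

def build_ssh_commands(hosts, user_name, command):
    """Build one ssh argument list per slave; nothing is executed here."""
    return [["ssh", f"{user_name}@{host}", command] for host in hosts]

slaves_text = "asterix-001\nasterix-002\nasterix-003\n"
commands = build_ssh_commands(
    parse_slaves(slaves_text),
    "ubuntu",
    "sh /home/oa/hadoop-asterix/hadoop-0.20.2/commands.sh",
)
for argv in commands:
    # To actually run a command, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```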

Step 3: Compile hadoop and sync slaves
# Compile source code
${ANT_HOME}/bin/ant
${ANT_HOME}/bin/ant jar
${ANT_HOME}/bin/ant examples

# Sync slaves (see EC2 point 3 for installing parallel ssh)
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Step 4: Run hadoop
${HADOOP_HOME}/bin/hadoop namenode -format
${HADOOP_HOME}/bin/stop-all.sh
${HADOOP_HOME}/bin/start-all.sh

Check the datanode and namenode logs under logs/. Also check that the nodes are up by opening http://master-ip:50070 in any normal browser (or in links).
If you get an "incompatible namespace error" in the datanode logs, try deleting the hdfs directory and restarting HDFS.

Step 5: Loading the HDFS
If you are loading from:
1. Local filesystem of master, use either copyFromLocal or put
bin/hadoop dfs -copyFromLocal /mnt/wikipedia_input/wikistats/pagecounts/pagecounts* wikipedia_input

2. Some other hdfs, use put

3. Files on some other machine accessible via scp
# Configure the following variables. Keep a space between the parentheses of the array and each item (no comma).
# If this script gives an error like '4: Syntax error: "(" unexpected', try bash <script-name>
# If that gives a permission denied error, put the names of the directories instead of ${directories1[@]}
user_name="ubuntu"
machine_name="my_machine_name_or_ip"
file_prefix="pagecounts*"
hdfs_dir="wikipedia_input"
directories1=( "/mnt/data/sdb/space/oa/wikidata/dammit.lt/wikistats/archive/2010/09" "/mnt/data/sdc/space/oa/wikidata/dammit.lt/wikistats/archive/2010/10" "/mnt/data/sdd/space/oa/wikidata/dammit.lt/wikistats/archive/2010/11" )
for dir1 in "${directories1[@]}"
do
echo "--------------------------------------------"
cmd="ssh ${user_name}@${machine_name} 'ls ${dir1}/$file_prefix'"
echo "Reading files using: $cmd"
for file1 in `eval $cmd`
do
# Strip the directory part to get the bare file name
file_name1=${file1##*/}
echo -n $file_name1 " "
# Copy from the machine configured above, load into HDFS, then delete the local copy
scp ${user_name}@${machine_name}:$file1 .
bin/hadoop dfs -copyFromLocal $file_name1 $hdfs_dir
rm $file_name1
done
done

Step 6: Run your mapreduce program
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/build/hadoop-hop-0.2-examples.jar wordcount tpch_input tpch_output

Some tips for Amazon EC2:
1. Often, due to its resource allocation policies, EC2 shuts down your virtual machine (and hence its assigned network), so the master/slaves/namenodes/datanodes go into fault tolerance mode and restart the jobs. You can work around this by setting a very long heartbeat timeout (this works because EC2 restarts your virtual machine after some time, so there is no "real" failure, just a temporary lag). Set the following in ${HADOOP_HOME}/conf/hdfs-site.xml
<property>
<name>dfs.heartbeat.interval</name>
<value>6000</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>6000</value>
<description>Determines how often the namenode rechecks datanode heartbeats.</description>
</property>
<property>
<name>heartbeat.recheck.interval</name>
<value>6000</value>
<description>Fallback if dfs.heartbeat.recheck.interval does not work</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>180000</value>
<description>dfs socket timeout</description>
</property>
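
To get a feel for how long the namenode would then wait before declaring a datanode dead, here is a rough back-of-the-envelope sketch. It assumes (this is my recollection of the 0.20 namenode code, not something stated in this post) that the expiry time is 2 * recheck interval + 10 * heartbeat interval, with dfs.heartbeat.interval in seconds (default 3) and heartbeat.recheck.interval in milliseconds (default 5 minutes). Please verify against your Hadoop version before relying on the numbers.

```python
def heartbeat_expire_ms(heartbeat_interval_s, recheck_interval_ms):
    """Assumed expiry formula: 2 * recheck interval + 10 * heartbeat interval, in ms."""
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000

# Assumed defaults: 3 second heartbeat, 5 minute recheck interval.
default_ms = heartbeat_expire_ms(3, 5 * 60 * 1000)
# Values from the properties above: 6000 for both settings.
tuned_ms = heartbeat_expire_ms(6000, 6000)

print(default_ms / 60_000, "minutes with the assumed defaults")
print(tuned_ms / 3_600_000, "hours with the values above")
```

Under these assumptions the setting is not truly infinite, but it is long enough to ride out a temporary EC2 restart.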

2. Log in to the EC2 machine:
EC2_KEYPAIR_DIR="/home/np6/EC2"
echo "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
echo "If this does not work, try (exec ssh-agent bash) and then this command again"
# Start the agent and export its environment variables into this shell
eval "$(ssh-agent -s)"
ssh-add ${EC2_KEYPAIR_DIR}/ec2-keypair.pem
ssh ubuntu@$AMAZON_PUBLIC_DNS

3. Setting up java and other programs on slaves from master
# First install parallel ssh
user_name="ubuntu"
command="sudo apt-get install -y pssh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

# Then install java (if you don't prefer openjdk)
vim ${HADOOP_HOME}/commands.sh
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bashrc
echo "export HADOOP_HOME=/home/ubuntu/hadoop" >> ~/.bashrc
source ~/.bashrc
echo "Check if the version of java is correct:"
java -version

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Then, run the above commands.sh on every slave
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

4. Setting up the hadoop master and slaves for the lazy person (I would recommend you follow the above steps instead)
cd ${HADOOP_HOME}/conf
echo "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
sed -e "s/<name>mapred.job.tracker<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>mapred.job.tracker<\/name> <value>${AMAZON_PUBLIC_DNS}<\/value>/" hadoop-site.xml > a1.txt
sed -e "s/<name>fs.default.name<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>fs.default.name<\/name> <value>hdfs:\/\/${AMAZON_PUBLIC_DNS}:9001<\/value>/" a1.txt > hadoop-site.xml
echo ${AMAZON_PUBLIC_DNS} > masters
echo "\nEnter Slave string separated with space (eg: domU-12-31-39-09-A0-84.compute-1.internal domU-12-31-39-0F-7E-61.compute-1.internal)"
read SLAVE_STR
# Put each slave on its own line
echo $SLAVE_STR | tr ' ' '\n' > slaves

Some other neat tricks:
1. Replace default java temp directory:
export JAVA_OPTS="-Djava.io.tmpdir=/mnt/java_tmp"

2. Setting the open file limit to 99999
sudo vi /etc/security/limits.conf
ubuntu soft nofile 99999
ubuntu hard nofile 99999
* soft nofile 99999
* hard nofile 99999
sudo sysctl -p
ulimit -Hn

3. Checking the machines on the network
cat /etc/hosts

or naming machines as masters and slaves: vim /etc/hosts
10.1.0.1 asterix-master
127.0.0.1 localhost
10.0.0.1 asterix-001
10.0.0.2 asterix-002

Checking a machine's IP address
/sbin/ifconfig

4. Configuring password-less ssh from master to slaves
slave_user_name="ubuntu"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ${slave_user_name}@${slaves_ip}
done

For a more detailed step-by-step example, see
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/