## Important assumptions:

•  This analysis should be independent of programming language and machine architecture. So you can't just write your algorithm in C++ and compare its running time against some other algorithm written in Java.
• Also, since each algorithm can take an arbitrary data structure as input (e.g. graph, array, binary object, tree, ...), there has to be some standard way to specify it. For the sake of our analysis, we use the input size (n) as that measure. For example: length of an array, number of vertices or edges of a graph, number of nodes of a tree, etc.
The detailed explanation of these two points is given in Introduction to Algorithms by Cormen, Leiserson, Rivest et al. (CLR)

## Setup:

Input of size n ==> Algorithm that takes T(n) steps ==> some output

The number of steps in the algorithm often depends on factors other than the input size. For example, the number of steps quicksort takes for the input [1, 2, 3, 4, 5] differs from the number it takes for [5, 4, 3, 2, 1], even though both inputs have the same size (and, in this case, the same elements).
Hence, we will try to find 3 functions (an upper bound, a lower bound and a Theta bound) that can capture the effects of these factors: $$g_1(n), g_2(n) \text{ and } g_3(n) \text{ such that }$$ $$T(n) \in O(g_1(n)), T(n) \in \Omega(g_2(n)) \text{ and } T(n) \in \Theta(g_3(n))$$ Many authors also write the above relations as: $$T(n) = O(g_1(n)), T(n) = \Omega(g_2(n)) \text{ and } T(n) = \Theta(g_3(n))$$

## Big-O notations:

Now, let's understand the definitions of the above notations: \begin{align} \text{To prove that } & g_1(n) \text{ is an upper bound of } T(n) \text{, i.e. } T(n) \in O(g_1(n)), \\ \text{find two constants } & c, n_0 \text{ such that } \\ & \forall n \geq n_0, \; T(n) \leq c \, g_1(n) \end{align} For a lower bound, change the ≤ sign to ≥; for the Theta bound, find both a lower and an upper bound: $$T(n) \in \Theta(g_3(n)) \text{ iff } T(n) \in O(g_3(n)) \text{ and } T(n) \in \Omega(g_3(n))$$
To understand these in more detail, I suggest you solve problems from CLR chapter 3. Here are some examples you can start with: \begin{align} 2n^2 \in O(n^3) & \text{ with } c=1 \text{ and } n_0 = 2 \\ \sqrt{n} \in \Omega(\lg n) & \text{ with } c=1 \text{ and } n_0 = 16 \\ \frac{n^2}{2} - 2n \in \Theta(n^2) & \text{ with } c_1=\frac{1}{4}, c_2=\frac{1}{2} \text{ and } n_0 = 8 \\ \end{align} (Note: trivially large functions such as n^n can always serve as upper bounds, and trivially small ones as lower bounds. So, it is necessary to supply as tight a bound as possible for any given algorithm. Mathematically, small o denotes a strict (non-tight) upper bound and small omega a strict (non-tight) lower bound, but we will ignore them for the sake of this discussion.)
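As a quick sanity check, the claimed constants can be verified numerically over a finite range. This is a sketch; the helpers `holds_upper`/`holds_lower` are my own names, and a finite check is of course not a proof, only a way to catch a wrong constant:

```python
import math

def holds_upper(T, g, c, n0, n_max=1000):
    """Check T(n) <= c*g(n) for all n in [n0, n_max]."""
    return all(T(n) <= c * g(n) for n in range(n0, n_max + 1))

def holds_lower(T, g, c, n0, n_max=1000):
    """Check T(n) >= c*g(n) for all n in [n0, n_max]."""
    return all(T(n) >= c * g(n) for n in range(n0, n_max + 1))

# 2n^2 in O(n^3) with c=1, n0=2
print(holds_upper(lambda n: 2 * n * n, lambda n: n ** 3, c=1, n0=2))
# sqrt(n) in Omega(lg n) with c=1, n0=16
print(holds_lower(lambda n: math.sqrt(n), lambda n: math.log2(n), c=1, n0=16))
# n^2/2 - 2n in Theta(n^2) with c1=1/4, c2=1/2, n0=8
print(holds_lower(lambda n: n * n / 2 - 2 * n, lambda n: n * n, c=0.25, n0=8))
print(holds_upper(lambda n: n * n / 2 - 2 * n, lambda n: n * n, c=0.5, n0=8))
# all four print True
```

The real proof is the algebra (e.g. n²/2 − 2n ≥ n²/4 ⟺ n ≥ 8), but this kind of check quickly exposes a wrong c or n₀.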

## Common measures for algorithms:

The above three notations are often conflated with the terms 'worst case', 'best case' and 'average case' complexity. The more precise definitions are as follows: \begin{align} \text{Worst case: } & T(n) = \max \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Best case: } & T(n) = \min \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Average case: } & T(n) = \sum_{\text{input } x} Prob(x) T(x) \end{align} This means that to find the average case complexity of an algorithm, you need to know (or assume) the probability distribution over inputs and use it to compute the expectation of T(n).
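To make the three measures concrete, here is a small sketch (my own illustration, not from CLR): linear search for a fixed key over all orderings of n distinct keys, assuming every ordering is equally likely:

```python
from itertools import permutations

def steps(arr, key):
    """Number of comparisons linear search makes before finding key."""
    for i, v in enumerate(arr):
        if v == key:
            return i + 1
    return len(arr)

n = 4
instances = list(permutations(range(n)))   # all instances of size n
costs = [steps(p, 0) for p in instances]

worst = max(costs)                 # max over instances of size n  -> n
best = min(costs)                  # min over instances of size n  -> 1
average = sum(costs) / len(costs)  # uniform Prob(x): expectation  -> (n+1)/2
print(worst, best, average)        # 4 1 2.5
```

Note how the average case depends on the assumed distribution: with a different Prob(x) the expectation changes, while worst and best case do not.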

## Pseudo code to Big-O:

Now, let's address the key question: given pseudo code, how do we find its lower bound (or upper bound, or Theta bound)? Here is one possible way to find those bounds:
• First, express the algorithm in recursive form
• Determine appropriate metric for input size 'n'
• Then, find recurrence relation for the algorithm
• Solve the recurrence relation using the substitution method, the recursion tree method or the master theorem

For example, the recurrence relations for popular algorithms are: \begin{align} \text{Binary search: } & T(n) & = T(\frac{n}{2}) + O(1) \\ \text{Sequential search: } & T(n) & = T(n-1) + O(1) \\ \text{Tree traversal: } & T(n) & = 2T(\frac{n}{2}) + O(1) \\ \text{Selection sort: } & T(n) & = T(n-1) + O(n) \\ \text{Merge sort: } & T(n) & = 2T(\frac{n}{2}) + O(n) \end{align} Try writing the pseudo code for any of the above algorithms and verify the recurrence relations.
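To see one of these recurrences in action, here is a sketch of merge sort instrumented to count the O(n) work done in each merge (the counter is my addition, not part of the algorithm):

```python
import math

def merge_sort(a):
    """Return (sorted copy, total merge work), where each merge
    contributes len(merged) steps: the O(n) term in the recurrence."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, s1 = merge_sort(a[:mid])
    right, s2 = merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, s1 + s2 + len(merged)

# For n a power of two, the counted work is exactly n*lg(n),
# matching the solution of T(n) = 2T(n/2) + n.
for n in (2 ** 6, 2 ** 10):
    _, work = merge_sort(list(range(n, 0, -1)))
    print(n, work, n * int(math.log2(n)))
```

Each level of the recursion tree merges a total of n elements, and there are lg(n) levels, which is exactly what the instrumented count reports.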

## Master theorem:

The master theorem can only solve recurrence relations of the form T(n) = a T(n/b) + f(n), with a ≥ 1 and b > 1. Here is my modus operandi for the master theorem: \begin{align} \text{If } f(n)=n^{\log_b a}\lg^{k}n \text{ (for } k \geq 0 \text{)}, & \text{ then } T(n) = \Theta(n^{\log_b a} \lg^{k+1}n) \\ \text{Else if } n^{\log_b a} \text{ is polynomially greater than } f(n), & \text{ then } T(n) = \Theta(n^{\log_b a}) \\ \text{Else if } f(n) \text{ is polynomially greater than } n^{\log_b a}, & \text{ then } T(n) = \Theta(f(n)) \\ \text{Else use the substitution method} & \end{align}
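The case analysis above can be mechanized for the common special case f(n) = n^d lg^k(n). Here is a small sketch (the function `master` and its output format are my own, and it ignores the regularity condition of the third case):

```python
import math

def master(a, b, d, k=0):
    """Classify T(n) = a*T(n/b) + n^d * lg^k(n), with a >= 1, b > 1, k >= 0."""
    crit = math.log(a, b)  # exponent of the critical polynomial n^{log_b a}
    if math.isclose(d, crit):
        # f matches the critical polynomial times lg^k: gain one lg factor
        return f"Theta(n^{crit:g} * lg^{k + 1}(n))"
    if d < crit:
        # the critical polynomial dominates
        return f"Theta(n^{crit:g})"
    # f dominates (regularity condition assumed to hold)
    return f"Theta(n^{d:g} * lg^{k}(n))" if k else f"Theta(n^{d:g})"

print(master(2, 2, 1))  # merge sort: T(n) = 2T(n/2) + n
print(master(1, 2, 0))  # binary search: T(n) = T(n/2) + 1
print(master(4, 2, 1))  # T(n) = 4T(n/2) + n
```

For merge sort this prints Theta(n^1 * lg^1(n)), for binary search Theta(n^0 * lg^1(n)), i.e. Theta(lg n), and for the last recurrence Theta(n^2).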
Here are solutions to two important recurrence relations that cannot be solved using the master theorem: \begin{align} \text{Recurrence relation} & & \text{Complexity} \\ T(n) = T(n-1) + b n^k & \Rightarrow & O(n^{k+1})\\ T(n) = cT(n-1) + b n^k & \Rightarrow & O(c^n) \text{ (for } c > 1 \text{)} \end{align}
For recurrence relation of other forms, refer to CLR for substitution and recursion tree method.
(In short, for the substitution method, you guess the solution and use induction to find the constants and to prove that the guess is correct.
The recursion tree method is often used to find the guess for the substitution method.)

## Some useful approximations:

For non-negative real b, c, d (with c > 1 in the exponential sum and b > 1 in the last combination):
\begin{align} \text{Polynomial growth rate:}& \sum_{i=1}^n i^c &= \Theta(n^{c+1}) \\ \text{Logarithmic growth rate:}& \sum_{i=1}^n \dfrac{1}{i} &= \Theta(\log(n)) \\ \text{Exponential growth rate:}& \sum_{i=1}^n c^i &= \Theta(c^n) \\ \text{n-Log(n) growth rate:}& \sum_{i=1}^n \log(i)^c &= \Theta(n \log(n)^c) \\ \text{Combination of above:}& \sum_{i=1}^n \log(i)^c i^{d} &= \Theta(n^{d+1} \log(n)^c) \\ &\sum_{i=1}^n \log(i)^c i^d b^{i} &= \Theta(n^{d} \log(n)^c b^{n}) \\ \text{Stirling's approximation:}& n! &= \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n \left(1 + \Theta\left(\frac{1}{n}\right)\right) \end{align}

## Thursday, September 29, 2011

### Presentation on Online Aggregation for Large MapReduce Jobs

This post has two purposes:
1. Give information about my Comp 600 presentation at Rice University
2. Enumerate the lessons learned while preparing and giving the presentation.

(If the above video does not work, click on: Dailymotion or Vimeo)

Also, if you have time, go through my slides or, better still, read my paper :)
- Slides
- Paper

## FAQs during the presentation at VLDB:

1. What if the blocks are equally sized and spread across uniformly?
- We used exactly that configuration. The input blocks of the wikipedia logs are approximately the same size and HDFS does a good job of spreading the data across uniformly. Still, we found biased results in the beginning.

2. Does your system replace a general database?
- Nope. As of now, it only performs single table aggregation with group-by. For a general online database, see the DBO paper by our group.

3. Will you release your code ?
- Yes, we intend to release our code as open-source. For current version, see http://code.google.com/p/hyracks-online-aggregation/

4. Who do you see as a potential user of the system?
- Any company that has massive data stored in HDFS (logically as a single table) and wants to perform aggregation in an online fashion or only up to a certain accuracy. Note that the table containing the data need not be pre-randomized.

5. What does a user of the system need to know?
- Just how to write and run a hadoop job ... All he/she has to do is write a hadoop mapper and set appropriate parameters in the job configuration. For details, see the appendix of our paper.

6. You didn't explain much statistics in the talk ... I want to know more.
- Yes, you are absolutely right. In fact, that was intentional :) ... If I have to give a one-line summary of the statistics, here it is: sample from a multivariate normal over value, processing time and scheduling time (with a correlation factor between each pair), which takes into account the 'lower bound' information on the processing time of the blocks that are currently being processed ... rather than simply sampling from the distribution over the data.

7. You don't talk about performance ... Isn't a randomized scheduling policy bad for performance?
- We didn't talk about performance, mainly because we did not provide any comparative performance study in our paper. Note that our policy is not the same as a strict randomized policy. We expect the master to maintain a randomized queue and start the timer when a worker asks for a block. If for some reason (say locality) the worker does not want that block, or the master has some internal (locality-based) scheduling algorithm, it can leave the timer ON and go to the next block in the randomized queue. This machinery is necessary for the sampling theory. It should yield performance similar to locality-based scheduling. However, in our current implementation, we have not used this mechanism. We intend to do the performance study in future work.

## Presentation tips:

Clearly my presentation has a lot of room for improvement; for example: I need to talk slower, use a more uniform slide layout, interact more with the audience, etc. However, I learnt a great deal while preparing and giving this presentation. I will summarize a few points that I (and probably a lot of graduate students) need to be careful about. For a more detailed treatment of these points, I suggest you read Presentation Zen, Slide:ology and When the Scientist Presents.

1. Create a relaxing environment
- Go to a library, coffee place, garden (place where no one will disturb you)
- Take time to relax and forget about daily chores.
- Don't bring your laptop. In this step, you will create a rough sketch of the presentation on paper (not slideware).
- A lot of the time, changing locations helps you be more creative.

2. Make sure you understand the big picture of your presentation.
Can you find a 30-45 second answer to the following questions:
- What is the single most important "key" point that I want my audience to remember
- Why should that point matter to the audience

Also, make sure that your presentation title has keywords that highlight the big picture (as many people (if not all) will come to your talk having read just the title (and abstract) or your name). This does not mean the title has to be long; in fact, shorter, thought-provoking titles are preferred.

3. Now, create a sketch of the presentation. This involves enumerating the key ideas you will be presenting and also organising/ordering them. You can use any or all of the following tools:
- Mindmap on whiteboard or paper
- Post-it (use sharpie to make sure you don't misuse post-it)
- Brainstorm the list of ideas + Grouping them
- Sketching or drawing
Nancy Duarte says the most effective presentations involve telling a story the audience can connect and relate to, rather than just presenting information/data. Also, the best professors are good story-tellers. They don't just go through the material in the book, but put their own personality, character and experience into the material in the form of a narrative, which is illuminating, engaging and memorable.

The most useful point I have learnt from Presentation Zen is that the presentation consists of three parts:
- Slides the audience will see (which you will create in Step 2)
- Notes only you will see (which you have created in this Step)
- Handouts the audience can take away (so that they don't have to jot down key definitions or formulas). In the worst case, this could simply be a copy of your paper.
This essentially means that the slides are not a teleprompter that you read aloud throughout your presentation.

Step 2: Design the slides
To succeed as a presenter, you need to think like a designer (from slide:ology).

1. The first few slides are very, very important !!!
- I like to use them to tell the audience the big picture of the presentation.
- It is also necessary that they catch the audience's attention; as we all know, the attention span of an audience decreases exponentially after the first few minutes. Don't use the first few minutes reading through the 'Outline' slide.
- Use the following principle: tell them what you are going to present, present it, then tell them again what you presented :)

2. Slides should be as simple as possible.
- Don't use fancy, distracting background images. I prefer sticking to simple white background slide.
- Good design has plenty of empty spaces.
- Avoid typical bullet point presentations. They are boring and they do more harm than good.
- Don't use 3D figures when 2D figures can suffice
- Never ever clutter the slides with useless data, words, figures, images, etc

3. One message per slide ... not more !!
- Do you understand that message?
- Does your slide reflect that message? You don't have to write the message down on your slide.
- Can I make this slide simpler and still make sure that the message is conveyed? In most cases, this is done by replacing a bunch of text with an appropriate image or animation.
- A lot of the time, the entire message can be conveyed just by a quote from a reliable or important person.

4. Try to find a good figure or animation to convey a complex idea, algorithm or system. Here are some tips on creating animation:
- Using the same slide background, create many slides (by copying). Change only one detail per slide. Moving from one slide to the next then gives the effect of animation, like a flip-book. However, this process creates a lot of slides (which is problematic in many cases). So, if you want to fit everything into one slide, you can create a flash object, gif file or avi file and embed it into the slide. Most people use Adobe tools to create professional-looking flash files, but here is a poor man's solution:
- Use the flip-book style slide transition technique and a desktop recorder (e.g. recordmydesktop) to create a movie. It usually creates the movie in ogv format.
- Also, a lot of the time you need to create an animation from a statistical distribution. In that case, use the animation library of R and create an ogv movie using the desktop recorder. (This library also allows you to store the animation as a gif, but I am having some problems with it.)

# R code for animation
library(animation)
ani.options(interval = 1, nmax = 20)
sd1 <- 41
for (i in 1:ani.options("nmax")) {
  x <- seq(900, 1100, length = 200)
  y <- dnorm(x, mean = 1000, sd = sd1)
  plot(x, y, type = "l", lwd = 2, col = "red")
  ani.pause()  ## pause for a while ('interval')
  sd1 <- sd1 - 2
}
a. Convert OGV to AVI:
mencoder input.ogv -ovc xvid -oac mp3lame -xvidencopts pass=1 -speed 0.5 -o output.avi
b. Convert OGV to GIF (or swf):
ffmpeg -i input.ogv -loop_output 0 -pix_fmt rgb24 -r 10 -s 320x240 output.gif
"-loop_output 0" means loop forever.

5. Is the content tailored to meet the audience's expectations?
- Employers/future collaborators will try to evaluate you rather than the topic.
- Students (undergrads/first year grads/ ..) or people from a different area want to understand the problem and the intuition behind the solution. Hence, they are more interested in the motivation section.
- Co-authors will look for recognition.
- Sponsors will look for acknowledgement.
- People who work in the same area will try to see how your system/solution compares to theirs (i.e. related work), or may be there to look for a new problem (i.e. your future work).

6. Other design tips:
- Use contrast (font type, font color, font size, shape, color, images) to accentuate differences. ("If it is different, make it very different" - Garr Reynolds). It helps the audience identify the message of the slide quickly.
- Repetition: Use same image (or anchor) to represent the same concept throughout the presentation.
- Alignment: Use grids, invisible lines. Audience needs to know the order in which to process the information.
- Proximity
- Hierarchy, Unity
- Use high quality graphics. There are websites that create and sell them like www.istockphoto.com; or even give them for free like http://www.flickr.com/creativecommons, etc.
- Nothing distracts the audience more than a cluttered slide.
- Make sure your font size is greater than 24 point (so that the audience can read it off the projector).
- Use comics ... they not only convey information efficiently, but also make the presentation playful. Though many of my colleagues disagree with this, I believe most audiences want to have fun even in a research presentation .. so lighten the mood, be creative and "artsy" with your slides :)

"The more strikingly visual your presentation is, the more people will remember it. And more importantly they will remember you" - Paul Arden.
Here are some slides that are well designed:
http://www.slideshare.net/jbrenman
http://www.slideshare.net/chrislandry
http://www.slideshare.net/gkawasaki
http://www.slideshare.net/garr

Step 3: Rehearse and re-rehearse
- Use both notes and slides.
- Rehearse until you remember the slides and are comfortable with them. Imagine the worst case scenario: you lost your laptop, slides and all your notes ... and you still have to give your presentation !!! ... If you can do that, you are ready :)
- This is also where you edit: if you feel certain slides are just fillers, be brutal and delete them.
- Listen to http://www.ted.com/talks for motivation on how to present.

Step 4: Delivery
- Go to the presentation room 15 min before the scheduled time and make sure your slides are compatible, the projector works, ... Also, bring a bottle of water, just in case.
- Leave the lights on and face the audience. People are there to see you, not to read the slides. Use a wireless presenter so that you don't need to hide behind the podium/monitor.
- Slides are not a teleprompter ... so don't keep staring at the monitor ... connect with the audience.

Remember these two quotes:
- Every presentation has potential to be great; every presentation is high stakes; and every audience deserves the absolute best - Nancy Duarte.
- You are presenting yourself rather than the topic. So, negative impact of poor presentation far exceeds failure of getting your point across ... Jean-Luc Lebrun

## Monday, August 15, 2011

### Thinking in Probability

This blogpost is based on a lecture I gave at Jacob Sir's classes yesterday. The intent of my lecture was to urge students to ask questions and read "stuff" beyond the textbook. To get the students interested, I started with two examples, which explained dependent and independent events and also the need to stick to mathematical rigor rather than intuition.

The first example was the classic Monty Hall problem, which surprisingly none of my students had heard of. This was fortunate, because it meant everyone would have to think about it on their own, rather than give a prepared answer (thought through by someone else). So, here is the problem:

There are three doors D1, D2 and D3; behind two of them there is a goat and behind the other is a car. The objective of the game is to win the car. So, you have to guess which door to choose ... Does it make a difference whether you choose D1, D2 or D3? ... The consensus was NO, because Prob(win) = 0.33 for each door. To make the problem interesting, the game show host (who knows which door has the car and which doors have goats) opens one of the remaining doors that has a goat. He now asks whether, based on this information, you would like to switch your earlier choice. For example, say you chose D1 and the game show host opens D2 and shows that it has a goat. Now you can either stick to D1 or switch to D3. The real question we are interested in is:
Does switching doors make more sense, or staying with the same door? Or does it not matter whether you choose D1 or D3?

1. It does not matter whether you chose D1 or D3, because Prob(win) for each door is now 0.5
2. Staying with the door makes more sense, since we have increased the Prob(win) from 0.3 to 0.5
3. We really don't have plausible reason to switch, hence stay with the same door; especially since the game show host may want to trick us.

The correct answer is that switching doubles Prob(win), hence it makes more sense.
Consider the case where you don't switch:
Prob(win) = Prob(D1 = car) = 0.33 (the probability does not change because of an event outside its scope)
Now consider the case where you switch:
Prob(win) = Prob(D1 = goat) = 0.66

The other way to understand the problem is by enumerating the cases (you pick D1; each column is one possible arrangement of the car C and goats G):
D1 : C G G
D2 : G C G
D3 : G G C
Swap: G C C => Prob(car) = 0.66
Stay: C G G => Prob(car) = 0.33
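The enumeration above is easy to confirm with a Monte Carlo simulation (a sketch I did not show in the lecture; when the first pick is the car, the host is assumed to open the first available goat door, which does not affect the probabilities):

```python
import random

def play(switch, trials=100000):
    """Fraction of games won under the stay/switch strategy."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # host opens a goat door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # switch to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

random.seed(0)
print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```

Running this a few times makes the 0.33 vs 0.66 split hard to argue with, even for students who distrust the algebra.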

For the people who chose the third incorrect answer, I asked them not to use information not provided in the problem and to show as much restraint as possible in using intuition over logic. Deductive thinking says go from Step 2 to Step 3 only if there is a valid theorem, axiom, or logical reason (not intuition).

Let's move to the second problem:
Say I have a fair coin. I toss the coin four times and I get {Head, Head, Head, Head}, what will you bet on ?
Some students based their answer on the gambler's fallacy, stating that Tail is more likely since they had seen four consecutive heads. I then asked them to sit in a circle, discuss with each other and then answer my question again. Through this exercise, I wanted them to learn the idea of collaborating to get an answer. And voilà ... they came up with the correct answer ... It does not matter whether we bet on head or tail, since the tosses are INDEPENDENT and hence each outcome is equally likely => Prob(T5 | H1, H2, H3, H4) = Prob(T5) = 0.5
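The independence claim can also be checked by simulation (a sketch): condition on the first four tosses being heads and estimate the probability that the fifth is a tail:

```python
import random

random.seed(0)
trials, cond_tails, cond_total = 500000, 0, 0
for _ in range(trials):
    tosses = [random.random() < 0.5 for _ in range(5)]  # True = Head
    if all(tosses[:4]):          # keep only runs that start HHHH
        cond_total += 1
        cond_tails += (not tosses[4])

# Prob(T5 | H1, H2, H3, H4), estimated on ~trials/16 conditioned runs
print(cond_tails / cond_total)   # close to 0.5
```

Only about 1 in 16 runs survives the conditioning, yet among those runs the fifth toss is still a fair coin, which is exactly what independence means.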

Finally, I gave them two different explanations of probability:
1. Deterministic view for probability (explained in my earlier blogpost)

After explaining the deterministic view, I emphasized the importance of probability in solving real life problems. For example, in the coin tossing experiment, you are able to make rational predictions even though you don't know (hidden) parameters like the initial angle of the coin, the weight and density of the coin, the shape of the coin, the velocity of the coin, air resistance, characteristics of the surface where it lands, etc ... Similarly, you can build simple probabilistic models for complex scenarios like weather forecasting, stock prediction, traffic congestion, ...

Reference:
1. http://niketanblog.blogspot.com/2009/11/god-does-not-play-dice.html
2. http://en.wikipedia.org/wiki/Monty_Hall_problem
3. http://en.wikipedia.org/wiki/Gambler's_fallacy
4. http://en.wikipedia.org/wiki/Measure_(mathematics)

## Tuesday, March 15, 2011

### Nice post about Sachin Tendulkar after the India-SA match

Remember when you failed an examination. How many people recall that, your class, friends, relatives? You failed to make it to the IITs or IIMs. Who remembers. How many times have you had the feeling of being the best in your class, school , university, state….., you failed to get a visa stamped this quarter…, you missed a promotion this year…, how did it feel when you dad told you in your early twenties that you are good for nothing…..and now your boss tell you the same...

You keep introspecting and go into a shell when people most of whom don’t matter a dime in your life criticize you, back bite you, make fun of you. You are left sad and shattered and you cry when your own kin scoffs at you. You say I am feeling low today. It takes a lot from us to come out of these everyday situations and move on. A lot??? really?

Now here’s a man standing on the third man boundary in the last over of a world cup match. The bowler just has to bowl sensibly to win this game. What the man at the boundary sees is 4 rank bad bowls bowled without any sense of focus, planning or regret. India loses, yet again in those circumstances when he has done just about everything right.
He does not cry. Does not show any emotion. Just keeps his head down and leaves the field. He has seen these failures for 22 years now. And not just his class, relatives, friends but the whole world has seen these failures. We are too immature to even imagine what goes on in that mind and heart of his. That’s why I would never want to be Sachin.

True, he has single-handedly lifted the moods of this entire nation umpteen times. He has been an inspiration to rise above our mediocrity. Nobody who has ever lifted the willow even comes close to this man’s genius. His dedication and mental strength are unparalleled. This is especially for those people who would have made fun of him again last night when India lost. They are people who are mediocre in their own lives. Who just scoff at others to create cheap fun. Who have lived in a small hole throughout their lives and thought they have seen the oceans.

Think about the man himself. He is 37 years of age. He has been playing almost non-stop for 22 years. The way he was running and diving around the field last night would have put 22 year olds to shame. The way he played the best opening quickies in the world was breathtaking. He just keeps getting better, which is by the way humanly impossible. It’s not for nothing that people call him GOD.
But still I don’t want to be in those shoes. We struggle in keeping our monotonous lives straight, lives which affect a limited number of people. Imagine what would be the magnitude of the inner struggle for him, pain both mental and physical, tears that have frozen with time, knees and ankles and every other joint in the body that is either bandaged or needs to be attended to every night, eyes that don’t sleep before a big game, bats that have scored 99 international tons and still see expectations from a billion people.

And he just converts those expectations into reality. We watch in awe, feel privileged.
Well, I think it’s time that his team realizes that enough is enough. They have an obligation, not towards their country alone but towards Sachin. They need to win this one for him. Stay assured that he himself will still deliver and leave no stone unturned to make sure India wins this cup.
This is not just a game, and he is not just a sportsman. It’s much more than that. Words fail here.....

-- Anonymous

(This post is not written by Harsha Bhogle even though the source page said it was)

## Thursday, February 24, 2011

### Setting up and running Hadoop 0.20.2

Extract the files and go into that directory

vim ~/.bash_profile
export JAVA_HOME=/usr/local/java/vms/java

vim ${HADOOP_HOME}/conf/masters
asterix-master

vim ${HADOOP_HOME}/conf/slaves
asterix-001
asterix-002
asterix-003

vim ${HADOOP_HOME}/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://10.122.198.195:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

vim ${HADOOP_HOME}/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>10.122.198.195:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
</description>
</property>

vim ${HADOOP_HOME}/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hdfs/name_dir/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hdfs/data_dir/</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:54325</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:54326</value>
</property>

vim ${HADOOP_HOME}/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/java/vms/java

Make sure the name and data directories are properly set up using the following script. This also confirms that you can do passwordless ssh from the master to the slave machines.
vim ${HADOOP_HOME}/commands.sh
cd /mnt
sudo mkdir hdfs
cd hdfs/
sudo mkdir name_dir
sudo mkdir data_dir
sudo chmod 777 -R /mnt/hdfs

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Then, run the above commands.sh for every slave:
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

Step 3: Compile hadoop and sync slaves
# Compile source code
${ANT_HOME}/bin/ant
${ANT_HOME}/bin/ant jar
${ANT_HOME}/bin/ant examples

# Sync slaves (see EC2 point 3 for installing parallel ssh)
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

${HADOOP_HOME}/bin/hadoop namenode -format
${HADOOP_HOME}/bin/stop-all.sh
${HADOOP_HOME}/bin/start-all.sh

Check the datanode and namenode logs. Also see if the nodes are up, by using (any normal browser or) links http://master-ip:50070
If you get an "incompatible namespace error" in the datanode logs, try deleting the hdfs directories and restarting hdfs.

Step 4: Loading the HDFS
If you are loading from:
1. The local filesystem of the master, use either copyFromLocal or put:
bin/hadoop dfs -copyFromLocal /mnt/wikipedia_input/wikistats/pagecounts/pagecounts* wikipedia_input
2. Some other hdfs, use put
3. Files on some other machine accessible via scp:
# Configure the following variables. Keep a space between the parentheses of the array and each item (no comma).
# If this script gives an error like '4: Syntax error: "(" unexpected', try bash <script-name>
# If that gives a permission denied error, put the names of the directories instead of ${directories1[@]}
user_name="ubuntu"
machine_name="my_machine_name_or_ip"
file_prefix="pagecounts*"
hdfs_dir="wikipedia_input"
directories1=( "/mnt/data/sdb/space/oa/wikidata/dammit.lt/wikistats/archive/2010/09" "/mnt/data/sdc/space/oa/wikidata/dammit.lt/wikistats/archive/2010/10" "/mnt/data/sdd/space/oa/wikidata/dammit.lt/wikistats/archive/2010/11" )
for dir1 in ${directories1[@]}
do
echo "--------------------------------------------"
cmd="ssh ${user_name}@${machine_name} 'ls ${dir1}/$file_prefix'"
echo "Reading files using: " $cmd
for file1 in $(eval $cmd)
do
file_name1=${file1##*/}
echo -n $file_name1 " "
scp oa@asterix-001:$file1 .
bin/hadoop dfs -copyFromLocal $file_name1 $hdfs_dir
rm $file_name1
done
done

Step 5: Run your mapreduce program
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/build/hadoop-hop-0.2-examples.jar wordcount tpch_input tpch_output

Some tips for amazon EC2:
1. A lot of times, due to resource allocation policies, EC2 shuts down your virtual machine (and hence the assigned network), and the master/slaves/namenodes/datanodes go into fault tolerance mode and restart the jobs. You can set a very large heartbeat interval to tackle this error (this works because EC2 restarts your virtual machine after some time and there is no "real" failure, just temporary lags) by putting the following into ${HADOOP_HOME}/conf/hdfs-site.xml
<property>
<name>dfs.heartbeat.interval</name>
<value>6000</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>6000</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>heartbeat.recheck.interval</name>
<value>6000</value>
<description>If dfs... doesnot work</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>180000</value>
<description>dfs socket timeout</description>
</property>

2. Logging into the master:
EC2_KEYPAIR_DIR="/home/np6/EC2"
echo "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
echo "If this does not work, try (exec ssh-agent bash) and then this command again"
ssh-agent
ssh-add ${EC2_KEYPAIR_DIR}/ec2-keypair.pem
ssh ubuntu@$AMAZON_PUBLIC_DNS

3. Setting up java and other programs on slaves from master
# First install parallel ssh
user_name="ubuntu"
command="sudo apt-get install pssh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

# Then install java (if you don't prefer openjdk)
vim ${HADOOP_HOME}/commands.sh
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bashrc
source ~/.bashrc
echo "Check if the version of java is correct:"
java -version

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

# Then, run the above commands.sh on every slave
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

4. Setting up hadoop master and slaves for the lazy person (I would recommend you follow the above steps instead):
cd ${HADOOP_HOME}/conf
echo -e "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
sed -e "s/<name>mapred.job.tracker<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>mapred.job.tracker<\/name> <value>${AMAZON_PUBLIC_DNS}<\/value>/" hadoop-site.xml > a1.txt
sed -e "s/<name>fs.default.name<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>fs.default.name<\/name> <value>hdfs:\/\/${AMAZON_PUBLIC_DNS}:9001<\/value>/" a1.txt > hadoop-site.xml
echo ${AMAZON_PUBLIC_DNS} > masters
echo -e "\nEnter slave hostnames separated by spaces (eg: domU-12-31-39-09-A0-84.compute-1.internal domU-12-31-39-0F-7E-61.compute-1.internal)"
read SLAVE_STR
echo $SLAVE_STR | sed -e "s/ /\n/g" > slaves
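As a side note, escaping sed replacement strings is easy to get wrong; `tr` does the same space-to-newline conversion for the slaves file with no escaping at all. A quick sketch with made-up hostnames:

```shell
# turn a space-separated slave list into a one-hostname-per-line slaves file
SLAVE_STR="domU-aa.compute-1.internal domU-bb.compute-1.internal domU-cc.compute-1.internal"
echo "$SLAVE_STR" | tr ' ' '\n' > slaves
cat slaves
```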

Some other neat tricks:
1. Replace default java temp directory:
export JAVA_OPTS="-Djava.io.tmpdir=/mnt/java_tmp"

2. Setting the open file limit to 99999:
sudo vi /etc/security/limits.conf
# add these lines:
ubuntu soft nofile 99999
ubuntu hard nofile 99999
* soft nofile 99999
* hard nofile 99999
sudo sysctl -p
# note: limits.conf takes effect on the next login, so log out and back in, then verify:
ulimit -Hn

3. Checking the machines on the network
cat /etc/hosts

or naming machines as masters and slaves:
vim /etc/hosts
10.1.0.1 asterix-master
127.0.0.1 localhost
10.0.0.1 asterix-001
10.0.0.2 asterix-002

/sbin/ifconfig

4. Configuring password-less ssh of master to slaves
slave_user_name="ubuntu"
for slaves_ip in $(cat${HADOOP_HOME}/conf/slaves)
do
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ${slave_user_name}@${slaves_ip}
done

For a more detailed step-by-step example, see

## Saturday, January 29, 2011

### Life's wonderful and unexpected moments (in no particular order)

1. Life changing speech from Dad or anyone you respect.
2. Laughing so hard your face hurts.
3. Helping someone when they need you the most.
4. Falling in love.
5. Hearing your favorite song on radio while driving and singing it unmelodiously.
6. Lying in bed listening to the rain outside, sometimes even waking up to see it with a hot cup of coffee.
7. Mom's hot tea, birda, chicken, ... well almost anything she makes.
8. A good conversation.
9. Watching sunset/sunrise on the beach.
10. Laughing at yourself.
11. Laughing for absolutely no reason at all.
12. Laughing at an inside joke.
13. Being sincerely happy for someone.
14. Midnight phone calls that last for hours.
15. Having someone tell you that you're handsome/sexy/intelligent.
16. Accidentally overhearing someone say something nice about you.
17. Waking up and realizing you still have a few hours left to sleep.
19. Making new friends or spending time with old ones.
20. Playing with a new puppy.
21. Acting like teens once in a while.
22. Having someone play with your hair.
23. Having a wonderful dream.
25. A nice Swedish massage.
26. Watching a really good movie cuddled up on a couch.
27. Going to a really good concert.
28. Going to a (football/cricket) game and shouting/dancing/cheering the whole time.
29. Getting butterflies in your stomach every time you see that one person.
30. Making eye contact with a cute stranger.
31. Winning a really competitive game.
32. Running into an old friend and realizing that some things (good or bad) never change.
33. A long distance phone call.
34. Taking a drive on a pretty road.
35. The feeling you get just before you think you are going to get into trouble, and especially the one after you escape it narrowly.
36. Playing "pretend games" with kids.
37. Visiting a temple/church/mosque or any place of worship.
38. Hugging the person you love.
39. Getting a creative idea which keeps you awake all night.
40. Boating/Kayaking/Canoeing.
41. Camping/Trekking in nature park/beach.
42. Having sex.
43. Lying on the grass and watching the sky through the leaves.
44. Lying on the beach and trying to count the stars.
45. Watersports: Para-sailing, snorkelling, jet-skiing, scuba-diving.
46. Swimming in the sea or lake (especially if you are not a good swimmer).
47. Browsing through books in a library or bookstore, especially in fields other than your research/work.
49. Learning and appreciating a new word.
50. Learning a new language.
51. Trying something for the first time.
52. Going through the pain barrier in the gym.
53. A hot shower.
54. Trying new food/restaurants.
55. Travelling to new places.
56. Getting drunk or handling a drunk friend.
58. Pursue any one topic till you attain excellence in it.
59. Doing something against rationality and totally from the gut, especially when the people around you don't believe in you.
60. Teach a child something new.
61. Forgiving someone.
62. Buying something you always wanted for several years.
63. Dancing senselessly in a party/marriage/procession.
64. Realizing you are loved and respected by a few very important people in your life.
65. Jogging outdoors while listening to your favorite music in a nice weather.
66. Playing with colors during Holi.
67. Being a part of a human pyramid during Dahi-Handi.
68. Burning fire-crackers during Diwali.
69. Sitting/Standing by the door in the train.
70. Riding a horse.
71. Learning to play a musical instrument.
72. Being thanked for a nice gesture.
73. Being heartbroken.
74. Running through sprinklers or walking in the rain.
75. Playing with the snow.
76. Feeling of calm in solitude or while meditating.
77. Blood-rush to the brain while doing yoga.
78. Dressing funny for a costume party.
79. Feeling of levitation while jumping or catching a frisbee.
80. Take-off and landing of an airplane.
81. Sitting in the cockpit of Boeing 777.
83. Sailing on a yacht.
84. Being married to an amazing person.
85. Attending a major sporting event: the World Cup (Cricket/Soccer), Super Bowl, the Olympics, the U.S. Open.
86. Throwing a huge party and inviting every one of your friends.
87. Going to Disneyland/Seaworld and other theme parks.
88. Skydiving/Bungee jumping.