Thursday, September 29, 2011

Presentation on Online Aggregation for Large MapReduce Jobs

This post has two purposes:
1. Give information about my Comp 600 presentation at Rice University
2. Enumerate the lessons learnt while preparing and giving the presentation.

(If the videos above do not work, click on: Dailymotion or Vimeo)

Also, if you have time, go through my slides or, better still, read my paper :)
- Slides
- Paper

FAQs during the presentation at VLDB:

1. What if the blocks are equally sized and spread uniformly?
- We used exactly that configuration. The input blocks of Wikipedia logs are approximately the same size, and HDFS does a good job of spreading the data uniformly. Still, we found biased results in the beginning.

2. Does your system replace a general database?
- Nope. As of now, it only performs single-table aggregation with group-by. For a general online database, see the DBO paper by our group.

3. Will you release your code ?
- Yes, we intend to release our code as open source. For the current version, see

4. Who do you see as a potential user of the system?
- Any company that has massive data stored in HDFS (logically as a single table) and wants to perform aggregation in an online fashion or only up to a certain accuracy. Note that the table containing the data need not be pre-randomized.
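
To make "aggregation in an online fashion or only up to a certain accuracy" concrete, here is a minimal Python sketch of a running estimate that stops early. It is only an illustration and not our system: it assumes the values arrive in a truly random order and ignores the bias mentioned in FAQ 1. The function name `online_mean` and the parameters `target_half_width`, `z` and `min_samples` are hypothetical.

```python
import math
import random


def online_mean(values, target_half_width, z=1.96, min_samples=30):
    """Consume values one at a time and stop once the approximate 95%
    confidence interval around the running mean is narrow enough.

    Returns (estimate, half_width, samples_used).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    half_width = float("inf")
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n >= min_samples:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)
            if half_width <= target_half_width:
                break  # accurate enough: stop before scanning everything
    return mean, half_width, n


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up data standing in for one numeric column of a large table.
    data = [rng.gauss(1000, 40) for _ in range(100_000)]
    estimate, half_width, used = online_mean(data, target_half_width=5.0)
    print(f"estimate={estimate:.1f} +/- {half_width:.1f} "
          f"after {used} of {len(data)} values")
```

The user trades accuracy for time through `target_half_width`: a looser target stops after fewer values, a tighter one scans more of the table.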

5. What does a user of the system need to know?
- Just how to write and run a Hadoop job ... All he/she has to do is write a Hadoop mapper and set the appropriate parameters in the job configuration. For details, see the appendix of our paper.

6. You didn't explain much of the statistics in the talk ... I want to know more
- Yes, you are absolutely right. In fact, that was intentional :) ... If I have to give a one-line summary of the statistics, here it is: sample from a multivariate normal over value, processing time and scheduling time (with a correlation between each pair of them) that takes into account the 'lower bound' information on the processing time of the blocks that are currently being processed ... rather than simply sampling from the distribution over the data.
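
A toy illustration of that one-line summary, and explicitly not the model from the paper: it keeps only two of the three variables (value and processing time), uses plain rejection sampling, and every name and number in it (`sample_value_given_lower_bound`, the means, standard deviations and `rho`) is made up for the example.

```python
import math
import random


def sample_value_given_lower_bound(rng, mu_v, sd_v, mu_t, sd_t, rho, t_lower,
                                   max_tries=100_000):
    """Draw (value, processing_time) from a bivariate normal, conditioned on
    processing_time >= t_lower, by rejecting draws below the bound."""
    for _ in range(max_tries):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        value = mu_v + sd_v * z1
        # Mix z1 into the time so that corr(value, processing_time) == rho.
        proc_time = mu_t + sd_t * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        if proc_time >= t_lower:
            return value, proc_time
    raise RuntimeError("lower bound too far in the tail for rejection sampling")


if __name__ == "__main__":
    rng = random.Random(1)
    # A block has already been running for 70 time units (the 'lower bound').
    draws = [sample_value_given_lower_bound(rng, 1000, 40, 60, 10, 0.8, 70)
             for _ in range(2000)]
    mean_value = sum(v for v, _ in draws) / len(draws)
    print(f"mean value given time >= 70: {mean_value:.1f} "
          f"(unconditional mean: 1000)")
```

With a positive correlation, the draws conditioned on the lower bound have a higher average value than the unconditional mean, which is the information that would be lost by sampling from the distribution over the data alone.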

7. You don't talk about performance ... Isn't a randomized scheduling policy bad for performance?
- We didn't talk about performance, mainly because our paper does not provide a comparative performance study. Note that our policy is not the same as a strict randomized policy. We expect the master to maintain a randomized queue and start a block's timer when a worker asks for that block. If, for some reason (say, locality), the worker does not want that block, or the master has some internal (locality-based) scheduling algorithm, the timer can be left on and the next block in the randomized queue taken instead. This machinery is necessary for the sampling theory, and we expect it to yield performance similar to locality-based scheduling. However, our current implementation does not use this mechanism. We intend to study performance in future work.
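
The mechanism described above can be sketched in a few lines of Python. This is a hypothetical sketch, not code from our implementation (which, as noted, does not use this mechanism yet): the class `RandomizedBlockQueue`, the `accepts` callback standing in for a locality check, and the injectable `clock` are all assumptions made for illustration.

```python
import random
import time


class RandomizedBlockQueue:
    """Master-side queue: blocks are offered in a random order, and a block's
    timer starts the first time it is offered, even if the worker skips it."""

    def __init__(self, block_ids, seed=None, clock=time.monotonic):
        self._order = list(block_ids)
        random.Random(seed).shuffle(self._order)
        self._clock = clock
        self.timer_started = {}  # block id -> time of first offer
        self.assigned = {}       # block id -> worker id

    def next_block(self, worker, accepts=lambda worker, block: True):
        """Walk the randomized order and hand out the first unassigned block
        that the worker accepts; skipped blocks keep their running timers."""
        for block in self._order:
            if block in self.assigned:
                continue
            if block not in self.timer_started:
                self.timer_started[block] = self._clock()  # timer stays on
            if accepts(worker, block):
                self.assigned[block] = worker
                return block
        return None  # nothing left that this worker wants


if __name__ == "__main__":
    queue = RandomizedBlockQueue(range(6), seed=1)
    # Pretend worker "w1" only holds local copies of the even-numbered blocks.
    block = queue.next_block("w1", accepts=lambda w, b: b % 2 == 0)
    print("w1 got block", block)
    print("timers already running for blocks", sorted(queue.timer_started))
```

The point of the sketch is that a skipped block keeps its original start time, so the order in which timers start is still random even though the assignment respects locality.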

Presentation tips:

Clearly, my presentation has a lot of room for improvement: for example, I need to talk slower, use a more uniform slide layout, interact more with the audience, etc. However, I have learnt a great deal while preparing and giving this presentation. I will summarize a few points that I (and probably a lot of graduate students) need to be careful about. For a more detailed analysis of these points, I suggest you read Presentation Zen, slide:ology and When the Scientist Presents.

Step 1: Planning your presentation
1. Create a relaxing environment
- Go to a library, coffee place, garden (place where no one will disturb you)
- Take time to relax and forget about daily chores.
- Don't bring your laptop. In this step, you will create a rough sketch of the presentation on paper (not slideware).
- A lot of the time, changing locations helps you to be more creative.

2. Make sure you understand the big picture of your presentation.
Can you give a 30-45 second answer to the following questions:
- What is the single most important "key" point that I want my audience to remember?
- Why should that point matter to the audience?

Also, make sure that your presentation title has keywords that highlight the big picture, as many people (if not all) will come to your talk after reading just the title (and abstract) or your name. This does not mean the title has to be long; in fact, shorter, thought-provoking titles are preferred.

3. Now, create a sketch of the presentation. This involves enumerating the key ideas you will be presenting and also organising/ordering them. You can use any or all of the following tools:
- Mindmap on whiteboard or paper
- Post-it (use sharpie to make sure you don't misuse post-it)
- Brainstorm the list of ideas + Grouping them
- Sketching or drawing 
Nancy Duarte says the most effective presentations involve telling a story that the audience can connect and relate to, rather than just presenting the information/data. Also, the best professors are good storytellers. They don't just go through the material in the book, but put their own personality, character and experience into the material in the form of a narrative, which is illuminating, engaging and memorable.

The most useful point I have learnt from Presentation Zen is that the presentation consists of three parts:
- Slides the audience will see (which you will create in Step 2)
- Notes only you will see (which you have created in this Step)
- Handouts that the audience can take away (so that they don't have to jot down key definitions or formulas). In the worst case, this could simply be a copy of your paper.
This essentially means that the slides are not a teleprompter that you read aloud throughout your presentation.

Step 2: Design the slides
To succeed as a presenter, you need to think as a designer (from slide:ology).

1. First few slides are very very important !!!
- I like to use them to tell the audience the big picture of the presentation.
- It is also necessary that these slides catch their attention, as we all know the audience's attention span decreases exponentially after the first few minutes. Don't spend the first few minutes reading through the 'Outline' slide.
- Use following principle: Tell them what you are going to present, present it, then tell them again what you presented :)

2. Slides should be as simple as possible.
- Don't use fancy, distracting background images. I prefer sticking to simple white background slide.
- Good design has plenty of empty spaces.
- Avoid typical bullet point presentations. They are boring and they do more harm than good.
- Don't use 3D figures when 2D figures can suffice
- Never ever clutter the slides with useless data, words, figures, images, etc

3. 1 message per slide ... not more !!
- Do you understand that message ?
- Does your slide reflect that message ? You don't have to write down the message into your slide
- Can I make this slide simpler and still make sure that the message is conveyed? In most cases, this is done by replacing a bunch of text with either an appropriate image or an animation.
- A lot of the time, the entire message can be conveyed just by a quote from a reliable or important person.

4. Try to find a good figure or animation to convey a complex idea, algorithm or system. Here are some tips on creating animations:
- Using the same slide background, create many slides (by copying). Change only 1 detail per slide. When you move from one slide to the next, it gives the effect of an animation, something like a flip-book. However, this process creates a lot of slides (which is problematic in many cases). So, if you want to fit everything into 1 slide, you can create a Flash object, GIF file or AVI file and embed it into the slide. Most people use Adobe tools to create professional-looking Flash files, but here is a poor man's solution:
- Use the flip-book style slide transition technique and a desktop recorder (e.g. recordmydesktop) to create a movie. It usually creates the movie in OGV file format.
- Also, you often need to animate a statistical distribution. In that case, use the animation library of R and create an OGV movie using the desktop recorder. (This library can also save the animation as a GIF, but I am having some problems with that.)

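# R code for animation (requires the 'animation' package)
library(animation)
ani.options(interval = 1, nmax = 20)
sd1 <- 41
for (i in 1:ani.options("nmax")) {
    x <- seq(900, 1100, length = 200)
    ## The plotting call is missing from the original listing; a normal
    ## density centered at 1000 is assumed here for illustration.
    plot(x, dnorm(x, mean = 1000, sd = sd1), type = "l", ylab = "density")
    ani.pause() ## pause for a while ('interval')
    sd1 <- sd1 - 2 ## shrink the spread a little on every frame
}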
a. Convert OGV to AVI:
mencoder input.ogv -ovc xvid -oac mp3lame -xvidencopts pass=1 -speed 0.5 -o output.avi
b. Convert OGV to GIF (or swf):
ffmpeg -i input.ogv -loop_output 0 -pix_fmt rgb24 -r 10 -s 320x240 output.gif
"-loop_output 0" means loop forever. (Newer ffmpeg releases no longer accept this option; there, use "-loop 0" instead.)

5. Is the content tailored to meet the audience's expectations?
- Employers/future collaborators will try to evaluate you rather than the topic.
- Students (undergrads, first-year grads, ...) or people from a different area want to understand the problem and the intuition behind the solution. Hence, they are more interested in the motivation section.
- Co-authors will look for recognition.
- Sponsors will look for acknowledgement.
- People who work in the same area will try to see how your system/solution compares to theirs (i.e. related work), or may be there to look for a new problem (i.e. your future work).

6. Other design tips:
- Use contrast (in font type, font color, font size, shape, color, images) to accentuate differences. (If it is different, make it very different - Garr Reynolds). It helps the audience identify the message of the slide quickly.
- Repetition: Use same image (or anchor) to represent the same concept throughout the presentation.
- Alignment: Use grids, invisible lines. Audience needs to know the order in which to process the information.
- Proximity
- Hierarchy, Unity
- Use high-quality graphics. There are websites that create and sell them, and others that even give them away for free.
- Nothing distracts the audience more than a cluttered slide.
- Make sure your font size is greater than 24 point (so that audience can read it from the projector)
- Use comics ... they not only convey the information efficiently, but also make the presentation playful. Though many of my colleagues disagree with this, I believe most audiences want to have fun even if it is a research presentation ... so lighten the mood, be creative and "artsy" with your slides :)

"The more strikingly visual your presentation is, the more people will remember it. And more importantly they will remember you" - Paul Arden.
Here are some slides that are well designed:

Step 3: Rehearse and re-rehearse
- Use both notes and slides.
- Rehearse until you can remember them and are comfortable with them. Imagine the worst-case scenario: you have lost your laptop, slides and all your notes ... and you have to give your presentation !!! ... If you can do that, you are ready :)
- Here you also do the job of editing, if you feel certain slides are just fillers and can be removed ... be brutal and delete them.
- Listen to for motivation on how to present.

Step 4: Delivery
- Go to the presentation room 15 min before the scheduled time and make sure your slides are compatible, the projector works, ... Also, bring a bottle of water, just in case.
- Leave the lights on and face the audience. People are there to see you, not to read through the slides. Use a wireless presenter so that you don't need to hide behind the podium/monitor.
- Slides are not a teleprompter ... so don't keep staring at the monitor ... connect with the audience.

Remember these two quotes:
- Every presentation has potential to be great; every presentation is high stakes; and every audience deserves the absolute best - Nancy Duarte.
- You are presenting yourself rather than the topic. So, negative impact of poor presentation far exceeds failure of getting your point across ... Jean-Luc Lebrun