Thursday, February 24, 2011

Setting up and running Hadoop 0.20.2

Step 1: Download and extract the Hadoop code
wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
Extract the files and go into that directory
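For example (the 0.20.2 tarball unpacks into a hadoop-0.20.2 directory):
tar -xzf hadoop-0.20.2.tar.gz
cd hadoop-0.20.2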

Step 2: Configure Hadoop
vim ~/.bash_profile
export JAVA_HOME=/usr/local/java/vms/java
export HADOOP_HOME=/home/oa/hadoop-asterix/hadoop-0.20.2
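
To pick up the new variables in the current shell and confirm they are set:
source ~/.bash_profile
echo $JAVA_HOME $HADOOP_HOME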

vim ${HADOOP_HOME}/conf/masters
asterix-master

vim ${HADOOP_HOME}/conf/slaves
asterix-001
asterix-002
asterix-003


vim ${HADOOP_HOME}/conf/core-site.xml (the <property> blocks in this and the following XML files go inside the file's <configuration> element)
<property>
<name>fs.default.name</name>
<value>hdfs://10.122.198.195:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


vim ${HADOOP_HOME}/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>10.122.198.195:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>


vim ${HADOOP_HOME}/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hdfs/name_dir/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hdfs/data_dir/</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:54325</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:54326</value>
</property>


vim ${HADOOP_HOME}/conf/hadoop-env.sh
export JAVA_HOME=/usr/local/java/vms/java
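
A quick check that this path really points at a JVM on the machine:
${JAVA_HOME}/bin/java -version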

Make sure the name and data directories are properly set up on every node using the following script. Running it over ssh also confirms that passwordless ssh from the master to the slave machines works.
vim ${HADOOP_HOME}/commands.sh
sudo mkdir -p /mnt/hdfs/name_dir /mnt/hdfs/data_dir
sudo chmod -R 777 /mnt/hdfs

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}
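
If parallel-rsync (part of the pssh package, see EC2 point 3) is not installed yet, a plain per-slave rsync loop does the same job; a minimal sketch, assuming the same ${HADOOP_HOME} path and the ubuntu user on every slave:
user_name="ubuntu"
for slave in $(cat ${HADOOP_HOME}/conf/slaves)
do
rsync -az -e ssh ${HADOOP_HOME}/ ${user_name}@${slave}:${HADOOP_HOME}/
done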

Then, run the above commands.sh on every slave:
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done
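
To double-check that the directories were created on every slave, the same loop can just list them:
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} "ls -ld /mnt/hdfs/name_dir /mnt/hdfs/data_dir"
done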


Step 3: Compile Hadoop and sync the slaves
# Compile source code
${ANT_HOME}/bin/ant
${ANT_HOME}/bin/ant jar
${ANT_HOME}/bin/ant examples

# Sync slaves (see EC2 point 3 for installing parallel ssh)
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Step 4: Run Hadoop
${HADOOP_HOME}/bin/hadoop namenode -format
${HADOOP_HOME}/bin/stop-all.sh
${HADOOP_HOME}/bin/start-all.sh
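
To confirm the daemons came up, jps (shipped with the Sun JDK) should list NameNode/JobTracker on the master and DataNode/TaskTracker on each slave; a quick sketch, assuming the same JAVA_HOME path on the slaves:
${JAVA_HOME}/bin/jps
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ubuntu@${slaves_ip} "${JAVA_HOME}/bin/jps"
done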

Check the namenode and datanode logs under ${HADOOP_HOME}/logs/. Also check that the nodes are up by opening http://master-ip:50070 in any normal browser (or the text-based links browser).
If you get an "incompatible namespaceID" error in a datanode's log, try deleting the HDFS directories, reformatting the namenode, and restarting HDFS, as sketched below.
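A minimal sketch of that cleanup, using the dfs.name.dir / dfs.data.dir paths configured above (warning: this wipes all HDFS data):
${HADOOP_HOME}/bin/stop-all.sh
user_name="ubuntu"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} "rm -rf /mnt/hdfs/data_dir/*"
done
rm -rf /mnt/hdfs/name_dir/*
${HADOOP_HOME}/bin/hadoop namenode -format
${HADOOP_HOME}/bin/start-all.sh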

Step 5: Load data into HDFS
If you are loading from:
1. The local filesystem of the master, use either copyFromLocal or put:
bin/hadoop dfs -copyFromLocal /mnt/wikipedia_input/wikistats/pagecounts/pagecounts* wikipedia_input

2. Another HDFS, use put (or hadoop distcp for large cluster-to-cluster copies)

3. Files on some other machine that are accessible via scp, use the script below
# Configure the following variables. Keep a space between the array parentheses and each item (no commas).
# If this script gives an error like '4: Syntax error: "(" unexpected', run it with bash <script-name> (arrays are a bash feature).
# If that gives a permission-denied error, list the directory names literally instead of using ${directories1[@]}.
user_name="ubuntu"
machine_name="my_machine_name_or_ip"
file_prefix="pagecounts*"
hdfs_dir="wikipedia_input"
directories1=( "/mnt/data/sdb/space/oa/wikidata/dammit.lt/wikistats/archive/2010/09" "/mnt/data/sdc/space/oa/wikidata/dammit.lt/wikistats/archive/2010/10" "/mnt/data/sdd/space/oa/wikidata/dammit.lt/wikistats/archive/2010/11" )
for dir1 in "${directories1[@]}"
do
echo "--------------------------------------------"
cmd="ssh ${user_name}@${machine_name} 'ls ${dir1}/${file_prefix}'"
echo "Reading files using: " $cmd
for file1 in `eval $cmd`
do
file_name1=${file1##*/}
echo -n $file_name1 " "
scp ${user_name}@${machine_name}:$file1 .
bin/hadoop dfs -copyFromLocal $file_name1 $hdfs_dir
rm $file_name1
done
done
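
Whichever of the three methods you use, a quick way to confirm the data landed in HDFS (wikipedia_input is the HDFS directory used above):
bin/hadoop dfs -ls wikipedia_input
bin/hadoop dfs -dus wikipedia_input
bin/hadoop dfsadmin -report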

Step 6: Run your MapReduce program
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/build/hadoop-hop-0.2-examples.jar wordcount tpch_input tpch_output
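
When the job finishes, the output stays in HDFS; a quick way to look at it (tpch_output is the output directory from the command above):
${HADOOP_HOME}/bin/hadoop dfs -ls tpch_output
${HADOOP_HOME}/bin/hadoop dfs -cat tpch_output/part-* | head
${HADOOP_HOME}/bin/hadoop dfs -getmerge tpch_output /tmp/tpch_output.txt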

Some tips for Amazon EC2:
1. Quite often, because of its resource allocation policies, EC2 temporarily suspends your virtual machine (and with it the assigned network), so the master/slaves/namenodes/datanodes go into fault-tolerance mode and restart the jobs. Since EC2 brings the virtual machine back after some time and there is no "real" failure, just a temporary lag, you can work around this by setting a very long heartbeat interval in ${HADOOP_HOME}/conf/hdfs-site.xml:
<property>
<name>dfs.heartbeat.interval</name>
<value>6000</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>6000</value>
<description>Determines the namenode's heartbeat recheck interval.</description>
</property>
<property>
<name>heartbeat.recheck.interval</name>
<value>6000</value>
<description>Fallback name in case the dfs.-prefixed property does not work.</description>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>180000</value>
<description>dfs socket timeout</description>
</property>

2. Logging into an EC2 machine:
EC2_KEYPAIR_DIR="/home/np6/EC2"
echo "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
echo "If this doesnot work try, (exec ssh-agent bash) and then this command again"
ssh-agent
ssh-add ${EC2_KEYPAIR_DIR}/ec2-keypair.pem
ssh ubuntu@$AMAZON_PUBLIC_DNS

3. Setting up Java and other programs on the slaves from the master
# First install parallel ssh (pssh)
user_name="ubuntu"
command="sudo apt-get install -y pssh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

# Then install Sun Java (if you don't want to use OpenJDK)
vim ${HADOOP_HOME}/commands.sh
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bashrc
echo "export HADOOP_HOME=/home/ubuntu/hadoop" >> ~/.bashrc
source ~/.bashrc
echo "Check if the version of java is correct:"
java -version

# Sync this file
parallel-rsync -p 6 -r -h ${HADOOP_HOME}/conf/slaves ${HADOOP_HOME} ${HADOOP_HOME}

Then, run the above commands.sh on every slave:
user_name="ubuntu"
command="sh ${HADOOP_HOME}/commands.sh"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${user_name}@${slaves_ip} ${command}
done

4. Setting up the Hadoop master and slaves for the lazy (I would recommend following the steps above instead)
cd ${HADOOP_HOME}/conf
echo "\nEnter Public DNS of Master"
read AMAZON_PUBLIC_DNS
sed -e "s/<name>mapred.job.tracker<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>mapred.job.tracker<\/name> <value>${AMAZON_PUBLIC_DNS}<\/value>/" hadoop-site.xml > a1.txt
sed -e "s/<name>fs.default.name<\/name> <value>[-[:graph:]./]\{1,\}<\/value>/<name>fs.default.name<\/name> <value>hdfs:\/\/${AMAZON_PUBLIC_DNS}:9001<\/value>/" a1.txt > hadoop-site.xml
echo ${AMAZON_PUBLIC_DNS} > masters
echo "\nEnter Slave string seperated with space (eg: domU-12-31-39-09-A0-84.compute-1.internal domU-12-31-39-0F-7E-61.compute-1.internal)"
read SLAVE_STR
echo $SLAVE_STR | sed -e "s/ /\n/g" > slaves

Some other neat tricks:
1. Replacing the default Java temp directory:
export JAVA_OPTS="-Djava.io.tmpdir=/mnt/java_tmp"
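
The directory has to exist and be writable by the Hadoop user before the JVM can use it:
sudo mkdir -p /mnt/java_tmp
sudo chmod 777 /mnt/java_tmp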

2. Setting the open file limit to 99999
sudo vi /etc/security/limits.conf
ubuntu soft nofile 99999
ubuntu hard nofile 99999
* soft nofile 99999
* hard nofile 99999
sudo sysctl -p
# log out and back in for the new limits to take effect, then verify the hard limit
ulimit -Hn

3. Checking the machines on the network
cat /etc/hosts

Or, to name the machines as master and slaves: vim /etc/hosts
10.1.0.1 asterix-master
127.0.0.1 localhost
10.0.0.1 asterix-001
10.0.0.2 asterix-002

To check a machine's IP address:
/sbin/ifconfig

4. Configuring passwordless ssh from the master to the slaves
slave_user_name="ubuntu"
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ${slave_user_name}@${slaves_ip}
done
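
If ssh-copy-id complains that the public key does not exist, generate a key pair on the master first and then re-run the loop; afterwards each slave should be reachable without a password:
ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
# after ssh-copy-id has run, each of these should print the date without prompting for a password
for slaves_ip in $(cat ${HADOOP_HOME}/conf/slaves)
do
ssh ${slave_user_name}@${slaves_ip} date
done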

For a more detailed step-by-step example, see
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/