Tuesday, February 17, 2015

Plotting as a useful debugging tool

While writing the code for bayesian modeling, you will have to test the distribution (prior, likelihood or posterior). Here are two common scenarios that you might encounter:
  1. You want to test the function that generates random deviates. For example: you have derived a conjugate formula for a parameter of your model and want to test whether it is correct or not.
  2. You want to test a probability density function. For example: likelihood or posterior function that you might want to run rejection sampler on.
In both these cases, you will start by making sure the property of the distribution (for example: range, mean, variance) are correct. For example: if the parameter you are sampling is variance, then you will have “assert(returnedVariance > 0)” in your code. Then, the next obvious test should be visual inspection (trust me, it has helped me catch more bugs than I would by traditional programming debugging techniques/tools). This means you will plot the distribution and see if the output of your code makes sense.
We will start by simplifying the above two cases by assuming standard normal distribution. So, in the first case, we have access to “rnorm” function and in second case, we have access to “dnorm” function of R.
Case 1: In this case, we first collect random deviates (in “vals”) and then use ggplot to plot them:
library(ggplot2)
vals = rnorm(10000, mean=0, sd=1)
df = data.frame(xVals=vals)
ggplot(df, aes(x=xVals)) + geom_density()

The output of above R script will look something like this:
Case 2: In this case, we have to assume a bounding box and sample inside that to get “x_vals” and “y_vals” (just like rejection sampling):
library(ggplot2)
x_vals=runif(10000,min=-4,max=4)
y_vals=dnorm(x_vals, mean=0, sd=1)
df = data.frame(xVals=x_vals, yVals=y_vals)
ggplot(df, aes(x=xVals, y=yVals)) + geom_line()

The output of above R script will look something like this:
Just as a teaser to a post that I will post later, we can use the script somewhat similar to that of case 1 to study the characteristics of a distribution:

How to setup “passwordless ssh” on Amazon EC2 cluster

Often for running distributed applications, you may want to setup a new cluster or tweak an existing one (running on Amazon EC2) to support passwordless ssh. For creating a new cluster from scratch, there are lot of cluster management tools (which is beyond the scope of this blogpost). However, if all you want to do is setup “passwordless ssh” between nodes, then this post might be worth your read.
The script below assumes that you have completed following three steps:

Step 1. Created RSA public keypair on each of the machine:
cd ~
ssh-keygen -t rsa

Do not enter any paraphrase, instead just press [enter].

Step 2. Suppressed warning flags in ssh-config file:
sudo vim /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
 
Step 3. Copied the key pair file “MyKeyPair.pem” to master’s home directory:
scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@ec2-master-public-address.compute-1.amazonaws.com:~

Assuming that above three steps have been completed, run this script on the master to enable passwordless ssh between master-slave and/or slave-slave nodes:


    #!/bin/bash
    # Author: Niketan R. Pansare
     
    # Password-less ssh
    # Make sure you have transferred your key-pair to master
    if [ ! -f ~/.ssh/id_rsa.pub ]; then
      echo "Expects ~/.ssh/id_rsa.pub to be created. Run ssh-keygen -t rsa from home directory"
      exit
    fi
     
    if [ ! -f ~/MyKeyPair.pem ]; then
      echo "For enabling password-less ssh, transfer MyKeyPair.pem to master's home folder (e.g.: scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@master_public_dns:~)"
      exit
    fi
     
    echo "Provide following ip-addresses"
    echo -n -e "${green}Public${endColor} dns address of master:"
    read MASTER_IP
    echo ""
     
    # Assumption here is that you want to create a small cluster ~ 10 nodes
    echo -n -e "${green}Public${endColor} dns addresses of slaves (separated by space):"
    read SLAVE_IPS
    echo "" 
     
    echo -n -e "Do you want to enable password-less ssh between ${green}master-slaves${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH" == "y" ]; then
      # Copy master's public key to itself
      #cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     
      for SLAVE_IP in $SLAVE_IPS
      do
          echo "Checking passwordless ssh between master -> "$SLAVE_IP
        ssh -o PasswordAuthentication=no $SLAVE_IP /bin/true
        IS_PASSWORD_LESS="n"
        if [ $? -eq 0 ]; then
            echo "Passwordless ssh has been setup between master -> "$SLAVE_IP
            echo "Now checking passwordless ssh between "$SLAVE_IP" -> master"
     
          ssh $SLAVE_IP 'ssh -o PasswordAuthentication=no' $MASTER_IP '/bin/true'  
          if [ $? -eq 0 ]; then
              echo "Passwordless ssh has been setup between "$SLAVE_IP" -> master"
            IS_PASSWORD_LESS="y"
          fi
        fi
     
        if [ "$IS_PASSWORD_LESS" == "n" ]; then
          # ssh-copy-id gave me lot of issues, so will use below commands instead
          echo "Enabling passwordless ssh between master and "$SLAVE_IP
     
          # Copy master's public key to slave
          cat ~/.ssh/id_rsa.pub | ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'mkdir -p ~/.ssh ; cat >> ~/.ssh/authorized_keys' 
          # Copy slave's public key to master
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub' >> ~/.ssh/authorized_keys
          # Copy slave's public key to itself
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
     
        fi    
      done
     
      echo ""
      echo "---------------------------------------------"
      echo "Testing password-less ssh on master -> slave"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP  uname -a
      done
     
      echo ""
      echo "Testing password-less ssh on slave -> master"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP 'ssh ' $MASTER_IP 'uname -a'
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: Its quite possible, slave doesnot contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi
     
    echo -n -e "Do you want to enable password-less ssh between ${green}slave-slave${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH1
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH1" == "y" ]; then
      if [ "$ENABLE_PASSWORDLESS_SSH" == "n" ]; then
        echo -n -e "In this part, the key assumption is that password-less ssh between ${green}master-slave${endColor} is enabled. Do you still want to continue (y/n):"
        read ANS1
        if [ "$ANS1" == "n" ]; then 
          exit
        fi
        echo ""
     
      fi
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          if [ "$SLAVE_IP1" != "$SLAVE_IP2" ]; then
            # Checking assumes passwordless ssh has already been setup between master and slaves
              echo "[Warning:] Skipping checking passwordless ssh between "$SLAVE_IP1" -> "$SLAVE_IP2
            IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
     
            # This will be true because ssh $SLAVE_IP1 is true
            #ssh $SLAVE_IP1 ssh -o PasswordAuthentication=no $SLAVE_IP2 /bin/true
            #if [ $? -eq 0 ]; then
     
     
            if [ "$IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET" == "n" ]; then
              echo "Enabling passwordless ssh between "$SLAVE_IP1" and "$SLAVE_IP2
              # Note you are on master now, which we assume to have 
              ssh -i ~/MyKeyPair.pem $SLAVE_IP1 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/MyKeyPair.pem $SLAVE_IP2 'cat >> ~/.ssh/authorized_keys' 
            else
              echo "Passwordless ssh has been setup between "$SLAVE_IP1" -> "$SLAVE_IP2
            fi
          fi
        done
     
      done
     
      echo "---------------------------------------------"
      echo "Testing password-less ssh on slave  slave"
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          # Also, test password-less ssh on the current slave machine
          ssh $SLAVE_IP1 'ssh ' $SLAVE_IP2 'uname -a'
        done
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: Its quite possible, slave doesnot contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi


Here is a sample output obtained by running the above script (src code available via link):
ubuntu@ip-XXX:~$ ./enablePasswordlessSSH.sh
Provide following ip-addresses

Public dns address of master:ec2-XXX-29.compute-1.amazonaws.com

Public dns addresses of slaves (separated by space):ec2-XXX-191.compute-1.amazonaws.com ec2-XXX-240.compute-1.amazonaws.com ec2-XXX-215.compute-1.amazonaws.com ec2-XXX-192.compute-1.amazonaws.com ec2-XXX-197.compute-1.amazonaws.com

Do you want to enable password-less ssh between master-slaves (y/n):y

Checking passwordless ssh between master -> ec2-XXX-YYY.compute-1.amazonaws.com
bunch of warning and possibly few "Permission denied (publickey)." (Ignore this !!!)

---------------------------------------------
Testing password-less ssh on master -> slave
Don't ignore any error here !!!
---------------------------------------------
Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n

Do you want to enable password-less ssh between slave-slave (y/n):y

---------------------------------------------
Testing password-less ssh on slave <-> slave
Don't ignore any error here !!!
---------------------------------------------

Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n


Warning: The above code does not check for passwordless ssh for slave-slave configuration and sets it blindly even if passwordless ssh is already enabled. It might not be big deal if you call this script few times but won’t be ideal if the cluster needs to dynamically modified over and over again. Still, to modify this behavior, look at the line: IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET=”n”