Friday, November 13, 2015

Using Google's TensorFlow for a Kaggle competition

Recently, the Google Brain team released their neural network library, TensorFlow. Since Google has a state-of-the-art Deep Learning system, I wanted to explore TensorFlow by trying it out for my first Kaggle submission (Digit Recognizer). After spending a few hours getting to know the jargon and APIs, I modified their multilayer convolutional neural network and came 140th (with 98.4% accuracy) ... not too shabby for my first submission :D.

If you are interested in outscoring me, apply cross-validation to the Python code below (a minimal cross-validation sketch appears at the end of this post) or consider using ensembles:
from input_data import *   # DataSet, read_data_sets, dense_to_one_hot from TensorFlow's MNIST tutorial
import numpy
import pandas

class DataSets(object):
 pass

mnist = DataSets()
df = pandas.read_csv('train.csv')

# Scale pixel values to [0, 1] and one-hot encode the labels
train_images = numpy.multiply(df.drop('label', 1).values, 1.0 / 255.0)
train_labels = dense_to_one_hot(df['label'].values)

#Add MNIST data from Yann LeCun's website for better accuracy. We hold out its test set just to measure accuracy, but could have easily added it too :)

mnist2 = read_data_sets("/tmp/data/", one_hot=True)
train_images = numpy.concatenate((train_images, mnist2.train._images), axis=0)
train_labels = numpy.concatenate((train_labels, mnist2.train._labels), axis=0)

VALIDATION_SIZE = 5000
validation_images = train_images[:VALIDATION_SIZE]
validation_labels = train_labels[:VALIDATION_SIZE]
train_images = train_images[VALIDATION_SIZE:]
train_labels = train_labels[VALIDATION_SIZE:]

mnist.train = DataSet([], [], fake_data=True)
mnist.train._images = train_images
mnist.train._labels = train_labels

mnist.validation = DataSet([], [], fake_data=True)
mnist.validation._images = validation_images
mnist.validation._labels = validation_labels

df1 = pandas.read_csv('test.csv')
test_images = numpy.multiply(df1.values, 1.0 / 255.0)
numTest = df1.shape[0]
test_labels = dense_to_one_hot(numpy.repeat([1], numTest))
mnist.test = DataSet([], [], fake_data=True)
mnist.test._images = test_images
mnist.test._labels = test_labels

import tensorflow as tf
sess = tf.InteractiveSession()
x = tf.placeholder("float", [None, 784])   # x is input features
W = tf.Variable(tf.zeros([784,10]))           # weights 
b = tf.Variable(tf.zeros([10]))                    # bias
y_ = tf.placeholder("float", [None,10])   # y' is input labels
#Weight Initialization
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)
  
def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)
  
# Convolution and Pooling
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
  
# First Convolutional Layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Second Convolutional Layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Densely Connected Layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# Dropout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout Layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
tf.initialize_all_variables().run()
for i in range(20000):
 batch = mnist.train.next_batch(50)
 train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print "test accuracy %g"%accuracy.eval(feed_dict={x: mnist2.test.images, y_: mnist2.test.labels, keep_prob: 1.0})
prediction = tf.argmax(y,1).eval(feed_dict={x: mnist.test.images, keep_prob: 1.0})

f = open('prediction.txt', 'w')
s1 = '\n'.join(str(p) for p in prediction)   # use 'p' to avoid shadowing the input placeholder x
f.write(s1)
f.close()
The above script assumes that you have downloaded train.csv and test.csv from Kaggle's website, downloaded input_data.py from TensorFlow's website, and installed pandas and TensorFlow.
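
The prediction.txt produced above contains one predicted digit per line. If I remember Kaggle's Digit Recognizer submission format correctly, it expects a CSV with an ImageId column and a Label column, so here is a minimal post-processing sketch (not part of the original script; the column names and output file name are my assumptions) to reformat the predictions:
import pandas

# Hypothetical post-processing: read one predicted digit per line and write a
# Kaggle-style CSV. Assumes the expected columns are ImageId (1-based) and Label.
preds = [int(line) for line in open('prediction.txt')]
submission = pandas.DataFrame({'ImageId': range(1, len(preds) + 1), 'Label': preds})
submission.to_csv('submission.csv', index=False)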

Here are a few observations based on my experience playing with TensorFlow:
  1. TensorFlow does not have an optimizer (in the compiler/database sense):
    1. TensorFlow statically maps a high-level expression (for example, "matmul") to a predefined low-level operator (for example, matmul_op.h) based on whether you are using the CPU- or GPU-enabled TensorFlow. On the other hand, SystemML compiles a matrix-multiplication expression (X %*% y) into one of many matrix-multiplication related physical operators using a sophisticated optimizer that adapts to the underlying data and cluster characteristics.
    2. Other popular open-source neural network libraries are Caffe, Theano, and Torch.
  2. TensorFlow is a parallel, but not a distributed, system:
    1. It does parallelize its computation (across CPU cores and also across GPUs).
    2. Google has released only the single-node version and has kept the distributed version in-house. This means that the open-sourced version does not have a parameter server.
  3. TensorFlow is easy to use, but difficult to debug:
    1. I like TensorFlow's Python API, and if the script/data/parameters are all correct, it works absolutely fine :)
    2. But if something fails, the error messages thrown by TensorFlow are difficult to decipher. This is because the error messages point to a generated physical operator (for example: tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input), not to the line of code in the Python program.
  4. TensorFlow is slow to train and not yet robust enough:
    1. Here are some initial numbers by Alex Smola comparing TensorFlow to other open-source deep learning systems (see the benchmark chart in his blog post).
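
Finally, here is the cross-validation sketch promised at the top of the post. It is only a sketch: train_and_score is a hypothetical placeholder where you would plug in the TensorFlow graph construction and training loop shown above:
import numpy
import pandas

def train_and_score(train_X, train_y, valid_X, valid_y):
    # Placeholder: build the convolutional network above, train on (train_X, train_y),
    # and return the accuracy measured on (valid_X, valid_y).
    return 0.0

df = pandas.read_csv('train.csv')
X = numpy.multiply(df.drop('label', 1).values, 1.0 / 255.0)
y = df['label'].values

# Simple K-fold split: shuffle the row indices once and cut them into K folds
K = 5
folds = numpy.array_split(numpy.random.permutation(len(X)), K)
scores = []
for k in range(K):
    valid_idx = folds[k]
    train_idx = numpy.concatenate([folds[j] for j in range(K) if j != k])
    scores.append(train_and_score(X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]))
print("Average cross-validation accuracy: %g" % numpy.mean(scores))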

Tuesday, February 17, 2015

Plotting as a useful debugging tool

While writing code for Bayesian modeling, you will have to test the distributions involved (prior, likelihood, or posterior). Here are two common scenarios that you might encounter:
  1. You want to test a function that generates random deviates. For example: you have derived a conjugate formula for a parameter of your model and want to test whether it is correct.
  2. You want to test a probability density function. For example: a likelihood or posterior function on which you might want to run a rejection sampler.
In both cases, you will start by making sure the properties of the distribution (for example: range, mean, variance) are correct. For example: if the parameter you are sampling is a variance, then you will have “assert(returnedVariance > 0)” in your code. The next obvious test is visual inspection (trust me, it has caught more bugs for me than traditional debugging techniques/tools). This means you will plot the distribution and see if the output of your code makes sense.
We will start by simplifying the above two cases by assuming a standard normal distribution. So, in the first case, we have access to the “rnorm” function, and in the second case, we have access to the “dnorm” function in R.
Case 1: In this case, we first collect random deviates (in “vals”) and then use ggplot to plot them:
library(ggplot2)
vals = rnorm(10000, mean=0, sd=1)
df = data.frame(xVals=vals)
ggplot(df, aes(x=xVals)) + geom_density()

The output of the above R script will be a density plot of the 10,000 samples, which should look roughly like the standard normal bell curve.
Case 2: In this case, we have to assume a bounding box and sample inside that to get “x_vals” and “y_vals” (just like rejection sampling):
library(ggplot2)
x_vals=runif(10000,min=-4,max=4)
y_vals=dnorm(x_vals, mean=0, sd=1)
df = data.frame(xVals=x_vals, yVals=y_vals)
ggplot(df, aes(x=xVals, y=yVals)) + geom_line()

The output of the above R script will be a line plot tracing the standard normal density curve.
Just as a teaser for a post that I will publish later, a script similar to that of case 1 can be used to study the characteristics of a distribution.

How to set up “passwordless ssh” on an Amazon EC2 cluster

Often, for running distributed applications, you may want to set up a new cluster or tweak an existing one (running on Amazon EC2) to support passwordless ssh. For creating a new cluster from scratch, there are a lot of cluster management tools (which are beyond the scope of this blog post). However, if all you want to do is set up “passwordless ssh” between nodes, then this post might be worth your read.
The script below assumes that you have completed the following three steps:

Step 1. Created an RSA key pair on each of the machines:
cd ~
ssh-keygen -t rsa

Do not enter any passphrase; instead, just press [enter].

Step 2. Suppressed strict host-key checking by adding the following two lines to the ssh config file:
sudo vim /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
 
Step 3. Copied the key pair file “MyKeyPair.pem” to the master’s home directory:
scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@ec2-master-public-address.compute-1.amazonaws.com:~

Assuming that the above three steps have been completed, run the following script on the master to enable passwordless ssh between master-slave and/or slave-slave nodes:


    #!/bin/bash
    # Author: Niketan R. Pansare
    # Colors used in the interactive prompts below
    green='\e[32m'
    endColor='\e[0m'
     
    # Password-less ssh
    # Make sure you have transferred your key-pair to master
    if [ ! -f ~/.ssh/id_rsa.pub ]; then
      echo "Expects ~/.ssh/id_rsa.pub to be created. Run ssh-keygen -t rsa from home directory"
      exit
    fi
     
    if [ ! -f ~/MyKeyPair.pem ]; then
      echo "For enabling password-less ssh, transfer MyKeyPair.pem to master's home folder (e.g.: scp -i /local-path/MyKeyPair.pem /local-path/MyKeyPair.pem ubuntu@master_public_dns:~)"
      exit
    fi
     
    echo "Provide following ip-addresses"
    echo -n -e "${green}Public${endColor} dns address of master:"
    read MASTER_IP
    echo ""
     
    # The assumption here is that you want to create a small cluster (~10 nodes)
    echo -n -e "${green}Public${endColor} dns addresses of slaves (separated by space):"
    read SLAVE_IPS
    echo "" 
     
    echo -n -e "Do you want to enable password-less ssh between ${green}master-slaves${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH" == "y" ]; then
      # Copy master's public key to itself
      #cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     
      for SLAVE_IP in $SLAVE_IPS
      do
          echo "Checking passwordless ssh between master -> "$SLAVE_IP
        ssh -o PasswordAuthentication=no $SLAVE_IP /bin/true
        IS_PASSWORD_LESS="n"
        if [ $? -eq 0 ]; then
            echo "Passwordless ssh has been setup between master -> "$SLAVE_IP
            echo "Now checking passwordless ssh between "$SLAVE_IP" -> master"
     
          ssh $SLAVE_IP 'ssh -o PasswordAuthentication=no' $MASTER_IP '/bin/true'  
          if [ $? -eq 0 ]; then
              echo "Passwordless ssh has been setup between "$SLAVE_IP" -> master"
            IS_PASSWORD_LESS="y"
          fi
        fi
     
        if [ "$IS_PASSWORD_LESS" == "n" ]; then
          # ssh-copy-id gave me a lot of issues, so I will use the commands below instead
          echo "Enabling passwordless ssh between master and "$SLAVE_IP
     
          # Copy master's public key to slave
          cat ~/.ssh/id_rsa.pub | ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'mkdir -p ~/.ssh ; cat >> ~/.ssh/authorized_keys' 
          # Copy slave's public key to master
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub' >> ~/.ssh/authorized_keys
          # Copy slave's public key to itself
          ssh -i ~/MyKeyPair.pem "ubuntu@"$SLAVE_IP 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
     
        fi    
      done
     
      echo ""
      echo "---------------------------------------------"
      echo "Testing password-less ssh on master -> slave"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP  uname -a
      done
     
      echo ""
      echo "Testing password-less ssh on slave -> master"
      for SLAVE_IP in $SLAVE_IPS
      do
        ssh "ubuntu@"$SLAVE_IP 'ssh ' $MASTER_IP 'uname -a'
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: Its quite possible, slave doesnot contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi
     
    echo -n -e "Do you want to enable password-less ssh between ${green}slave-slave${endColor} (y/n):"
    read ENABLE_PASSWORDLESS_SSH1
    echo ""
    if [ "$ENABLE_PASSWORDLESS_SSH1" == "y" ]; then
      if [ "$ENABLE_PASSWORDLESS_SSH" == "n" ]; then
        echo -n -e "In this part, the key assumption is that password-less ssh between ${green}master-slave${endColor} is enabled. Do you still want to continue (y/n):"
        read ANS1
        if [ "$ANS1" == "n" ]; then 
          exit
        fi
        echo ""
     
      fi
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          if [ "$SLAVE_IP1" != "$SLAVE_IP2" ]; then
            # Checking assumes passwordless ssh has already been setup between master and slaves
              echo "[Warning:] Skipping checking passwordless ssh between "$SLAVE_IP1" -> "$SLAVE_IP2
            IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET="n"
     
            # This will be true because ssh $SLAVE_IP1 is true
            #ssh $SLAVE_IP1 ssh -o PasswordAuthentication=no $SLAVE_IP2 /bin/true
            #if [ $? -eq 0 ]; then
     
     
            if [ "$IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET" == "n" ]; then
              echo "Enabling passwordless ssh between "$SLAVE_IP1" and "$SLAVE_IP2
              # Note: you are on the master now, which we assume has ~/MyKeyPair.pem in its home directory
              ssh -i ~/MyKeyPair.pem $SLAVE_IP1 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/MyKeyPair.pem $SLAVE_IP2 'cat >> ~/.ssh/authorized_keys' 
            else
              echo "Passwordless ssh has been setup between "$SLAVE_IP1" -> "$SLAVE_IP2
            fi
          fi
        done
     
      done
     
      echo "---------------------------------------------"
      echo "Testing password-less ssh on slave  slave"
      for SLAVE_IP1 in $SLAVE_IPS
      do
        for SLAVE_IP2 in $SLAVE_IPS
        do
          # Also, test password-less ssh on the current slave machine
          ssh $SLAVE_IP1 'ssh ' $SLAVE_IP2 'uname -a'
        done
      done
      echo "---------------------------------------------"
      echo "Sorry, prefer to keep this check manual to avoid headache in Hadoop or any other distributed program."
      echo -n -e "Do you see error or something fishy in above block (y/n):"
      read IS_ERROR1
      echo ""
      if [ "$IS_ERROR1" == "y" ]; then
        echo "I am sorry to hear this script didn't work for you :("
        echo "Hint1: Its quite possible, slave doesnot contain ~/MyKeyPair.pem"
        echo "Hint2: sudo vim /etc/ssh/ssh_config and add StrictHostKeyChecking no and UserKnownHostsFile=/dev/null to it"
        exit
      fi
    fi


Here is sample output obtained by running the above script:
ubuntu@ip-XXX:~$ ./enablePasswordlessSSH.sh
Provide the following ip-addresses

Public dns address of master:ec2-XXX-29.compute-1.amazonaws.com

Public dns addresses of slaves (separated by space):ec2-XXX-191.compute-1.amazonaws.com ec2-XXX-240.compute-1.amazonaws.com ec2-XXX-215.compute-1.amazonaws.com ec2-XXX-192.compute-1.amazonaws.com ec2-XXX-197.compute-1.amazonaws.com

Do you want to enable password-less ssh between master-slaves (y/n):y

Checking passwordless ssh between master -> ec2-XXX-YYY.compute-1.amazonaws.com
a bunch of warnings and possibly a few "Permission denied (publickey)." messages (Ignore these !!!)

---------------------------------------------
Testing password-less ssh on master -> slave
Don't ignore any error here !!!
---------------------------------------------
Sorry, I prefer to keep this check manual to avoid headaches in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n

Do you want to enable password-less ssh between slave-slave (y/n):y

---------------------------------------------
Testing password-less ssh on slave <-> slave
Don't ignore any error here !!!
---------------------------------------------

Sorry, I prefer to keep this check manual to avoid headaches in Hadoop or any other distributed program.
Do you see error or something fishy in above block (y/n):n


Warning: The above code does not check for passwordless ssh in the slave-slave configuration and sets it up blindly even if passwordless ssh is already enabled. This might not be a big deal if you call this script a few times, but it won't be ideal if the cluster needs to be modified dynamically over and over again. To change this behavior, look at the line: IS_PASSWORDLESS_SSH_BETWEEN_SLAVE_SET=”n”