Niketan Pansare's blog

Friday, April 13, 2018

Installing Apache SystemML on the IBM Cloud

To use the latest released version of Apache SystemML on the IBM Cloud, please install it using the following commands in the jupyter shell:

1. Install the latest version of SystemML (for example: 1.2.0):

!pip install systemml

2.. Please execute the following command in the notebook after installation:

!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0-extra.jar ~/user-libs/spark2/systemml-1.2.0-extra.jar
!ln -s -f ~/user-libs/python3/systemml/systemml-java/systemml-1.2.0.jar ~/user-libs/spark2/systemml-1.2.0.jar

3. Then restart the kernel.

4. check if the correct version was installed:

!pip show systemml

5. (Important) Finally, check if the correct version is being picked up:

from systemml import MLContext
ml = MLContext(spark)
ml.version()

Tuesday, January 16, 2018

Notes from the Micro-MBA course (Part 1)

In my previous blogposts 1 2 3, I summarized the notes from entrepreneurship course offered by Rice Center for Engineering Leadership. This blogpost will extend the previous blogposts for larger organization based on the notes from the Micro-MBA class at IBM Almaden Research Center. As a disclaimer: this blogpost should be viewed as my personal biased opinions based on limited understanding of the subject, rather than any official position of IBM or the instructor. Also, I will skip many topics for brevity.

The Micro-MBA course can divided into following three sub-topics:

Economics
Finance
Accounting

Economics

The fundamental assumption of economics is of scarcity: People have unlimited wants but limited resources, namely:

Labor (human time and work)
Natural resources (raw materials)
Capital (man-made materials needed for the production of other goods such as buildins and machinery).

Hence, we have to chose how to use the limited resources to maximize our wants and needs. An economist studies the choices to following three fundamental questions:

WHAT (goods and services) to produce ?
HOW to produce them ?
WHO will receive the goods and services produced ?

WHO will receive the goods and services produced ?

The answer to the third question depends on what are the economic policies of the country. In a country with centrally planned economy such as Cuba and North Korea (and Soviet Union from 1917 to 1991), the government decides how the goods and service produced be allocated. On the other hand, in market economy, the individuals/households/firms who are most willing and able to buy the goods and services receive them (i.e. based on principle of supply and demand). Note, most countries today follow mixed economy, where most economic decisions results from the interactions of the buyers and sellers in the market, but in which the government plays a significant role in the allocation of resources [1].

	Centrally planned economy	Market economy
Productive efficiency (produced at lowest cost)		Better
Allocative efficiency (reflects consumer preferences)		Better
Equity (fair distribution of goods and services)	Better
Freedom	Better for freedom from uncertainty	Better for freedom of choice
Main proponent	Karl Marx, who posited in Das Kapital that market competitions exploits labor and ultimately fosters monopolies.	Adam Smith, on the other hand claimed that competition and the self-interested pursuit of profit causes the firms and the individuals to behave in ways that are ultimately socially beneficial.

WHAT (goods and services) to produce and HOW to produce them ?

As described above, in a market economy, self-interested pursuit of profit motivates people to decide what goods and service to produce. The profit is high if the consumer is willing to buy the goods and service for higher price that the cost of goods and service. Every economic decision that a consumer makes takes account of utility (the amount of benefit to the consumer) and comes with an opportunity cost [3]. Opportunity cost is the benefit, profit, or value of something that must be given up in order to have something else.

Opportunity cost

Let's understand opportunity cost by examining the problem of whether I should have done PhD after my Masters. After my Masters, I had two options:

Accept the job offer for Software Development Engineer (SDE) position in Microsoft SQL Server Data Mining team. The salary of this position is $ x per year.
Do 5-year PhD program at Rice university. Rice University gives PhD students a stipend of $ y per year. After PhD, I got an offer from IBM Research with a salary of $ x + z per year.

Assuming the compensation at Rice, Microsoft and IBM remain constant, the opportunity cost associated with doing a PhD after n years is $ x*n - (x + z)*(n - 5) - 5y. The below figure (generated by crude approximation) shows that it makes sense to do PhD if n > 10.5, where the opportunity cost becomes 0.

The above example ignores the rate of increase of compensation of a Masters vs PhD student and also the intangible costs such as experience at Microsoft, potential cutting-edge role at IBM, guidance under Chris, etc.

Specialization and Voluntary Trade

In the book The Wealth of Nations, Adam Smith highlights that two key points:

Workers produce more when they occupy specialized roles, so businesses can offer higher quality products at lower prices.
Trade benefits both buyer and seller (i.e. lower opportunity cost for both of them). Note, this is only true iff both the buyer and seller are engaged in the trade voluntarily and if both have complete information. The counter-examples for voluntary exchange are slavery, human trafficking and fraud, where government regulations are necessary.

To summarize, a corporation should attempt to:
Step 1: specialize such that it is the lowest cost provider
Step 2: apply opportunity cost in decision making, risk assessment (risk vs return) and pricing analysis (for example: if supply is high, price is $1 less than second cheapest seller and if supply is low, the price is $1 more than second highest bidder/buyer).
Step 3: exchange between two opportunity cost such that both parties are better off (i.e. producer sells to make profit and the consumer buys as it is more expensive to make it).

Monetary system

Except countries like Germany, South Korea, Switzerland, Norway, etc, the government spends more than its revenues. The difference is called as deficit spending or budget deficit or simply deficit.

To pay for the deficit, the Treasury borrows the money from the banks by issuing them treasury bonds. Treasury bonds offer interests and promise by the government to pay the principal and hence are considered as safest investments. However, in recent times, countries like Ukraine, Argentina, Greece, Zimbabwe and Russia were either defaulted or had to restructure their debts (see wikipedia for the exhaustive list). The big three rating agencies - Standard & Poor's, Moody's and Fitch - rate countries on their ability to pay interest, refinance and repay them.

Then through Open Market Operations (or OMO), the banks gets to sell the treasury bonds (or mortgage-backed securities) to the Federal Reserve (or Fed for short) at a profit. Interesting point to note here: Banks give Fed the treasury bonds and Fed adds "credit" to the bank reserves. As countries' central bank, Fed has unique power to create credit out of thin air without having money to back the credit up ... often referred to as "printing money".

Other than buying treasury bonds, Fed has another tool at its disposal: the reserve requirement (or cash reserve ratio or liquidity ratio), which refers to percentage of cash that must be kept at Fed or in the vault by a commercial. If a bank makes too many loans or if there is a sudden demand for withdrawals, then amount on money in the bank's vault falls below the reserve requirements. In this case, the bank first attempts to borrow money from other banks at the fed funds rate and then the remaining amount from the Fed at the discount rate. An equivalent interbank lending rate to feds funds rate in UK is LIBOR rate - the London Interbank Offered Rate (set by the British Banker's Association).

In practice, there is no standard fed funds rate, instead it is the weighted average of the interest rate banks charge each other overnight to meet the reserve requirements. The Federal Open Market Committee (FOMC) meets eight times a year to set the target fed funds rate and then by buying/selling of treasury bonds, it attempts to achieve the target fed funds rate. The Fed funds rate affect the prime rate (for setting home equity lines of credit, auto loans, personal loans and credit card rates) and 11th District cost of funds index (for setting adjustable-rate mortgages).

	Jan 2018	Dec 2017	Jan 2017
Prime rate reported by the Wall Street Journal's bank survey	4.5	4.25	3.75
Discount rate	2	1.75	1.25
Fed funds rate	1.5	1.5	0.75
11^th District cost of funds	0.746	0.737	0.603

Treasury then can use the money it received by selling the bonds to fund the deficit. Bank can use the credit it received from Fed and the money deposited by its customers minus the reserve requirements to make loans. This is called as fractional reserve lending and it works as follows. Let's assume that reserve requirements is 10% and X deposits $1000 in bank A. The bank A keeps $100 in its vault and lends remaining $900 to Y who needs it to fund her business. Y has an account in bank B and deposits $900 in her account which she intends to remove as and when needed. Bank B keeps $90 in its vault and lends remaining $810 to C who in turns deposits in bank W. Now, when A, B and C check their accounts in their respective banks, they will see that they have $1000, $900 and $810 respectively, that is the initial $1000 has grown to $2710 in the economy. This system fails if there is a mass panic and A, B and C all decide to remove the amount at once (commonly referred to as bank run).

Before we move ahead, it is a good idea to classify different types of banks:

Supranational banks:

The International Monetary Fun (IMF) ensures the stability of the international economic system and has 187 member countries.
World bank provides financial help to combat poverty, malnutrition and lack of communications and infrastructure to the developing countries.

Central banks such as Federal Reserve, European Central Bank, Bank of England, Reserve Bank of India (RBI), Swiss National Bank and The People's Bank of China (PBC), set monetary policy described in the below section.
Commercial bank's business model is to charge higher interest rate for loan than they pay depositors.
Investment banks provide three functions: Merger and Acquisition (M&E) facility for corporations to raise debt or equity, Institutional Client Services which includes Fixed Income, Commodities and Currencies (FICC) and Asset/Wealth Management for high net worth clients.

Government intervention

The government keeps the economy growing (close to long-run trend rate of 2.5%), limits unemployment and keep inflation low (inflation target of 2%) by fiscal and monetary policies:

Fiscal policy: is set by the US Congress, the President and the Treasury Secretary, which involves changing government spending and taxation.
Monetary policy: is set by the central bank (i.e. Federal Reserve in US), which involves influencing the demand and supply of money, primarily through the use of interest rates.

Inflation refers to a general increase in the price of goods and services and is caused by increase in the supply of money. Though it is hard to define/measure precisely, it is commonly measured by consumer price index (or CPI) which is weighted average change in the prices of consumer goods and services. Government love inflation but just to the right amount (~2%) as it makes debt cheaper (as dollar becomes cheaper), increases taxes (as prices increases) and increases exports. On the other hand, fixed income retirees and importers (such as oil) hate inflation.

A country is in recession when two successive quarters, or a six months, show a decrease in real GDP. Hence, during recession, we get a fall in the inflation rate. However, deflation is when we get a negative inflation rate i.e. falling prices. Since the second world war, recessions have not led to deflation – just a lower inflation rate. A depression is a severe recession.

			Fiscal Policy		Monetary Policy
		Set by:	US Congress, President and Treasury Secretary		Federal Reserve
		Measured by	Tax	Spending	Interest Rates	Treasury Bonds	Reserve requirements
Inflation	Prices ↑	Consumer Price Index (CPI)	↑	↓	↑	Sell	Increase
Recession	Low output and unemployment	GDP/GNP, Unemployment rate	↓	↑	↓	Buy	Decrease

For example: to reduce inflationary pressures,

a fiscal policy will involve higher taxes and lower spending. The advantage of using fiscal policy is that it will help to reduce the budget deficit, but is often difficult to implement due to political reasons.
a monetary policy (in this case termed as tight monetary policy) will attempt to reduce the supply of money by raising the interest rates and borrowing rates among banks, selling government bonds (also called as "open market operations" or OMO) and by increasing the rate of minimum reserve to be kept by commercial banks towards central bank.

Extreme situations:

Buying too many treasury bonds: During extreme recession or to avoid deflation, Fed buys lots of government debt so as to inject more money into economy and encourage growth. This is called as Quantitative Easing as was pioneered by the Bank of Japan in 1990s. In 2009 financial crisis, Fed and Bank of England purchased over $2 trillion and $520 billions government debt respectively.
Comprehensive Financial Reform Acts:

After 1929 stock market crash, Glass Steagall Act was enacted which separated investment banking from commercial/retail banking to prevent bank run. This act was repealed in 1999 by the Gramm-Leach-Bliley Act allowing the banks to invest depositors' funds in unregulated derivatives. This led to banks becoming too big to fail and requiring bailout in 2008-2009 to avoid another depression.
After 2008-2009 housing crisis, Dodd-Frank Act was passed which increased regulations on unregulated derivatives trading and created Consumer Finance Protection Bureau that took actions against predatory student lending practices and payday loan scams. Much of these regulations might be rolled back Financial CHOICE Act is enacted.

Stagflation (= Inflation + Recession at the same time) is a condition of slow economic growth and relatively high unemployment accompanied by a rise in prices, or inflation. It occurs due to supply shock (such rapid increase in price of oil) and due to policies that harm industry while growing money supply too quickly. Venezuela is an example of stagflation where if you fix inflation, you create depression and if you fix recession, you create hyperinflation.
When inflation rate increases by ~ 50% monthly, it is referred to as hyperinflation. Germany after Treaty of Versailles (322% inflation rate), Hungary in 1945 (19,000% inflation rate) and Zimbabwe in 2008 (230 million % inflation rate) had some of the examples of hyperinflation.

Economist track the economic growth, unemployment and inflation using the following three measurements:

1. Gross Domestic Product (GDP): is the value of all the finished goods and services (hence product) produced within a country's border (hence domestic) in a specific time period, usually a year.

- Since it is "gross", the depreciation in the capital asset is not deducted. If it where deducted, then it becomes "net".

- Since it is "domestic", it does not take into account the country's earning outside its geographical boundaries, or foreign remittances. Gross National Product (GNP) is GDP + income earned by residents from overseas investments - income earned within the domestic economy by overseas residents.

- Since it is "product", the intermediate goods are not taken into account. For example, wheat sold for final consumption to consumers will be included into GDP, but not the amount of wheat sold to bakeries for the production of bread.

- GDP is commonly used as an indicator of the economic health of a country and can be computed in three principal ways: from total output, from income and from expenditure.

2. Unemployment rate: is the number of unemployed people is divided by the number of people in the labor force times 100. Note, it defines unemployed people as those who are willing and available to work, and who have actively sought work within the past four weeks. Hence, little kids don't count, neither do people who are unable to work due to their age or disability or who chose not to work. The below figure shows the unemployment rate per country in 2013:

3. Inflation rate: measured by consumer price index (or CPI).

Reference:

1. Economics - Hubbard and O'Brien.
2. https://www.investopedia.com/
3. The Book of Money - Conaghan and Smith.
4. Macroeconomics: Crash Course: https://www.youtube.com/watch?v=d8uTB5XorBw
5. The Monetary System Visually Explained: https://www.youtube.com/watch?v=23DNe0cJhcU
6. https://www.thebalance.com/open-market-operations-3306121
7. https://www.bankrate.com/rates/interest-rates/prime-rate.aspx
8. https://www.thebalance.com/dodd-frank-wall-street-reform-act-3305688

Thursday, November 30, 2017

How to debug the unicode error in Py4J

Many Java-based big data systems such as SystemML, Spark's MLLib, CaffeOnSpark, etc that use the Py4J bridge in their Python APIs often throw a cryptic error instead of detailed Java stacktrace:

py4j.protocol.Py4JJavaError: < exception str() failed >

For more details on this issue, please refer to Python's unicode documentation.

Luckily, there is a very simple hack to extract the stack trace in such a situation. Let's assume that SystemML's MLContext is throwing the above Py4J error: ml.execute(script). In this case, simply wrap that code in the following try/except block:

Monday, October 16, 2017

Git notes for SystemML committers

Replace niketanpansare below with your git username:

1. Fork and then Clone the repository

https://github.com/niketanpansare/systemml.git

2. Update the configuration file: systemml/.git/config

[core]
symlinks = false
repositoryformatversion = 0
filemode = false
logallrefupdates = true
[remote "origin"]
url = https://github.com/niketanpansare/systemml.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[remote "upstream"]
url = https://github.com/apache/systemml.git
fetch = +refs/heads/*:refs/remotes/upstream/*
[remote "apache"]
url = https://gitbox.apache.org/repos/asf/systemml.git
fetch = +refs/heads/*:refs/remotes/apache/*

The above configuration states that: "origin" points to your fork and "upstream" points to the github mirror of the main repository, which is referred to as "apache".

Optional but recommended: Update the ~/.gitconfig file:

[user]
name = Niketan Pansare
email = npansar@us.ibm.com
[alias]
systemml-pr = "!f() { git fetch ${2:-upstream} pull/$1/head:pr-$1 && git checkout pr-$1; }; f"

3. Now, let's you created a branch 'foo' with 3 commits. You can create a PR via github website

4. Once the PR is approved and if you are the committer, you can merge the PR:

git checkout foo
git reset --soft HEAD~3 && git commit
git pull --rebase apache master
git checkout master
git pull apache master
git merge foo
gitk
git log apache/master..
git commit --amend --date="$(date +%s)"
git push apache master

Let's discuss the above commands in more detail:
- git checkout foo: ensures that you are working on the right branch.
- git reset --soft HEAD~3 && git commit: squashes all your commits into one single commit in preparation for the merge.
- git pull --rebase apache master: rebases your local master branch with the apache.
- git checkout master: switch to the local master branch
- git merge foo: merge the branch foo with local master
- gitk: to double-check visually if local and remote branches are all in synch and if the commit is not creating weird loops.

- git log apache/master..: double-check if the log makes sense and if you have added "Closes #PR." at the end of the log and correct JIRA number as the header.
- git commit --amend --date="$(date +%s)": if not, allows you to modify your log message.
- git push apache master: go ahead and push your commit to apache :)

5. If you are a committer and want to merge a PR (let's assume PR 100), you need to create a local branch using the following command:

git systemml-pr 100
git checkout pr-100

Also, if you had to squash the commits as described above, please use the below command to amend the author:

git commit --amend --author="FirstName LastName <contributor@blah.com>"

6. If you have modified the docs/ folder, you can update the documentation:

git subtree push --prefix docs apache gh-pages

Before, updating the documentation, you can sanity test on your branch by:

git subtree push --prefix docs origin gh-pages

If the above command fails, you can delete the gh-pages branch and re-execute the command:

git push origin --delete gh-pages

Advanced commands:
1. Cherry-picking:

git log --pretty=format:"%h - %an, %ar : %s" | grep Niketan
git checkout gpu
git cherry-pick

Notes from other committers:
Deron: https://gist.github.com/deroneriksson/e0d6d0634f3388f0df5e#integrate-pull-request-with-one-commit-into-apache-master
Mike: https://gist.github.com/dusenberrymw/78eb31b101c1b1b236e5

Friday, November 13, 2015

Using Google's TensorFlow for Kaggle competition

Recently, Google Brain team released their neural network library 'TensorFlow'. Since Google has a state-of-the-art Deep Learning system, I wanted to explore TensorFlow by trying it out for my first Kaggle submission (Digit Recognition) . After spending few hours getting to know their jargons/APIs, I modified their multilayer convolution neural network and came 140th (with 98.4% accuracy) ... not too shabby for my first submission :D.

If you are interested in outscoring me, apply cross-validation on the below python code or may be consider using ensembles:

from input_data import *
import pandas

class DataSets(object):
 pass

mnist = DataSets()
df = pandas.read_csv('train.csv')

train_images = numpy.multiply(df.drop('label', 1).values, 1.0 / 255.0)
train_labels = dense_to_one_hot(df['label'].values)

#Add MNIST data from Yan LeCun's website for better accuracy. We hold out test, just for accuracy sake, but could have easily added it :)

mnist2 = read_data_sets("/tmp/data/", one_hot=True)
train_images = numpy.concatenate((train_images, mnist2.train._images), axis=0)
train_labels = numpy.concatenate((train_labels, mnist2.train._labels), axis=0)

VALIDATION_SIZE = 5000
validation_images = train_images[:VALIDATION_SIZE]
validation_labels = train_labels[:VALIDATION_SIZE]
train_images = train_images[VALIDATION_SIZE:]
train_labels = train_labels[VALIDATION_SIZE:]

mnist.train = DataSet([], [], fake_data=True)
mnist.train._images = train_images
mnist.train._labels = train_labels

mnist.validation = DataSet([], [], fake_data=True)
mnist.validation._images = validation_images
mnist.validation._labels = validation_labels

df1 = pandas.read_csv('test.csv')
test_images = numpy.multiply(df1.values, 1.0 / 255.0)
numTest = df1.shape[0]
test_labels = dense_to_one_hot(numpy.repeat([1], numTest))
mnist.test = DataSet([], [], fake_data=True)
mnist.test._images = test_images
mnist.test._labels = test_labels

import tensorflow as tf
sess = tf.InteractiveSession()
x = tf.placeholder("float", [None, 784])   # x is input features
W = tf.Variable(tf.zeros([784,10]))           # weights 
b = tf.Variable(tf.zeros([10]))                    # bias
y_ = tf.placeholder("float", [None,10])   # y' is input labels
#Weight Initialization
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)
  
def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)
  
# Convolution and Pooling
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
  
# First Convolutional Layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Second Convolutional Layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Densely Connected Layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# Dropout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout Layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
tf.initialize_all_variables().run()
for i in range(20000):
 batch = mnist.train.next_batch(50)
 train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print "test accuracy %g"%accuracy.eval(feed_dict={x: mnist2.test.images, y_: mnist2.test.labels, keep_prob: 1.0})
prediction = tf.argmax(y,1).eval(feed_dict={x: mnist.test.images, keep_prob: 1.0})

f=open('prediction.txt','w')
s1='\n'.join(str(x) for x in prediction)
f.write(s1)
f.close()

The above script assumes that you have downloaded train.csv and test.csv from Kaggle's website, input_data.py from TensorFlow's website and installed pandas/TensorFlow.

Here are few observations based on my experience playing with TensorFlow:

TensorFlow does not have an optimizer:

TensorFlow statically maps an high-level expression (for example "matmul") to a predefined low-level operator (for example: matmul_op.h) based on whether you are using CPU or GPU enabled TensorFlow. On other hand, SystemML compiles a matrix multiplication expression (X %*% y) into one of many matrix-multiplication related physical operators using a sophisticated optimizer that adapts to the underlying data and cluster characteristics.
Other popular open-source neural network libraries are Caffe, Theano and Torch.

TensorFlow is a parallel, but not a distributed system:

It does parallelize its computation (across CPU cores and also across GPUs):
Google has released only the single-node version and kept distributed version in-house. This means that the open-sourced version does not have a parameter server.

TensorFlow is easy to use, but difficult to debug:

I like TensorFlow's Python API and if the script/data/parameters are all correct, it works absolutely fine :)
But, if something fails, the error messages thrown by TensorFlow are difficult to decipher. This is because the error messages point to a generated physical operator (for example: tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input), not to the line of code in the Python program.

TensorFlow is slow to train and not yet robust enough:

Here are some initial numbers by Alex Smola comparing TensorFlow to other open-source deep learning systems: