Tuesday, October 11, 2011

How to analyze algorithms?

Important assumptions:

  • This process of analysis should be independent of programming language and machine architecture. So you can't just write your algorithm in C++ and compare its running time against that of some other algorithm written in Java.
  • Also, since an algorithm can take an arbitrary data structure as input (e.g., a graph, an array, a binary object, a tree, ...), there has to be some standard way to quantify the input. For the sake of our analysis, we use the input size n as that measure: for example, the length of an array, the number of vertices or edges of a graph, or the number of nodes of a tree.
A detailed explanation of these two points is given in Introduction to Algorithms by Cormen, Leiserson, Rivest, et al. (CLR)

Setup:

Input of size n ==> Algorithm that takes T(n) steps ==> some output

The number of steps an algorithm takes often depends on factors other than the input size. For example, quicksort takes a different number of steps on the input [1, 2, 3, 4, 5] than on the input [5, 4, 3, 2, 1], even though both have the same input size (and, in this case, the same elements).
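To see this input-dependence concretely, here is a minimal Python sketch (my own illustration, not from CLR) that counts element comparisons in a naive quicksort that always picks the first element as pivot; an already-sorted input triggers the worst case of this pivot choice, while a more shuffled input does better:

```python
def quicksort(arr, counter):
    """Naive quicksort (first element as pivot) that counts comparisons."""
    if len(arr) <= 1:
        return arr
    pivot, rest = arr[0], arr[1:]
    counter[0] += len(rest)  # each remaining element is compared to the pivot
    less = [x for x in rest if x < pivot]
    geq = [x for x in rest if x >= pivot]
    return quicksort(less, counter) + [pivot] + quicksort(geq, counter)

for data in ([1, 2, 3, 4, 5], [2, 4, 1, 5, 3]):
    counter = [0]
    quicksort(data, counter)
    print(data, "->", counter[0], "comparisons")
# [1, 2, 3, 4, 5] -> 10 comparisons (worst case for this pivot choice)
# [2, 4, 1, 5, 3] -> 6 comparisons
```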
Hence, we will try to find three functions (an upper bound, a lower bound, and a Theta bound) that capture the effects of these factors: $$ g_1(n), g_2(n) \text{ and } g_3(n) \text{ such that } $$ $$ T(n) \in O(g_1(n)), T(n) \in \Omega(g_2(n)) \text{ and } T(n) \in \Theta(g_3(n))$$ Many authors also write the above relations as: $$ T(n) = O(g_1(n)), T(n) = \Omega(g_2(n)) \text{ and } T(n) = \Theta(g_3(n))$$

Big-O notations:

Now, let's understand the definitions of the above notations: \begin{align} \text{To prove that } & g_1(n) \text{ is an upper bound of } T(n) \text{, i.e., } T(n) \in O(g_1(n)), \\ \text{find two positive constants } & c, n_0 \text{ such that } \\ & \forall n \geq n_0, \; T(n) \leq c \, g_1(n) \end{align} For a lower bound, change the ≤ sign to ≥, and for a Theta bound find both a lower and an upper bound: $$ T(n) \in \Theta(g_3(n)) \text{ iff } T(n) \in O(g_3(n)) \text{ and } T(n) \in \Omega(g_3(n))$$
To understand these in more detail, I suggest you solve problems from CLR chapter 3. Here are some examples you can start with: \begin{align} 2n^2 \in O(n^3) & \text{ with } c=1 \text{ and } n_0 = 2 \\ \sqrt{n} \in \Omega(\lg n) & \text{ with } c=1 \text{ and } n_0 = 16 \\ \frac{n^2}{2} - 2n \in \Theta(n^2) & \text{ with } c_1=\frac{1}{4}, c_2=\frac{1}{2} \text{ and } n_0 = 8 \\ \end{align} (Note: an arbitrarily large function is always a valid upper bound, and an arbitrarily small one is always a valid lower bound. So, it is necessary to supply as tight a bound as possible for any given algorithm. Mathematically, little-o denotes an upper bound that is not asymptotically tight and little-omega a lower bound that is not asymptotically tight, but we will ignore them for the sake of this discussion.)
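These constants can also be sanity-checked numerically. Here is a small Python sketch of my own (a spot check over a finite range is evidence, not a proof; a real proof uses the definition above):

```python
import math

def holds_big_o(T, g, c, n0, n_max=10_000):
    """Check T(n) <= c*g(n) for all n0 <= n < n_max (a spot check, not a proof)."""
    return all(T(n) <= c * g(n) for n in range(n0, n_max))

# 2n^2 in O(n^3) with c = 1, n0 = 2
print(holds_big_o(lambda n: 2 * n**2, lambda n: n**3, c=1, n0=2))         # True
# sqrt(n) in Omega(lg n) with c = 1, n0 = 16: flip the inequality
print(all(math.sqrt(n) >= 1 * math.log2(n) for n in range(16, 10_000)))  # True
```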

Common measures for algorithms:

The above three notations are often conflated with the terms 'worst case', 'best case' and 'average case' complexity. But the more precise definitions are as follows: \begin{align} \text{Worst case: } & T(n) = \max \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Best case: } & T(n) = \min \{ T(x) \mid x \text{ is an instance of size } n \} \\ \text{Average case: } & T(n) = \sum_{\text{input } x} \Pr(x) \, T(x) \end{align} This means that to find the average case complexity of an algorithm, you need to know the probability of each input and use it to compute the expectation of T(n).
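For example (a standard exercise, assuming the key is present and each of the n positions is equally likely), the average number of comparisons made by sequential search is: $$ T(n) = \sum_{i=1}^{n} \frac{1}{n} \cdot i = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2} \in \Theta(n) $$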

Pseudo code to Big-O:

Now, let's address the key question: given pseudo code, how do we find its lower bound (or upper bound, or Theta bound)? Here is one possible way to find those bounds (a worked example follows the list):
  • First, express the algorithm in recursive form
  • Determine appropriate metric for input size 'n'
  • Then, find recurrence relation for the algorithm
  • Solve the recurrence relation using either the substitution method, the recursion tree method, or the Master theorem
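As an illustration of the first three steps (my own sketch), here is binary search expressed recursively; the input size n is the number of elements still under consideration, each call does O(1) work, and each call recurses on half the range, giving T(n) = T(n/2) + O(1):

```python
def binary_search(arr, key, lo=0, hi=None):
    """Recursive binary search over a sorted list; returns an index or -1."""
    if hi is None:
        hi = len(arr)
    if lo >= hi:              # base case: T(0) = O(1)
        return -1
    mid = (lo + hi) // 2      # O(1) work per call
    if arr[mid] == key:
        return mid
    if arr[mid] < key:
        return binary_search(arr, key, mid + 1, hi)  # recurse on ~n/2 elements
    return binary_search(arr, key, lo, mid)          # => T(n) = T(n/2) + O(1)

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
```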

For example, the recurrence relations for popular algorithms are: \begin{align} \text{Binary search: } & T(n) = T(\tfrac{n}{2}) + O(1) \\ \text{Sequential search: } & T(n) = T(n-1) + O(1) \\ \text{Tree traversal: } & T(n) = 2T(\tfrac{n}{2}) + O(1) \\ \text{Selection sort: } & T(n) = T(n-1) + O(n) \\ \text{Merge sort: } & T(n) = 2T(\tfrac{n}{2}) + O(n) \end{align} Try writing the pseudo code for any of the above algorithms and verify the recurrence relations; a merge sort sketch is given below.
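Here is a merge sort sketch (mine, not CLR's pseudo code) annotated with where each term of its recurrence comes from:

```python
def merge_sort(arr):
    """Merge sort: two half-size recursive calls plus a linear merge,
    i.e. T(n) = 2T(n/2) + O(n)."""
    if len(arr) <= 1:               # base case: T(1) = O(1)
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])    # T(n/2)
    right = merge_sort(arr[mid:])   # T(n/2)
    merged, i, j = [], 0, 0         # merge step: O(n)
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```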

Master theorem:

The Master theorem can only solve recurrence relations of the form T(n) = a T(n/b) + f(n), with a ≥ 1 and b > 1. Here is my modus operandi for the Master theorem: \begin{align} \text{If } f(n)=\Theta(n^{\log_b a}\lg^{k}n) \text{ for some } k \geq 0, & \text{ then } T(n) = \Theta(n^{\log_b a} \lg^{k+1}n) \\ \text{Else if } n^{\log_b a} \text{ is polynomially greater than } f(n), & \text{ then } T(n) = \Theta(n^{\log_b a}) \\ \text{Else if } f(n) \text{ is polynomially greater than } n^{\log_b a} \text{ (and satisfies the regularity condition)}, & \text{ then } T(n) = \Theta(f(n)) \\ \text{Else use the substitution method} & \end{align}
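For example, applying this recipe to two of the recurrences from the previous section: \begin{align} \text{Merge sort: } & a=2, b=2, f(n) = n = n^{\log_2 2} \lg^0 n & \Rightarrow T(n) = \Theta(n \lg n) \\ \text{Binary search: } & a=1, b=2, f(n) = 1 = n^{\log_2 1} \lg^0 n & \Rightarrow T(n) = \Theta(\lg n) \end{align}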
Here are solutions to two important recurrence relations that cannot be solved using the Master theorem: \begin{align} \text{Recurrence relation} & & \text{Complexity} \\ T(n) = T(n-1) + b n^k & \Rightarrow & O(n^{k+1})\\ T(n) = cT(n-1) + b n^k \text{ (for } c > 1\text{)} & \Rightarrow & O(c^n) \end{align}
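As a quick check of these formulas, two classic recurrences instantiate them directly (the second is the Tower of Hanoi recurrence, used here purely as an example): \begin{align} \text{Selection sort: } & T(n) = T(n-1) + O(n) & (k=1) & \Rightarrow O(n^2) \\ \text{Tower of Hanoi: } & T(n) = 2T(n-1) + O(1) & (c=2, k=0) & \Rightarrow O(2^n) \end{align}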
For recurrence relations of other forms, refer to CLR for the substitution and recursion tree methods.
(In short, for the substitution method, you guess the solution and then use induction to find the constants and to prove that the guessed solution is correct.
The recursion tree method is often used to generate the guess for the substitution method.)
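For example (the standard textbook argument), to show that T(n) = 2T(n/2) + n is O(n \lg n), guess T(n) \leq c n \lg n and substitute the guess into the recurrence: \begin{align} T(n) & \leq 2 \left( c \, \frac{n}{2} \lg \frac{n}{2} \right) + n \\ & = c n \lg n - c n + n \\ & \leq c n \lg n \quad \text{for } c \geq 1 \end{align}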

Some useful approximations:

For non-negative real b, c and d, with b > 1, and with c > 1 where it appears as the base of the exponential sum:
\begin{align} \text{Polynomial growth rate: } & \sum_{i=1}^n i^c = \Theta(n^{c+1}) \\ \text{Logarithmic growth rate: } & \sum_{i=1}^n \frac{1}{i} = \Theta(\log n) \\ \text{Exponential growth rate: } & \sum_{i=1}^n c^i = \Theta(c^n) \\ n\text{-}\log(n)\text{ growth rate: } & \sum_{i=1}^n \log^c i = \Theta(n \log^c n) \\ \text{Combinations of the above: } & \sum_{i=1}^n \log^c i \; i^{d} = \Theta(n^{d+1} \log^c n) \\ & \sum_{i=1}^n \log^c i \; i^d \, b^{i} = \Theta(n^{d} \log^c n \; b^{n}) \\ \text{Stirling's approximation: } & n! = \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n \left(1 + \Theta\left(\frac{1}{n}\right)\right) \end{align}
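These are easy to sanity-check numerically. A small Python sketch of my own, checking the logarithmic growth rate and Stirling's approximation:

```python
import math

n = 1000
harmonic = sum(1 / i for i in range(1, n + 1))
print(harmonic / math.log(n))    # ~1.08, consistent with Theta(log n)

# Compare log(n!) with the log of Stirling's formula to avoid float overflow.
log_fact = math.lgamma(n + 1)    # lgamma(n + 1) = log(n!)
log_stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
print(math.exp(log_fact - log_stirling))  # ~1.00008, i.e. 1 + Theta(1/n)
```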