Thursday, February 02, 2012

Cross-validation on SVM-Light

Before reading this blogpost you should read: Practical guide to libSVM, Wikipedia entry on SVM, Wikipedia entry on cross-validation, and PRML - Bishop's chapter on SVM.  In this post, I will not explain how to use open source SVM tools (like libsvm or svmlight), but would rather assume that you have used them and want some clever script to automate them. Also note that, the below script uses default parameters. So if you get poor results, you can add an outer for loop that iterates through your choice of parameters and supplies them to the program ./svm_learn.

To perform cross-validation on libsvm:
  1. Download and extract it
  2. You need gnuplot installed on the machine you want to run libsvm. If you want to run it on the server, do 'ssh -X myuserName@myMachineName'
  3. make
  4. Go to Tools folder and run 'python easy.py my_training_file my_test_file'
To perform cross-validation on svmlight:
  1. Download and extract it. There you will find two important binaries: svm_learn and svm_classify
  2. Unfortunately the data gathering using svmlight is not easy, so I have coded following script. 
#!/bin/bash
# Usage: svm_light_cross_validation full_data num_train_items num_test_items k_cycles results_file
RESULT_FILE=$5

echo "Running SVM-Light via cross validation on" $1 "by using" $2 "training items and" $3 "test items (Total number of cross-validation cycles:" $4 > $RESULT_FILE

MODEL_FILE="model."$RANDOM".txt"
TEMP_FILE="tempFile."$RANDOM".txt"
PRED_FILE="prediction."$RANDOM".txt"
DATA_FILE=$1
NUM_TRAIN=$2
NUM_TEST=$3
NUM_CYCLES=$4

TEMP_DATA_FILE=$DATA_FILE"."$RANDOM".temp"
TRAIN_FILE=$TEMP_DATA_FILE".train"
TEST_FILE=$TEMP_DATA_FILE".test"

TEMP_RESULT=$RESULT_FILE".temp"
SVM_PATTERN='s/Accuracy on test set: \([0-9]*.[0-9]*\)% ([0-9]* correct, [0-9]* incorrect, [0-9]* total)\.*/'
for k in `seq 1 $NUM_CYCLES`
do
 sort -R $DATA_FILE > $TEMP_DATA_FILE
 head -n $NUM_TRAIN $TEMP_DATA_FILE > $TRAIN_FILE
 tail -n $NUM_TEST $TEMP_DATA_FILE > $TEST_FILE
 
 echo "------------------------------------------"  >> $RESULT_FILE
 echo "Cross-validation cycle:" $k >> $RESULT_FILE
 
 # first run svm with default parameters
 echo ""  >> $RESULT_FILE
 echo "Polynomial SVM with default parameters" >> $RESULT_FILE
 for i in 1 2 3 4 5 6 7 8 9 10
 do
  echo "order:" $i >> $RESULT_FILE
  ./svm_learn -t 1 -d $i $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE > $TEMP_RESULT
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k poly $i \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

 echo ""  >> $RESULT_FILE
 echo "RBF SVM with default parameters" >> $RESULT_FILE
 for g in 0.00001 0.0001 0.001 0.1 1 2 3 5 10 20 50 100 200 500 1000
 do
  echo "gamma:" $g >> $RESULT_FILE
  ./svm_learn -t 2 -g $g $TRAIN_FILE $MODEL_FILE > $TEMP_FILE
  ./svm_classify -v 1 $TEST_FILE $MODEL_FILE $PRED_FILE >> $TEMP_FILE
  cat $TEMP_RESULT >> $RESULT_FILE
  sed '/^Reading model/d' $TEMP_RESULT > $TEMP_RESULT"1"
  sed '/^Precision/d' $TEMP_RESULT"1" > $TEMP_RESULT
  sed "$SVM_PATTERN$k rbf $g \1/g" $TEMP_RESULT >> "better"$RESULT_FILE
 done

done

rm $MODEL_FILE $TEMP_FILE $PRED_FILE $TEMP_DATA_FILE $TEMP_RESULT $TEMP_RESULT"1"
echo "Done." >> $RESULT_FILE

Here is how you run this script:
./svm_light_cross_validation myDataInSVMLightFormat.txt 5000 2000 10 MyResults.txt

myDataInSVMLightFormat.txt looks like "binaryClassifier(-1 or 1) featureNum:value ....". Example:
1 1:1.75814e-14 2:1.74821e-05 3:1.37931e-08 4:1.25827e-14 5:4.09717e-05 6:1.28084e-09 7:2.80137e-22 8:2.17821e-24 9:0.00600121 10:0.002669 11:1.19775e-05 12:8.24607e-15 13:8.36358e-10 14:0.0532899 15:3.25028e-06 16:0.148439 17:9.76215e-08 18:0.00148927 19:3.69801e-16 20:4.20283e-16 21:7.9941e-05 22:5.7593e-09 23:0.000251052 24:0.000184218 25:7.07359e-06 

After running this script you will get two important files: MyResults.txt and betterMyResults.txt (which is a space separated file with format: 'validation_cycle_num [poly or rbf] parameterValue accuracy'). You can then import betterMyResults.txt in R or Matlab and get some pretty graphs :)

Wednesday, February 01, 2012

Using VIM for "fast" data transformation

Recently, I had to give my data to a lab-mate. I had done some pre-processing on 20 Newsgroup dataset and converted it to following format:
- Each line represented a document.
- Words were replaced by an integer that denoted it's index in vocabulary.txt file
- The line was append with a binary classifier (1 or -1) based on whether the document was in "comp" newsgroup or "sci" newsgroup.
- The words were represented along with their frequency.
For example, let's assume for now the word "technology" is 1002th word in the vocabulary. Then, if the word "technology" appeared 10 times in 12th document (in the folder comp.graphics), then 12th line in my file will be:
1 1002:10 ...

Here is a sample line in the file:
1 3798:1 9450:1 12429:1 13352:1 14155:1 15858:1 22319:1 29652:1 31220:1 34466:2 35883:1 37734:1 40188:1 40683:1
1 90:1 1078:1 2101:2 3775:1 5183:2 5195:1 5747:1 7908:1 7963:1 9748:1 11294:3 14879:1 16006:2 18188:1 19742:1 20928:1 21321:1 21935:1 23613:1 25354:2 26721:1 29652:1 30407:1 34054:1 36546:2 38252:1 39376:2 40204:1
Now, I had to convert this to following format:
fileNum | wordIndex | frequency
For example:
3|3798|1
3|9450|1
3|12429|1
3|13352|1
3|14155|1
I used following VIM tricks:
1. First delete the binary classifiers (and also blank spaces at the end of the line):

:%s/^-1//g
:%s/^1//g
:%s/ $//g
2. Then, comes the most important part: Insert line number before every word.
To do this, I used following trick: - First replace spaces by some arbitrary character, in my case '='
- Then, replace that character by the line number !!!
:%s/ / =/g
:g/=/exec "s/=/ ".line(".")."|/g"
The last command needs some explanation: 
It says, execute a command globally whenever you see '=' character: :g/=/exec some-command
In our case the command is replacing that character by the line number. The line number of a given line can be found using VIM's internal function: line(".")
Hence, the input file is converted into following format:
3|3798:1  3|9450:1  3|12429:1  3|13352:1  3|14155:1  3|15858:1  3|22319:1  3|29652:1  3|31220:1  3|34466:2  3|35883:1  3|37734:1  3|40188:1  3|40683:1
  4|90:1  4|1078:1  4|2101:2  4|3775:1  4|5183:2  4|5195:1  4|5747:1  4|7908:1  4|7963:1  4|9748:1  4|11294:3  4|14879:1  4|16006:2  4|18188:1  4|19742:1  4|20928:1  4|21321:1  4|21935:1  4|23613:1  4|25354:2  4|26721:1  4|29652:1  4|30407:1  4|34054:1  4|36546:2  4|38252:1  4|39376:2  4|40204:1
3. Now, to some garbage cleaning and trivial replacement:
:%s/:/|/g
:%s/  / /g
:%s/^ //g
4. Finally, replace spaces by a newline character:
:%s/ /!!!!!!/g where !!!!!! is press Control+V and then press Enter 
Control-V is the special character escape key. Remember the character '\n' won't work here.

And voila you are done !!!

Tuesday, January 17, 2012

Iris - siri like software

Recently I have started a pet project Iris1 while doing background study for my recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.


Portable:
Since it is written in Java, it can run on most OS, including Mac, Windows and Linux. I have only tested it on Mac Lion, so I welcome any beta testers using other OS. For now, some of the features like play itunes, open email, etc are only available on Mac. I plan to add these features on other OS as well.

Configurable:
Most of the features such as use of VOICE or TEXT as feedback mechanism, as well as feedback message are configurable through the configuration file ''user_scripts/iris.config". You can also modify any of the predefined apple scripts (for eg: playSong.scpt) in that folder which will be called by Iris on the voice cammand. Also, for advance users, there are ten function commands that he/she can use to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run following command:
 
java -Xmx200m -jar bin/Iris.jar
The above command asks Java to run the program Iris.jar (in bin folder) and also specifies that it should not use more than 200 MB of memory.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if have no internet connection. This also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators that would help me with following features:
  1. User-friendly: Improving voice commands
  2. More accuracy: Expanding dictionary to include more words and also to add different 'user accents' for a given word.
  3. Expanding grammar to include arbitrary spoken words. 
  4. Apple scripts to accessing contacts and recognizing contact names to assist users with various functionality.
  5. Adding GUI interface rather than command line. Since all the responses go through OutputUtility.java, this is going to be easy.
  6. New features: Support for RSS, Top News, Weather, Popular city names, Dictionary/Theasurus, Web/Wikipedia search, Dictation (to text/messages/events/todo item/alarm/reminders), Jokes, Motivational quotes, More Personalization.
If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
1 Iris is name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also the anagram of the word 'Siri' (a popular Apple's software for IPhone 4s).

Monday, January 02, 2012

C++ workflow

Recently I had a memory bug that took me 2 days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to completely random place that led me to believe STL's stringstream was the culprit. Since there was no way I could have definitively say it's not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of code using a less elegant solution using arrays. Now gdb started pointing to STL's vector.clear() method and believe it or not I was using std::string to store in it, not some user defined class …. I had it … there is no way in the world I am going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector < str::string >). But, not knowing the bug in a 'single threaded' 'command-line based' C++ code with access to gdb and pretty printing was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me, due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is it is one of the few memory bugs that valgrind does not report in its final assessment and my best guess is that I was writing over vector's header location (and vector only complains about that during its next operation (in my case was clear()) ... which for all intensive purposes is random).

Embarrassed by this stupid mistake that wasted my two days, I felt humbled to revisit my C++ work-flow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your makefile that does static analysis (make my_proj).
- Test it on local machines using 'make my_test'
- Run valgrind if necessary and check against below given valgrind checklist
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

  1. Never ever start coding without an SVN (or cvs or git, etc.). It will save you lot of headaches if your machine crashes or if you want to revert back to a previous version.
    If the project has multiple coders/testers/collaborators, it helps if you have some kind of project-management tool like trac or jira. Also, if you are going to release your code to public, you might want to consider project hosting sites like Google Code or source forge.
    Though this might seem obvious, don't host the svn server on personal machine or at least make sure to take regular backups if you do.
  2. Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc) and a user-friendly README file. (Advance coders can consider automate). Here is a sample Makefile of my current project:
     
    CC = g++
    DEBUGFLAGS = -g -Wall -Wextra
    MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
    LIBS = -lgsl -lgslcblas -lm -lstdc++
    
    HEADER = Helper.h MathHelper.h LDA.h Configuration.h
    SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
    OUTPUT = spoken_web
    
    spoken_web:     $(SRC) $(HEADER)
            $(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)
    
    doc:    $(HEADER) spokenweb_doxygen_configuration
            doxygen spokenweb_doxygen_configuration 
    
    r_data: GenerateData.R
            R --slave < GenerateData.R
    
    style: $(SRC) $(HEADER)
            astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)
    
    clean:
            rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt *.data
    
  3. Use a document generator like doxygen. It forces you to write much cleaner documentation and more importantly cleaner interface. Trust me it will help if you had to reuse your code in future.
    make doc
    
  4. Use code beautifier like astyle that not only shields you against inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading' (Don't you clean gym equipment after workout, well, why not do the same with your code). Here is sample style rules that I love:
     
    --style=java
    --indent=spaces=2
    --indent-classes
    --indent-switches
    --indent-preprocessor
    --indent-col1-comments
    --min-conditional-indent=0
    --max-instatement-indent=80
    --unpad-paren
    --break-closing-brackets
    --add-brackets
    --keep-one-line-blocks
    --keep-one-line-statements
    --convert-tabs
    --align-pointer=type
    --align-reference=type
    --lineend=linux
    --suffix=none
    
    If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
    make style
    
  5. Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
    • g++'s flags (See DEBUGFLAGS in my makefile)
    • Static analysis tools like cppcheck (For more cases, I am satisfied with the above g++'s flags)
    • Runtime memory analyzer tool like valgrind. (Please see below given Valgrind checklist, and also my previous blogpost).
    • gdb: For stepping through the code for more comprehensive debugging
    There are several paid tools that do much better job of analyzing your code that those mentioned above. There are 
  6. I like to follow standard directory structure:
    ./           Makefile and configure scripts
    ./src       cc , h files (Use subdirectories and namespaces if necessary)
    ./bin      build directory
    ./data    data required for testing/running the code
    ./test      test drivers
    ./doc     documentation generated by doxygen
  7. Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for bayesian programs, which has very ugly interface and requires some amount of book-keeping. So, I have coded a wrapper class that gives an R like interface which makes my code extremely readable, less error prone and I don't need to worry about any book-keeping or other numerical issues while calling them.
     
    /// \defgroup StatDistribution R-like API for family of probability distributions
    /// Naming convention:
    /// I have followed the naming conventions of R programming language. 
    /// Please see its documentation for further details
    ///
    /// Here is a brief intro:
    /// Every distribution that R handles has four functions. There is a root name, 
    /// for example, the root name for the normal distribution is norm. 
    /// This root is prefixed by one of the letters
    ///   - d for "density", the density function
    ///   - r for "random", a random variable having the specified distribution
    /// Example: For the normal distribution, these functions are dnorm, and rnorm.
    ///
    /// Important note: Since most bayesian calculation is done in log space, (almost) 
    /// every density function is capable of outputing 'numerically robust' logarithmic value. 
    /// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
    ///
    /// Suported continuous distribution:
    /// - Uniform (unif - r,d)
    /// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
    /// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
    /// - Beta (beta - r,d)
    /// - Inverse Gamma (invgamma - r,d)
    ///
    /// Suported discrete distribution:
    /// - Discrete Uniform (dunif - r,d)
    /// - Dirichlet (dirichlet - r,d)
    /// - Multinomial (multinom - r,d)
    /// - Poisson (pois - r,d)
    /// @{
    
    /// @name Continuous Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max)
    double runif(double min, double max);
    double dunif(double x, double min, double max, bool returnLog = false);
    ///@}
    
    /// @name Discrete Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max) --> i.e. [min, max-1]
    int rdunif(int min, int max);
    double ddunif(int x, int min, int max, bool returnLog = false);
    ///@}
    
    /// @name Univariate Normal Distribution
    ///@{
    double rnorm(double mean, double sd);
    double dnorm(double x, double mean, double sd, bool returnLog = false);
    ///@}
    
    /// @name Multinomial distribution
    /// Both prob and retVal should be of length sizeOfProb
    /// size, say N, specifying the total number of objects that are put into 
    /// K (or sizeOfProb) boxes in the typical multinomial experiment.
    /// If the array prob is not normalized then its entries will be treated as 
    /// weights and normalized appropriately. Furthermore, this function also allows 
    /// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
    /// Eg: double prob[3] = {0.1, 0.2, 0.7}; double retVal[3]; rmultinom(1, 3, prob, retVal, false);
    ///@{
    void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
    double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
    ///@}
    
    So to sample N(10,2), instead of having code like:
     
    gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
    double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
    gsl_rng_free(rng_obj);
    
    I have to use following code (which if you have coded in R is more readable):
     
    StatisticsHelper myStatHelper;
    double sampledVal = myStatHelper.rnorm(10, 2);

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:
  1. The heap summary (only returns memory leaks, not memory corruption errors) Eg:
     
    ==19691== HEAP SUMMARY:
    ==19691==     in use at exit: 0 bytes in 0 blocks
    ==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
    ==19691== 
    ==19691== All heap blocks were freed -- no leaks are possible
    
    If number of allocs and the number of frees will differ in o/p, the try
    valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

    Steps to avoid memory leaks:
    - If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
    - Delete every object created by new (especially in case of exceptions).
  2. Intermediate warning/error messages irrespective of the heap summary. Eg:
     
    ==15633== Invalid write of size 4
    ==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
    ==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
    ==15633==    by 0x402E26: main (main.cpp:34)
    
    Note, sometime one error can propagate and throw error messages at different places, so don't panic.

    The intermediate error messages can detect:
    • Illegal read / Illegal write errors (like the message 'Invalid read of size 4' above): It can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with '--read-var-info=yes' flag.
      Common memory corruption bugs like these are:
      - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
      - Using an unallocated object
       
      my_class* obj_ptr;
      obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
      
      - Using a freed object
      Note: Memcheck (tool of valgrind) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there is no memory corruption errors.
    • Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
    • Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
    • Illegal free/deletions: (Invalid free())
      - Calling free() (or delete) on the object that is already freed (or deleted).
      - Deleting object that is created on stack.
    • Mismatched operators: (Mismatched free() / delete / delete [])
      - Calling free() on the object that is created by new operator.
      - Calling delete on the object that is created by malloc().
      - Calling delete[] on the object that is created by the new operator.
    • Overlapping source and destination blocks: (Source and destination overlap in ...)
      - Look out functions like memcpy, strcpy, strncpy, strcat and strncat.

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

Tuesday, October 11, 2011

How to analyze algorithms ?

Important assumptions:

  •  This process of analysis should be independent of programming language and machine architecture. So you can't just write your algorithm in C++ and compare it using running time of some other algorithm written in Java. 
  • Also, since each algorithm can take as an input arbitrary data structure (eg: graph, array, binary object, tree, ...), there has to be some standard way to specify it. For sake of our analysis, we use input size (n) as the measure for our analysis. For eg: length of array, number of vertices or edges of the graph, number of nodes of a tree etc.
The detailed explanations of these two points is given in Introduction to Algorithms by Cormen, Leiserson, Rivest, .. (CLR)

Setup:

Input of size n ==> Algorithm that takes T(n) steps ==> some output

The number of steps in the algorithm often depends on factors other than the input size. For eg: the number of steps for quicksort for the input [1, 2, 3, 4, 5] is different than as compared to the input [5, 4, 3, 2, 1]; even though both have same input size (and in this case same elements).
Hence, we will try to find 3 functions (upper bound, lower bound and Theta bound) that can capture the effects of these factors: $$ g_1(n), g_2(n) \text{ and } g_3(n) \text{ such that } $$ $$ T(n) \in O(g_1(n)), T(n) \in \Omega(g_2(n)) \text{ and } T(n) \in \Theta(g_3(n))$$ Many authors also write above equations as: $$ T(n) = O(g_1(n)), T(n) = \Omega(g_2(n)) \text{ and } T(n) = \Theta(g_3(n))$$

Big-O notations:

Now, let's understand definition of above notations: \begin{align} \text{To prove that } & g_1(n) \text{ is upper bound of T(n), i.e.} T(n) \in O(g_1(n)) \\ \text{ find two constants } & c, n_0 \text{ such that } \\ & \forall n \geq n_0, T(n) \leq c g_1(n) \end{align} For lower bound, change ≤ sign to ≥ sign and for theta bound find both lower and upper bound: $$ T(n) \in \Theta(g_3(n)) \text{ iff } T(n) \in O(g_3(n)) \text{ and } T(n) \in \Omega(g_3(n))$$
To understand these in more details, I suggest you solve problems from CLR chapter 3. Here are some example you can start with: \begin{align} 2n^2 \in O(n^3) & \text{ with } c=1 \text{ and } n_0 = 2 \\ \sqrt{n} \in \Omega(lg n) & \text{ with } c=1 \text{ and } n_0 = 16 \\ \frac{n^2}{2} - 2n \in \Theta(n^2) & \text{ with } c_1=\frac{1}{4}, c_2=\frac{1}{2} \text{ and } n_0 = 8 \\ \end{align} (Note: theoretically very large functions such as infinity^n or n^infinity can always be lower bounds. So, it is necessary to supply as tight bound as possible for any given algorithm. Mathematically, the notations for tight upper bound is small o and for tight lower bound is small omega, but we will ignore them for sake of this discussion)

Common measures for algorithms:

Above three notations are often replaced by the terms 'worst case', 'best case' and 'average case' complexity. But, more precise definitions are as follows: \begin{align} \text{Worst case}& T(n) = max \{ T(x) | \text{ x is an instance of size n} \} \\ \text{Best case}& T(n) = min \{ T(x) | \text{x is an instance of size n} \} \\ \text{Average case}& T(n) = \sum_{\text{input }x} Prob(x) T(x) \end{align} This means, to find average case complexity of algorithm, you need to find probability of a given input and use it to find expectation of T(n).

Pseudo code to Big-O:

Now, let's address the key question: given a pseudo code, how do we find its lower bound (or upper bound or ...). Here is one possible way to find those bounds:
  • First, express the algorithm in recursive form
  • Determine appropriate metric for input size 'n'
  • Then, find recurrence relation for the algorithm
  • Solve the recurrence relation using either Substitution method, recursion tree method or the master theorem

For example, the recurrence relations for popular algorithms are: \begin{align} \text{Binary search: } & T(n) & = T(\frac{n}{2}) + O(1) \\ \text{Sequential search: } & T(n) & = T(n-1) + O(1) \\ \text{Tree traversal: } & T(n) & = 2T(\frac{n}{2}) + O(1) \\ \text{Selection sort: } & T(n) & = T(n-1) + O(n) \\ \text{Merge sort: } & T(n) & = 2T(\frac{n}{2}) + O(n) \end{align} Try writing the pseudo code for any of the above algorithm and verify the recurrence relations.

Master theorem:

Master theorem can only solve the recurrence relation of the form: T(n) = a T(n/b) + f(n). Here is my modus operandi for master theorem: \begin{align} \text{If } f(n)=n^{log_{b}a}lg^{k+1}n, & \text{ then } T(n) = \Theta(n^{log_{b}a} lg^{k+1}n) \\ \text{Else if } n^{log_{b}a} \text{ is polynomially greater than } f(n), & \text{ then } T(n) = \Theta(n^{log_{b}a}) \\ \text{Else if } f(n) \text{ is polynomially greater than } n^{log_{b}a}, & \text{ then } T(n) = \Theta(f(n)) \\ \text{Else use Substitution method}& \end{align}
Here are solutions to two important recurrence relations, that cannot be solved using Master theorem: \begin{align} \text{Recurrence relation} & & \text{Complexity} \\ T(n) = T(n-1) + b n^k & => & O(n^{k+1})\\ T(n) = cT(n-1) + b n^k & => & O(c^n) \end{align}
For recurrence relation of other forms, refer to CLR for substitution and recursion tree method.
(In short, for substitution method, you guess the solution and use induction to find the constants and to prove that the guessed solution is correct.
Recursion tree method are often used to find the guess for the substitution method)

Some useful approximations:

For non-negative real b,c,d:
\begin{align} \text{Polynomial growth rate:}& \sum_{i=1}^n i^c &= \Theta(n^{c+1}) \\ \text{Logarithmic growth rate:}& \sum_{i=1}^n \dfrac{1}{i} &= \Theta(log(n)) \\ \text{Exponential growth rate:}& \sum_{i=1}^n c^i &= \Theta(c^n) \\ \text{n-Log(n) growth rate:}& \sum_{i=1}^n log(i)^c &= \Theta(n log(n)^c) \\ \text{Combination of above:}& \sum_{i=1}^n log(i)^c i^{d} &= \Theta(n^{d+1} log(n)^c) \\ &\sum_{i=1}^n log(i)^c i^d b^{i} &= \Theta(n^{d+1} log(n)^c b^{n}) \\ \text{Stirling's approximation:}& n! &= \sqrt(2 \pi n) (\frac{n}{e})^n (1 + \Theta(\frac{1}{n})) \end{align}