Monday, January 02, 2012

C++ workflow

Recently I had a memory bug that took me 2 days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to completely random place that led me to believe STL's stringstream was the culprit. Since there was no way I could have definitively say it's not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of code using a less elegant solution using arrays. Now gdb started pointing to STL's vector.clear() method and believe it or not I was using std::string to store in it, not some user defined class …. I had it … there is no way in the world I am going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector < str::string >). But, not knowing the bug in a 'single threaded' 'command-line based' C++ code with access to gdb and pretty printing was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me, due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is it is one of the few memory bugs that valgrind does not report in its final assessment and my best guess is that I was writing over vector's header location (and vector only complains about that during its next operation (in my case was clear()) ... which for all intensive purposes is random).

Embarrassed by this stupid mistake that wasted my two days, I felt humbled to revisit my C++ work-flow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your makefile that does static analysis (make my_proj).
- Test it on local machines using 'make my_test'
- Run valgrind if necessary and check against below given valgrind checklist
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

  1. Never ever start coding without an SVN (or cvs or git, etc.). It will save you lot of headaches if your machine crashes or if you want to revert back to a previous version.
    If the project has multiple coders/testers/collaborators, it helps if you have some kind of project-management tool like trac or jira. Also, if you are going to release your code to public, you might want to consider project hosting sites like Google Code or source forge.
    Though this might seem obvious, don't host the svn server on personal machine or at least make sure to take regular backups if you do.
  2. Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc) and a user-friendly README file. (Advance coders can consider automate). Here is a sample Makefile of my current project:
     
    CC = g++
    DEBUGFLAGS = -g -Wall -Wextra
    MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
    LIBS = -lgsl -lgslcblas -lm -lstdc++
    
    HEADER = Helper.h MathHelper.h LDA.h Configuration.h
    SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
    OUTPUT = spoken_web
    
    spoken_web:     $(SRC) $(HEADER)
            $(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)
    
    doc:    $(HEADER) spokenweb_doxygen_configuration
            doxygen spokenweb_doxygen_configuration 
    
    r_data: GenerateData.R
            R --slave < GenerateData.R
    
    style: $(SRC) $(HEADER)
            astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)
    
    clean:
            rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt *.data
    
  3. Use a document generator like doxygen. It forces you to write much cleaner documentation and more importantly cleaner interface. Trust me it will help if you had to reuse your code in future.
    make doc
    
  4. Use code beautifier like astyle that not only shields you against inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading' (Don't you clean gym equipment after workout, well, why not do the same with your code). Here is sample style rules that I love:
     
    --style=java
    --indent=spaces=2
    --indent-classes
    --indent-switches
    --indent-preprocessor
    --indent-col1-comments
    --min-conditional-indent=0
    --max-instatement-indent=80
    --unpad-paren
    --break-closing-brackets
    --add-brackets
    --keep-one-line-blocks
    --keep-one-line-statements
    --convert-tabs
    --align-pointer=type
    --align-reference=type
    --lineend=linux
    --suffix=none
    
    If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
    make style
    
  5. Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
    • g++'s flags (See DEBUGFLAGS in my makefile)
    • Static analysis tools like cppcheck (For more cases, I am satisfied with the above g++'s flags)
    • Runtime memory analyzer tool like valgrind. (Please see below given Valgrind checklist, and also my previous blogpost).
    • gdb: For stepping through the code for more comprehensive debugging
    There are several paid tools that do much better job of analyzing your code that those mentioned above. There are 
  6. I like to follow standard directory structure:
    ./           Makefile and configure scripts
    ./src       cc , h files (Use subdirectories and namespaces if necessary)
    ./bin      build directory
    ./data    data required for testing/running the code
    ./test      test drivers
    ./doc     documentation generated by doxygen
  7. Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for bayesian programs, which has very ugly interface and requires some amount of book-keeping. So, I have coded a wrapper class that gives an R like interface which makes my code extremely readable, less error prone and I don't need to worry about any book-keeping or other numerical issues while calling them.
     
    /// \defgroup StatDistribution R-like API for family of probability distributions
    /// Naming convention:
    /// I have followed the naming conventions of R programming language. 
    /// Please see its documentation for further details
    ///
    /// Here is a brief intro:
    /// Every distribution that R handles has four functions. There is a root name, 
    /// for example, the root name for the normal distribution is norm. 
    /// This root is prefixed by one of the letters
    ///   - d for "density", the density function
    ///   - r for "random", a random variable having the specified distribution
    /// Example: For the normal distribution, these functions are dnorm, and rnorm.
    ///
    /// Important note: Since most bayesian calculation is done in log space, (almost) 
    /// every density function is capable of outputing 'numerically robust' logarithmic value. 
    /// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
    ///
    /// Suported continuous distribution:
    /// - Uniform (unif - r,d)
    /// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
    /// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
    /// - Beta (beta - r,d)
    /// - Inverse Gamma (invgamma - r,d)
    ///
    /// Suported discrete distribution:
    /// - Discrete Uniform (dunif - r,d)
    /// - Dirichlet (dirichlet - r,d)
    /// - Multinomial (multinom - r,d)
    /// - Poisson (pois - r,d)
    /// @{
    
    /// @name Continuous Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max)
    double runif(double min, double max);
    double dunif(double x, double min, double max, bool returnLog = false);
    ///@}
    
    /// @name Discrete Uniform Distribution
    ///@{
    /// Returns uniform variable [min, max) --> i.e. [min, max-1]
    int rdunif(int min, int max);
    double ddunif(int x, int min, int max, bool returnLog = false);
    ///@}
    
    /// @name Univariate Normal Distribution
    ///@{
    double rnorm(double mean, double sd);
    double dnorm(double x, double mean, double sd, bool returnLog = false);
    ///@}
    
    /// @name Multinomial distribution
    /// Both prob and retVal should be of length sizeOfProb
    /// size, say N, specifying the total number of objects that are put into 
    /// K (or sizeOfProb) boxes in the typical multinomial experiment.
    /// If the array prob is not normalized then its entries will be treated as 
    /// weights and normalized appropriately. Furthermore, this function also allows 
    /// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
    /// Eg: double prob[3] = {0.1, 0.2, 0.7}; double retVal[3]; rmultinom(1, 3, prob, retVal, false);
    ///@{
    void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
    double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
    ///@}
    
    So to sample N(10,2), instead of having code like:
     
    gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);
    double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;
    gsl_rng_free(rng_obj);
    
    I have to use following code (which if you have coded in R is more readable):
     
    StatisticsHelper myStatHelper;
    double sampledVal = myStatHelper.rnorm(10, 2);

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:
  1. The heap summary (only returns memory leaks, not memory corruption errors) Eg:
     
    ==19691== HEAP SUMMARY:
    ==19691==     in use at exit: 0 bytes in 0 blocks
    ==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
    ==19691== 
    ==19691== All heap blocks were freed -- no leaks are possible
    
    If number of allocs and the number of frees will differ in o/p, the try
    valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

    Steps to avoid memory leaks:
    - If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
    - Delete every object created by new (especially in case of exceptions).
  2. Intermediate warning/error messages irrespective of the heap summary. Eg:
     
    ==15633== Invalid write of size 4
    ==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
    ==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
    ==15633==    by 0x402E26: main (main.cpp:34)
    
    Note, sometime one error can propagate and throw error messages at different places, so don't panic.

    The intermediate error messages can detect:
    • Illegal read / Illegal write errors (like the message 'Invalid read of size 4' above): It can mean any of the memory corruptions given below. If the error message is not informative, run valgrind again with '--read-var-info=yes' flag.
      Common memory corruption bugs like these are:
      - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
      - Using an unallocated object
       
      my_class* obj_ptr;
      obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
      
      - Using a freed object
      Note: Memcheck (tool of valgrind) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there is no memory corruption errors.
    • Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
    • Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
    • Illegal free/deletions: (Invalid free())
      - Calling free() (or delete) on the object that is already freed (or deleted).
      - Deleting object that is created on stack.
    • Mismatched operators: (Mismatched free() / delete / delete [])
      - Calling free() on the object that is created by new operator.
      - Calling delete on the object that is created by malloc().
      - Calling delete[] on the object that is created by the new operator.
    • Overlapping source and destination blocks: (Source and destination overlap in ...)
      - Look out functions like memcpy, strcpy, strncpy, strcat and strncat.

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

No comments: