Niketan Pansare's blog: January 2012

I recently started a pet project, Iris¹, while doing background study for a recent paper.

Voice commands/response:
Like Siri, it takes user commands and can respond using either Voice or Text.

Portable:
Since it is written in Java, it can run on most operating systems, including Mac, Windows and Linux. I have only tested it on Mac OS X Lion, so I welcome beta testers on other operating systems. For now, some features, such as playing iTunes and opening email, are only available on Mac. I plan to add these features on other operating systems as well.

Configurable:
Most features, such as whether VOICE or TEXT is used as the feedback mechanism and the feedback messages themselves, are configurable through the configuration file "user_scripts/iris.config". You can also modify any of the predefined AppleScripts (e.g. playSong.scpt) in that folder, which Iris calls on the corresponding voice command. For advanced users, there are also ten function commands that can be used to perform user-specific tasks.

Easy to install and use:
All you need to do is download and extract the binary file and run the following command:

 
java -Xmx200m -jar bin/Iris.jar

The above command asks Java to run the program Iris.jar (in the bin folder) and caps the Java heap at 200 MB.

Lightweight:
The whole setup takes about 20MB.

Client-side:
The speech-to-text is done on your laptop/desktop and not on some remote server. This means you can use it even if you have no internet connection. It also means that there is no possibility of some big brother monitoring your voice commands.

Open source:
You are free to download and modify the source. As I am working towards a paper deadline, I am looking for collaborators who could help me with the following features:

User-friendly: Improving voice commands
More accuracy: Expanding dictionary to include more words and also to add different 'user accents' for a given word.
Expanding grammar to include arbitrary spoken words.
AppleScripts for accessing contacts and recognizing contact names, to assist users with various functionality.
Adding a GUI rather than the command line. Since all the responses go through OutputUtility.java, this is going to be easy.
New features: Support for RSS, Top News, Weather, Popular city names, Dictionary/Thesaurus, Web/Wikipedia search, Dictation (to text/messages/events/todo item/alarm/reminders), Jokes, Motivational quotes, More Personalization.

If you have suggestions for improvement or want to collaborate with me, email me at 'NiketanPansare AT GMail'.

Iris uses CMU's Sphinx4 for recognizing speech and FreeTTS for text-to-speech conversion.

References:
¹ Iris is the name of my friend's dog (well, no pun intended when I use the term 'pet project'). It is also an anagram of the word 'Siri' (Apple's popular software for the iPhone 4S).

Recently I had a memory bug that took me 2 days to figure out. Valgrind gave no error message (well, to be honest, it did give out a warning that got lost in my log messages). gdb pointed to a completely random place, which led me to believe STL's stringstream was the culprit. Since there was no way I could definitively say it was not the cause, I couldn't rule out that possibility.

My impatience got the best of me and I foolishly rewrote that part of the code with a less elegant solution using arrays. Now gdb started pointing to STL's vector.clear() method and, believe it or not, I was storing std::string in it, not some user-defined class …. I had had it … there was no way in the world I was going to believe that gdb was right (thanks to my confidence in Effective C++ and std::vector<std::string>). But not knowing the bug in a 'single threaded', 'command-line based' C++ code with access to gdb and pretty printing was driving me crazy. My previous blogpost didn't help either :(

So, on the same evening, armed with my music player and a hot cup of tea, I decided to ignore the bug and beautify my code with better comments … geek alert !!! That's when it hit me: due to my recent modification, I was writing past the array bounds in a for loop. The funny thing is that this is one of the few memory bugs that valgrind does not report in its final assessment. My best guess is that I was writing over the vector's header location, and the vector only complains about that during its next operation (in my case, clear()), which for all intents and purposes is random.
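To make this kind of bug concrete, here is a minimal sketch (not my actual code). It assumes an off-by-one loop bound, and the function name fillWithOffByOne is made up for this illustration. Going through std::vector::at() instead of a raw array means the bad index is reported at the faulty line rather than at some later, unrelated operation:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: fills buf with 0, 1, 2, ... using a deliberately wrong
// loop bound (<= instead of <). Every access goes through at(), which checks
// the index. Returns true if the out-of-bounds write was caught.
bool fillWithOffByOne(std::vector<int>& buf) {
  try {
    for (std::size_t i = 0; i <= buf.size(); ++i) {  // bug: should be i < buf.size()
      buf.at(i) = static_cast<int>(i);  // throws std::out_of_range when i == buf.size()
    }
  } catch (const std::out_of_range&) {
    return true;  // the bad write is reported where it happens
  }
  return false;
}
```

With a raw array or operator[], the same loop would silently overwrite whatever sits next to the buffer, which is the behaviour that cost me two days.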

Embarrassed by this stupid mistake that wasted two days, I felt humbled enough to revisit my C++ workflow.

So, here is a sample workflow:
- svn update
- Code using your favorite editor (Make sure you configure it to conform to your style rules).
- Set DEBUGFLAGS in your Makefile so that the compiler does static analysis on every build (make my_proj).
- Test it on local machines using 'make my_test'.
- Run valgrind if necessary and check against the Valgrind checklist given below.
- make style
- make doc
- svn update
- svn commit -m "some meaningful message"

The key principle is:

Automate whenever possible (with emphasis on reusability)

Never ever start coding without a version control system such as SVN (or cvs or git, etc.). It will save you a lot of headaches if your machine crashes or if you want to revert to a previous version.
If the project has multiple coders/testers/collaborators, it helps to have some kind of project-management tool like trac or jira. Also, if you are going to release your code to the public, you might want to consider project hosting sites like Google Code or SourceForge.
Though this might seem obvious, don't host the svn server on a personal machine, or at least make sure to take regular backups if you do.

Use a Makefile that takes care of all necessary tasks (compilation, cleaning, data generation, document generation, code beautification, etc.) and a user-friendly README file. (Advanced coders can consider automating this further.) Here is a sample Makefile from my current project:

 
# Compiler and flags (-Wall -Wextra turn on the compiler's static warnings).
CC = g++
DEBUGFLAGS = -g -Wall -Wextra
# Not used by the targets below.
MINUITFLAGS = `root-config --cflags` `root-config --libs` -lMinuit2 -fopenmp
LIBS = -lgsl -lgslcblas -lm -lstdc++

HEADER = Helper.h MathHelper.h LDA.h Configuration.h
SRC = main.cpp MathHelper.cpp Helper.cpp LDA.cpp
OUTPUT = spoken_web

# These targets do not produce files of the same name.
.PHONY: doc r_data style clean

# Note: every recipe line must start with a tab character.
spoken_web:	$(SRC) $(HEADER)
	$(CC) $(SRC) -o $(OUTPUT) $(DEBUGFLAGS) $(LIBS)

doc:	$(HEADER) spokenweb_doxygen_configuration
	doxygen spokenweb_doxygen_configuration

r_data:	GenerateData.R
	R --slave < GenerateData.R

style:	$(SRC) $(HEADER)
	astyle --options=spokenweb_astyle_rules $(SRC) $(HEADER)

clean:
	rm -f *.o $(OUTPUT) *~ *.data parameters.txt metaFile.txt

Use a document generator like doxygen. It forces you to write much cleaner documentation and, more importantly, a cleaner interface. Trust me, it will help if you have to reuse your code in the future.
```
make doc
```
Use a code beautifier like astyle. It not only shields you against an inelegant colleague/coder, but also personalizes the code if you like to do 'Code Reading' (you clean gym equipment after a workout, so why not do the same with your code?). Here are some sample style rules that I love:
```
 
--style=java
--indent=spaces=2
--indent-classes
--indent-switches
--indent-preprocessor
--indent-col1-comments
--min-conditional-indent=0
--max-instatement-indent=80
--unpad-paren
--break-closing-brackets
--add-brackets
--keep-one-line-blocks
--keep-one-line-statements
--convert-tabs
--align-pointer=type
--align-reference=type
--lineend=linux
--suffix=none
```
If your company follows other style guidelines or if you don't want to change existing formatting, remove '--suffix=none' from the style rules and once you are done, you can restore the original unformatted files (*.orig).
```
make style
```
Familiarize yourself with program analyzers and debugging tools, and have scripts that automate them in your coding process:
- g++'s flags (See DEBUGFLAGS in my makefile)
- Static analysis tools like cppcheck (In most cases, I am satisfied with the above g++ flags).
- Runtime memory analyzer tools like valgrind. (Please see the Valgrind checklist given below, and also my previous blogpost).
- gdb: For stepping through the code for more comprehensive debugging.
There are several paid tools that do a much better job of analyzing your code than those mentioned above.
I like to follow a standard directory structure:
./       Makefile and configure scripts
./src    .cc and .h files (use subdirectories and namespaces if necessary)
./bin    build directory
./data   data required for testing/running the code
./test   test drivers
./doc    documentation generated by doxygen

Maintain a list of code snippets (stored in Evernote or a personal notebook) or helper classes that you might have to use again. Note that the helper classes should have a good interface and must be well documented (and robust). For example, I use GSL for writing sampling functions for Bayesian programs. GSL has a very ugly interface and requires some amount of book-keeping, so I have coded a wrapper class that gives an R-like interface. This makes my code extremely readable and less error-prone, and I don't need to worry about any book-keeping or other numerical issues while calling the functions.

 
/// \defgroup StatDistribution R-like API for family of probability distributions
/// Naming convention:
/// I have followed the naming conventions of R programming language. 
/// Please see its documentation for further details
///
/// Here is a brief intro:
/// Every distribution that R handles has four functions. There is a root name, 
/// for example, the root name for the normal distribution is norm. 
/// This root is prefixed by one of four letters; this API provides two of them:
///   - d for "density", the density function
///   - r for "random", a random variable having the specified distribution
/// Example: For the normal distribution, these functions are dnorm and rnorm.
///
/// Important note: Since most Bayesian calculation is done in log space, (almost) 
/// every density function is capable of outputting a 'numerically robust' logarithmic value. 
/// Hence, use dnorm(x, mean, sd, true) in your code rather than log(dnorm(x, mean, sd))
///
/// Supported continuous distributions:
/// - Uniform (unif - r,d)
/// - Univariate Normal (norm - r, d); Truncated normal (tnorm - r,d);
/// - Multivariate Normal (mnorm - r, d, conditional_r, conditional_rt) (--> Only m = 2, 3 supported)
/// - Beta (beta - r,d)
/// - Inverse Gamma (invgamma - r,d)
/// - Dirichlet (dirichlet - r,d)
///
/// Supported discrete distributions:
/// - Discrete Uniform (dunif - r,d)
/// - Multinomial (multinom - r,d)
/// - Poisson (pois - r,d)
/// @{

/// @name Continuous Uniform Distribution
///@{
/// Returns uniform variable [min, max)
double runif(double min, double max);
double dunif(double x, double min, double max, bool returnLog = false);
///@}

/// @name Discrete Uniform Distribution
///@{
/// Returns uniform variable [min, max) --> i.e. [min, max-1]
int rdunif(int min, int max);
double ddunif(int x, int min, int max, bool returnLog = false);
///@}

/// @name Univariate Normal Distribution
///@{
double rnorm(double mean, double sd);
double dnorm(double x, double mean, double sd, bool returnLog = false);
///@}

/// @name Multinomial distribution
/// Both prob and retVal should be of length sizeOfProb
/// size, say N, specifying the total number of objects that are put into 
/// K (or sizeOfProb) boxes in the typical multinomial experiment.
/// If the array prob is not normalized then its entries will be treated as 
/// weights and normalized appropriately. Furthermore, this function also allows 
/// you to provide log probabilities (in which case, you need to set isProbInLogSpace to true)
/// Eg: double prob[3] = {0.1, 0.2, 0.7}; unsigned int retVal[3]; rmultinom(1, 3, prob, retVal, false);
///@{
void rmultinom(int size, int sizeOfProb, double* prob, unsigned int* retVal, bool isProbInLogSpace = false);
double dmultinom(int sizeOfProb, double* prob, unsigned int* x, bool isProbInLogSpace = false, bool returnLog = false);
///@}
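
The advice about log space in the comments above is easier to appreciate with numbers. Below is a minimal sketch of a normal density with a returnLog switch. It is written as a free function for illustration and is not the actual wrapper code; only the signature is taken from the header above.

```cpp
#include <cassert>
#include <cmath>

// Sketch only: normal density with an R-style returnLog switch.
// The log density is computed directly, so it stays finite even when the
// density itself underflows to zero.
double dnorm(double x, double mean, double sd, bool returnLog = false) {
  assert(sd > 0.0);  // the standard deviation must be positive
  const double kHalfLogTwoPi = 0.9189385332046727;  // 0.5 * log(2 * pi)
  const double z = (x - mean) / sd;
  const double logDensity = -kHalfLogTwoPi - std::log(sd) - 0.5 * z * z;
  return returnLog ? logDensity : std::exp(logDensity);
}
```

For a point far in the tail, say x = 40 under a standard normal, dnorm(40, 0, 1) underflows to 0, so log(dnorm(40, 0, 1)) is minus infinity, whereas dnorm(40, 0, 1, true) returns a finite value of roughly -800.9.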

So to sample N(10,2), instead of having code like:

 
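#include <gsl/gsl_rng.h>      // gsl_rng_alloc, gsl_rng_free
#include <gsl/gsl_randist.h>  // gsl_ran_gaussian

gsl_rng* rng_obj = gsl_rng_alloc(gsl_rng_mt19937);      // allocate a Mersenne Twister generator
double sampledVal = gsl_ran_gaussian(rng_obj, 2) + 10;  // zero-mean draw with sd 2, shifted to mean 10
gsl_rng_free(rng_obj);                                  // the generator must be freed by hand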

I only have to write the following code (which, if you have coded in R, is more readable):

 
StatisticsHelper myStatHelper;
double sampledVal = myStatHelper.rnorm(10, 2);
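
For readers who want to try the idea without installing GSL, here is a rough sketch of such a wrapper. Note the assumptions: it uses the C++11 <random> header in place of GSL, the seed parameter and its default are invented for this illustration, and it covers only two of the functions listed above. It is not my actual StatisticsHelper.

```cpp
#include <random>

// Illustrative sketch only: built on <random> instead of GSL.
class StatisticsHelper {
 public:
  // The seed parameter is an assumption made for this sketch.
  explicit StatisticsHelper(unsigned int seed = 42u) : rng_(seed) {}

  // Draws one value from a normal distribution with the given mean and
  // standard deviation, mirroring R's rnorm(1, mean, sd).
  double rnorm(double mean, double sd) {
    std::normal_distribution<double> dist(mean, sd);
    return dist(rng_);
  }

  // Draws one value uniformly from [min, max).
  double runif(double min, double max) {
    std::uniform_real_distribution<double> dist(min, max);
    return dist(rng_);
  }

 private:
  std::mt19937 rng_;  // owned by the helper, so callers never allocate or free it
};
```

The point of the design is that the generator lives inside the helper, so the book-keeping (allocation, seeding, freeing) happens in exactly one place.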

Valgrind checklist:

First run valgrind's memcheck tool (valgrind --tool=memcheck program_name) and check:

The heap summary (only returns memory leaks, not memory corruption errors) Eg:
```
 
==19691== HEAP SUMMARY:
==19691==     in use at exit: 0 bytes in 0 blocks
==19691==   total heap usage: 12,126 allocs, 12,126 frees, 938,522 bytes allocated
==19691== 
==19691== All heap blocks were freed -- no leaks are possible
```
If the number of allocs and the number of frees differ in the output, then try
valgrind --tool=memcheck --leak-check=yes --show-reachable=yes program_name

Steps to avoid memory leaks:
- If you are assigning pointers to other pointers, be very explicit as to who is responsible for the deletion of the memory.
- Delete every object created by new (especially in case of exceptions).
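
Both steps above can be handled by letting a smart pointer own the object. The following is a minimal sketch assuming a C++11 compiler; the Tracked type and the processOrThrow function are hypothetical names used only for this illustration.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical type used only for this illustration; it counts live instances.
struct Tracked {
  static int liveCount;
  Tracked() { ++liveCount; }
  ~Tracked() { --liveCount; }
};
int Tracked::liveCount = 0;

// The unique_ptr is the single, explicit owner of the object. Its destructor
// deletes the object when the function exits, whether by return or by exception.
void processOrThrow(bool fail) {
  std::unique_ptr<Tracked> owner(new Tracked());
  if (fail) {
    throw std::runtime_error("simulated failure");
  }
}
```

After calling processOrThrow(true) and catching the exception, Tracked::liveCount is back to 0, which a raw new without a matching delete on the exception path would not guarantee.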
The intermediate warning/error messages, irrespective of the heap summary. Eg:
```
 
==15633== Invalid write of size 4
==15633==    at 0x4163F3: LDAGibbsSampler::FillInSpeechToTextOutputs() (LDA.cpp:206)
==15633==    by 0x41565A: LDAGibbsSampler::LDAGibbsSampler(int, int, double, double, std::string, double, double, int) (LDA.cpp:28)
==15633==    by 0x402E26: main (main.cpp:34)
```
Note: sometimes one error can propagate and throw error messages at different places, so don't panic.

The intermediate error messages can detect:
- Invalid read / Invalid write errors (like the message 'Invalid write of size 4' above): These can indicate any of the memory corruptions listed below. If the error message is not informative, run valgrind again with the '--read-var-info=yes' flag.
  Common memory corruption bugs of this kind are:
  - Writing/Accessing out of bounds memory (either directly or by array oriented functions like strcpy)
  - Using an unallocated object
```
 
my_class* obj_ptr;
obj_ptr -> var1 = val; // or val = obj_ptr -> var1;
```
  - Using a freed object
  Note: Memcheck (a valgrind tool) does not perform bounds checking for stack or global arrays, so even if you don't get this error/warning message, it does not mean that there are no memory corruption errors.
- Use of uninitialised values (either on stack or heap) in user function: Message looks like 'Conditional jump or move depends on uninitialised value(s)'. To see information on the sources of uninitialised data in your program, use the '--track-origins=yes' option.
- Use of uninitialised values in system calls: Message looks like 'Syscall param write(buf) points to uninitialised byte(s)'.
- Illegal free/deletions: (Invalid free())
  - Calling free() (or delete) on the object that is already freed (or deleted).
  - Deleting object that is created on stack.
- Mismatched operators: (Mismatched free() / delete / delete [])
  - Calling free() on the object that is created by new operator.
  - Calling delete on the object that is created by malloc().
  - Calling delete[] on the object that is created by the (non-array) new operator.
- Overlapping source and destination blocks: (Source and destination overlap in ...)
  - Look out for functions like memcpy, strcpy, strncpy, strcat and strncat.
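
As an example of the last item, here is a small sketch of an in-place shift where the source and destination ranges overlap. The helper name shiftRight is hypothetical. Using memcpy on overlapping ranges is undefined behaviour and is what Memcheck warns about; memmove is the standard function that is specified to handle overlap.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: shifts the contents of text right by shift characters
// in place, keeping the length fixed. The first shift characters are left
// unchanged. Source and destination overlap, so std::memmove is used.
std::string shiftRight(std::string text, std::size_t shift) {
  if (shift == 0 || shift >= text.size()) {
    return text;  // nothing to move
  }
  std::memmove(&text[shift], &text[0], text.size() - shift);
  return text;
}
```

For instance, shiftRight("abcdef", 2) copies "abcd" two positions to the right and yields "ababcd".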

References:
- http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
- http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
- http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html


Tuesday, January 17, 2012

Iris - siri like software

Monday, January 02, 2012

C++ workflow
