Monday, December 21, 2009

Multiplication of matrices in log domain

Given two matrices A and B, whose entries are specified in log. i.e. the ith row and jth column has log(aij). If you take the exponent of these values the numbers may just disappear (exp(-200) is approximately 0) or explode (exp(200) is approximately infinity).

So multiplying them by first taking the exponent of their entries is out of question. How to efficiently perform matrix multiplication in such cases. There are two pieces to the jigsaw puzzle.

log(a*b) = log(a)+ log(b)
log(a+b) ~ max(log(a),log(b)) + k* exp(-abs(log(a)-log(b)))
k = 0.77

the second formula is called the "Chad's approximation formula"

How to use these? Crank at your computer a little

Friday, December 4, 2009

The reality about academia

Monday, November 23, 2009

First Acceptance of failure

I feel really let down. I feel like a wastrel. I have delayed my efforts. I feel self doubted.
I spent almost 8-9 months, on two papers (for which I was the second author) and coding in C/C++ for even proof of Idea. The result was, I ended up being a programmer :(

I could have achieved so much more. I can at least confess that the past 6 months have been a super unproductive time. I feel like I was a short-sighted amateur researcher, who could not have known better.

Hopefully the next 6 weeks, before new years will be at least useful and productive in terms of experimentation.

Here goes my nights of sleep and peace of mind.
I won't be able to rest till I see the light at the end of this tunnel.

Will I be able to meet SIGIR deadline? God only knows

Friday, October 23, 2009

Perl tutorial parsing SGML files using HTML parsers

I have not come across a useful tutorial for starters of HTML parsing using perl.

Step 1. Create a subclass

package MyParser;
use base qw(HTML::Parser);

Step 2. Create a list of global variables


you can add extra $origtext depending on where your text and attributes are embedded

Step 3. Define 3 functions that are going to override the three subs as follows

sub text{
($self, $origtext1, $is_cdata) = @_;

sub start {

($self, $tagname, $attr, $attrseq, $origtext2) = @_;
sub end {


Step 4. Each of the subs are nothing but event handlers. so if you want to detect any tag and anything in between use
if($tagname eq ''){

in each of the subs the start and end would probably set and reset a flag to detect start and end of that particular <'tag_to_detect'>

Step 5. Write the code that creates the object and calls these files.

my $fname = '';
my $parser = MyParser->new;

An Example file, an example code

file doc.sgml


Stock Exchange wall street traders dollar market financial crisis public private banks investment funds regulators corporation board owenership decline competitors share broker buyers seller american capital globalize money


Iraq Afghanistan Soldier battle field combat zone wounded amputees blinded veteran base lieutenant explosion deploy force troops american british pakistan strategy civilian security general war commanders operations helicopters allies military death european army taliban terrorist islam extremist NATO Karzai al aaqeda brigade sergeant base injury bombs killed violence fatalities


#!/usr/bin/perl -w
use strict;
package MyParser;
use base qw(HTML::Parser);
# two kinds of text in this document I want to extract
my $flagdocno=0; # tag for first text
my $flagtext=0; # tag for second text
# we use these three variables to count something

# here HTML::text/start/end are overridden
sub text {
($self, $origtext1, $is_cdata) = @_;
if($tagname eq 'docno'){
print "$origtext1->\t";
if($tagname eq 'doc'){
print "$origtext1->\t";
sub start {

($self, $tagname, $attr, $attrseq, $origtext2) = @_;
if($tagname eq 'docno'){
print "found topicno\t";
if($tagname eq 'doc'){
print "found text\t";
#print "$tagname\t";

sub end {
if($tagname eq 'docno'){
print "ending topicno\n";
if($tagname eq 'doc'){
print "ending text\n";

package main;

my $fname = '/media/windows/shivani/reserach/database/myown_test/topics.sgml';
my $parser = MyParser->new;

Hopefully it is useful to somebody who is a starter just like me

Tuesday, September 29, 2009

The story const member functions

C++ is strange in its own ways. This one beats the rest....

C++ allows a class to have const functions

the declaration would typically look like this

::() const

this function cannot modify the data members of the object of the class

But here is a workaround, call the data members "mutable" and get a back door entry to modifying the elements of the object

C++ programming gets as interesting as it can....

Thursday, August 20, 2009

Lemur Toolkit .... A shining example of bad programming practices

1. How to create a function that create memory and tell the caller to delete it (using comments in the code)
2. How to make member function "const" functions and to allow them to modify member attributes, call them "mutable"
3. How to use dynamic cast to see what object is coming in
4. How to contain a class within the other so as to "hide" some useful functionalities. For example, contain a list class and hide the "add" functionality

This list is bound to grow

Saturday, May 16, 2009

A sweet discovery, returning or passing pointers

IN C/c++ programs a common thing is to return values by address

double * return_doubleval()
double *a = new double
*a = 1;
return a;

as opposed to

void return_double(double *a)
if (a==NULL)
a = new double ;
*a =1;

because the second idea does not work
we need to send in double **a and do (*a) = new double if we want the pointer address also to be updated in the calling function

BTW, the idea of allocating memory and having the caller delete the memory is a bad bad idea
nevertheless, sometimes we dont know the length of the array we want to allocate and only the function that is called knows an answer to it, and this becomes mandatory

Wednesday, May 13, 2009

Model update based RF for text retrieval

Document Reformulation has been gaining much attention among researchers, especially because query reformulation does not help the underlying model learn and retrain from the user feedback.

One interesting paper that has caught my attention is the one by Rajapakse and Denham (Text Retrieval with more realistic concept matching and reinforcement learning). Not only did I find the work very refreshing and have amazing parallel to the standard mainstream topic modeling based information retrieval approaches, I found the following key contributions made by the paper:

1) If a model has to learn from user feedback and return with improved results it has to be real time. A real time document model has to be distributed and not centralized. Thus feedback regarding a document vis-a-vis a query affects that particular document model leaving others unaffected. This is a major change from the standardized topic modeling based approaches where the topic distributions are central to the entire corpus.
2) The model rewards and penalizes unlike only rewarding in most of the standard mainstream query reformulation techniques. Hence learning new things and ability to forget is exhibited by the model
3) The RF phase is also treated as training

Like most research several issues are left to be esoteric and they are listed as follows
1) No explanation is given on how exactly the concepts are extracted for each document and how the concept lattice is built
2) there are several parameters to tune, viz. weights to object-attribute pairs, learning rate of the weight parameter
3)The testing strategy is an independent one

Testing strategy
1) Divide the query set into two sets train and test
2) The training set is again divided into 4 subsets
3) Start training the document with 20 iterations of RF for each query in each subset.

The only big issue remains, on how the initial concept lattices are built and how crucial is their quality on performance in later stages of retrieval process.

Tuesday, April 21, 2009

Bayesian Unlearning

An interesting problem in Dynamic mixture model is to be able to see how we can accommodate unlearning and if it can be done by a 'particle-filtering' like on-line approach

Wednesday, March 25, 2009

Language models for Information Retrieval

Although there are innumerable number of papers that claim to use language models to represent document for information retrieval purposes, I am beginning to doubt the credibility of such research.

There are n number of toolkits and libraries developed for language modeling, example mallet,NLTK etc. Similarly there are n number of toolkits and libraries for Information retrieval lemur, dragon etc. None of the NLP toolkits have any functionality developed so as to yield itself to IR applications, the same is true with IR tools.

Either the tools developed by the research that is published is held secretive or the time taken for retrieval by language model based IR is so high that it renders itself useless, when it comes to created commercial/open source software and is meant purely for academic purpose.

Digging deeper into the retrieval procedure based on query and based on the query model, corroborates the claim made earlier. Simplified language models like unigram and vector space, render themselves useful for high speed retrieval applications, due to the use of inverted list when the query lengths are relatively short. On the other hand, if a language model for the query document is to be created, it requires smoothing and the model representation may not be accurate to start with. In addition, the query representation created by this model maybe longer than the actual query. Last but not the least, we now need to compare the query representation with each and every document representation and compute the score. This was not the case in inverted list, where the set of documents selected was only those that contained the query words.

An indexing scheme for the topic representation of the corpus may speed up computations. Or reduced representation of documents using NLP techniques, followed by unigram model for retrieval is another plausible venue of investigation. This is the last resort if any NLP model has to be incorporated for IR.

Confusions with model parameter estimation and Sampling techniques

Lately, I have been digging in deep into Bayesian parameter estimation and how it works with the MCMC sampling techniques. Most of the tutorials leave a gap open in their efforts on explaining what role do MCMC techniques play in parameter estimation or filtering or prediction? Can the two be separated at all from a theoretical point at least? With the advent of recursive bayesian estimation the thin line that separates the two is getting smudged or erased.

Here I am trying to bring it all under one roof. The punch line is this.

"Bayesian Parameter estimation employs sampling and its only one of the steps in estimation.
When we are doing predictive distribution and filtering sampling becomes an important step in calculating the predicted and/or values and their corresponding distribution."

Some of the main techniques in parameter estimation are the EM, MAP, variational EM, stochastic EM and particle filters etc

Sampling techniques involve MCMC, Gibbs, MH, Importance sampling, sequential importance sampling.

Sample->estimate parameters->sample->estimate parameters

This cycle goes on till we have reached optimal parameter values

Content based filtering vs Information Retreival

From the outside, it may seem that CBF and IR do essentially the same thing. Given a set of documents (corpus/database) and the user's request (or profile), suggest(retrieve) some other documents. However there is one fundamental difference. In case of IR applications, the documents in the collection do not change. The query or the user's request however is ad-hoc and completely unpredictable. In case of CBF, the documents change all the time, but the profile or kind of questions asked to the database is more or less stable.

Given this, it would be interesting to ask, what is more relevant as an application to semantic searches in software corpora? a CBF of IR?

For one, the software corpora is ever changing, with people adding new code, deleting or modifying old code. Secondly, the programmer's requests to a software corpora would be inherently come from a limited pool of questions.

Need I say more?