Wednesday, March 25, 2009

Language models for Information Retrieval

Although innumerable papers claim to use language models to represent documents for information retrieval, I am beginning to doubt the credibility of such research.

There are any number of toolkits and libraries for language modeling, for example MALLET, NLTK, etc. Similarly, there are any number of toolkits and libraries for information retrieval: Lemur, Dragon, etc. Yet none of the NLP toolkits offers functionality that lends itself to IR applications, and the same is true of the IR tools.

Either the tools developed in the published research are kept secret, or retrieval with language-model-based IR is so slow that it is useless for building commercial or open-source software and serves purely academic purposes.

Digging deeper into the retrieval procedure, both query-based and query-model-based, corroborates this claim. Simplified models like the unigram and vector-space models lend themselves to high-speed retrieval, because an inverted list can be used when queries are relatively short. On the other hand, if a language model of the query is to be built, it requires smoothing, and the model may not be an accurate representation to start with. In addition, the query representation produced by such a model may be much longer than the actual query. Last but not least, we must now compare the query representation against each and every document representation to compute a score. That was not the case with an inverted list, where the only documents considered were those containing the query words.
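To make the contrast concrete, here is a minimal sketch; the toy two-document corpus, the Dirichlet smoothing parameter MU, and whitespace tokenization are all my own illustrative assumptions. The inverted-index route only touches documents containing a query term, while the query-likelihood route must score every document model:

```python
import math
from collections import Counter, defaultdict

docs = {
    "d1": "language models for retrieval".split(),
    "d2": "inverted lists make retrieval fast".split(),
}

# --- Route 1: inverted index; only documents containing query terms are touched.
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for t in tokens:
        index[t].add(doc_id)

def candidates(query):
    return set().union(*(index.get(t, set()) for t in query))

# --- Route 2: unigram query likelihood with Dirichlet smoothing;
# every document model must be scored against the query.
collection = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(collection.values())
MU = 2000.0  # smoothing parameter; an arbitrary, commonly used default

def score(query, tokens):
    tf, dlen = Counter(tokens), len(tokens)
    return sum(
        math.log((tf[t] + MU * collection[t] / coll_len) / (dlen + MU))
        for t in query if collection[t] > 0
    )

query = "retrieval models".split()
print(candidates(query))                               # sparse: index lookup only
print({d: score(query, t) for d, t in docs.items()})   # dense: score all documents
```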

An indexing scheme over a topic representation of the corpus may speed up these computations. Alternatively, a reduced representation of documents obtained with NLP techniques, followed by a unigram model for retrieval, is another plausible avenue of investigation. This is the last resort if any NLP model is to be incorporated into IR.
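As a hedged illustration of that avenue, the sketch below reduces each document to a handful of topic IDs and builds an inverted list over topics, restoring the sparse-lookup behaviour; `topics_of` is a hypothetical placeholder standing in for a real trained topic model such as LDA:

```python
from collections import defaultdict

def topics_of(text):
    # Hypothetical placeholder: a real system would run a trained topic model here.
    return {hash(w) % 50 for w in text.split()}

corpus = {"d1": "language models for retrieval", "d2": "fast inverted lists"}

# Inverted list keyed by topic ID rather than by word.
topic_index = defaultdict(set)
for doc_id, text in corpus.items():
    for z in topics_of(text):
        topic_index[z].add(doc_id)

def retrieve(query):
    # Only documents sharing at least one topic with the query are considered;
    # a unigram model could then re-rank this small candidate set.
    return set().union(*(topic_index.get(z, set()) for z in topics_of(query)))

print(retrieve("retrieval models"))
```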

Confusions with model parameter estimation and sampling techniques

Lately, I have been digging deep into Bayesian parameter estimation and how it works with MCMC sampling techniques. Most tutorials leave a gap in explaining what role MCMC techniques play in parameter estimation, filtering, or prediction. Can the two be separated at all, at least from a theoretical point of view? With the advent of recursive Bayesian estimation, the thin line that separates them is getting smudged, if not erased.

Here I am trying to bring it all under one roof. The punch line is this:

"Bayesian Parameter estimation employs sampling and its only one of the steps in estimation.
When we are doing predictive distribution and filtering sampling becomes an important step in calculating the predicted and/or values and their corresponding distribution."

Some of the main techniques in parameter estimation are EM, MAP, variational EM, stochastic EM, and particle filters.

Sampling techniques include MCMC methods such as Gibbs sampling and Metropolis-Hastings (MH), importance sampling, and sequential importance sampling.
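As one concrete instance, here is a minimal Metropolis-Hastings sampler with a Gaussian random-walk proposal; the target density and step size are toy choices of mine, not anything from the literature discussed above:

```python
import math
import random

def target(x):
    # Unnormalised density of a standard normal; MH only needs it up to a constant.
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)  # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

draws = metropolis_hastings(10000)
print(sum(draws) / len(draws))  # should hover near 0, the target mean
```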

Sample -> estimate parameters -> sample -> estimate parameters -> ...

This cycle goes on until we have reached optimal parameter values.
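A toy rendering of that cycle, as stochastic EM on a two-component Gaussian mixture with fixed unit variances and equal weights; all data and initial values here are invented for illustration:

```python
import math
import random

# Synthetic data from two well-separated components.
data = ([random.gauss(-2, 1) for _ in range(200)]
        + [random.gauss(3, 1) for _ in range(200)])
mu = [-1.0, 1.0]  # arbitrary initial means

for _ in range(50):
    # Sample step: draw a component assignment for every point.
    assign = []
    for x in data:
        p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
        p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
        assign.append(0 if random.random() < p0 / (p0 + p1) else 1)
    # Estimate step: re-estimate each mean from the points assigned to it.
    for k in (0, 1):
        pts = [x for x, z in zip(data, assign) if z == k]
        if pts:
            mu[k] = sum(pts) / len(pts)

print(mu)  # should settle near the true means (-2, 3)
```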

Content-based filtering vs Information Retrieval

From the outside, it may seem that CBF and IR do essentially the same thing: given a set of documents (a corpus or database) and a user's request (or profile), suggest (retrieve) some other documents. However, there is one fundamental difference. In IR applications, the documents in the collection do not change, while the query is ad hoc and completely unpredictable. In CBF, the documents change all the time, but the profile, the kind of questions asked of the collection, is more or less stable.
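The difference shows up directly in the shape of the code. In this hedged sketch, a crude word-overlap score stands in for any real relevance model; IR indexes a fixed corpus and answers arbitrary queries, while CBF fixes a profile and scores whatever documents arrive:

```python
def relevance(words_a, words_b):
    # Deliberately crude stand-in for a real retrieval or filtering model.
    return len(set(words_a) & set(words_b))

# IR: static corpus, ad-hoc queries.
corpus = {"d1": "bayesian estimation".split(), "d2": "inverted index".split()}

def search(query):
    return max(corpus, key=lambda d: relevance(query.split(), corpus[d]))

# CBF: stable profile, a stream of ever-changing documents.
profile = "bayesian sampling estimation".split()

def filter_stream(stream):
    return [doc for doc in stream if relevance(profile, doc.split()) > 0]

print(search("bayesian sampling"))
print(filter_stream(["gibbs sampling tutorial", "index compression"]))
```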

Given this, it would be interesting to ask: which is the more relevant framing for semantic search over software corpora, CBF or IR?

For one, a software corpus is ever changing, with people adding new code and deleting or modifying old code. Secondly, a programmer's requests to a software corpus would inherently come from a limited pool of questions.

Need I say more?