Although innumerable papers claim to use language models to represent documents for information retrieval, I am beginning to doubt the credibility of such research.
There are any number of toolkits and libraries for language modeling, for example MALLET and NLTK. Similarly, there are any number of toolkits and libraries for information retrieval, such as Lemur and Dragon. None of the NLP toolkits has functionality that lends itself to IR applications, and the reverse is true of the IR tools.
Either the tools developed in the published research are kept secret, or retrieval with language-model-based IR is so slow that it is useless for building commercial or open source software and serves purely academic purposes.
Digging deeper into how retrieval works, both directly from the query and from a query model, corroborates the claim made earlier. Simple models such as the unigram language model and the vector space model lend themselves to high-speed retrieval because they can use an inverted list when queries are relatively short. On the other hand, if a language model is to be built for the query, it requires smoothing, and the resulting representation may not be accurate to start with. In addition, the query representation created by this model may be longer than the actual query. Last but not least, we now need to compare the query representation with each and every document representation and compute a score. This was not the case with the inverted list, where the only documents selected were those that contained the query words.
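To make the contrast concrete, here is a minimal Python sketch, not taken from any of the toolkits mentioned above. The toy corpus, the function names, and the choice of Jelinek-Mercer smoothing with a weight of 0.5 are all assumptions made for illustration. The inverted list touches only documents that share a term with the query, while the query-model scorer has to visit every document.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "inverted index speeds up retrieval".split(),
    "d3": "topic models represent documents".split(),
}

# Inverted list: term -> set of ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, tokens in docs.items():
    for tok in tokens:
        inverted[tok].add(doc_id)


def candidates_from_index(query_terms):
    """Return only the documents that share at least one term with the query."""
    found = set()
    for term in query_terms:
        found |= inverted.get(term, set())
    return found


# Collection statistics used as the background model for smoothing.
collection = Counter(tok for tokens in docs.values() for tok in tokens)
collection_len = sum(collection.values())


def smoothed_prob(term, tokens, lam=0.5):
    """Jelinek-Mercer smoothing (an assumed choice): mix document and collection estimates."""
    p_doc = tokens.count(term) / len(tokens)
    p_col = collection[term] / collection_len
    return lam * p_doc + (1 - lam) * p_col


def score_all(query_model):
    """Score a weighted query model against every document in the corpus."""
    scores = {}
    for doc_id, tokens in docs.items():  # exhaustive scan, no index to prune with
        total = 0.0
        for term, weight in query_model.items():
            p = smoothed_prob(term, tokens)
            if p > 0:  # terms absent from the whole collection are skipped
                total += weight * math.log(p)
        scores[doc_id] = total
    return scores


short_query_hits = candidates_from_index(["retrieval"])
model_scores = score_all({"retrieval": 0.6, "index": 0.4})
```

With the short query, only d1 and d2 are ever looked at. With the query model, all three documents receive a score, and on a real corpus that scan grows with the number of documents rather than with the length of the posting lists.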
An indexing scheme for the topic representation of the corpus may speed up computation. Alternatively, reducing the representation of documents using NLP techniques and then applying a unigram model for retrieval is another plausible avenue of investigation. This is the last resort if any NLP model is to be incorporated into IR.
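One way such a topic index might look is sketched below in Python. This is only a guess at a possible scheme, not an established method: the per-document topic distributions are invented numbers standing in for the output of some topic model, and the names `top_topics`, `build_topic_index`, and `candidates` are hypothetical. Each document is indexed under its strongest topics, so a query is compared only against documents that share a dominant topic with it.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions (invented numbers),
# standing in for the output of a topic model.
doc_topics = {
    "d1": [0.70, 0.20, 0.10],
    "d2": [0.10, 0.80, 0.10],
    "d3": [0.05, 0.15, 0.80],
}


def top_topics(dist, k=1):
    """Indices of the k highest-probability topics in a distribution."""
    return sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)[:k]


def build_topic_index(topic_dists, k=1):
    """Map each topic id to the documents for which it is among the top k topics."""
    index = defaultdict(set)
    for doc_id, dist in topic_dists.items():
        for topic in top_topics(dist, k):
            index[topic].add(doc_id)
    return index


def candidates(query_dist, index, k=1):
    """Documents that share a dominant topic with the query."""
    found = set()
    for topic in top_topics(query_dist, k):
        found |= index.get(topic, set())
    return found


topic_index = build_topic_index(doc_topics)
narrow = candidates([0.10, 0.75, 0.15], topic_index)
wider = candidates([0.60, 0.30, 0.10], topic_index, k=2)
```

The trade-off is that this pruning is approximate: a relevant document whose dominant topics differ from the query's is never scored, so the number of topics kept per document and per query controls how much recall is given up for speed.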