
Tuesday, February 9, 2010
Targeting discrimination as opposed to similarity
Clustering methods for data that are expected to follow continuous distributions use an objective function to measure how good a clustering of the documents is. Such an objective function is usually a ratio of two quantities: inter-cluster distance to intra-cluster distance. The ratio measures the power of the clustering algorithm to bring similar data samples together and push dissimilar data samples apart. Fortunately, for Gaussian models this works out to a nice clean equation. No wonder people LOVE the Gaussian model.
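To make the ratio concrete, here is a minimal sketch in Python/NumPy (my choice purely for illustration; the data and the cluster labels below are made up): it computes between-cluster scatter over within-cluster scatter for a hard clustering.

import numpy as np

def separation_ratio(X, labels):
    """Ratio of between-cluster scatter to within-cluster scatter.
    Larger values mean tighter, better-separated clusters."""
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # between-cluster scatter: centroid distance from the global mean,
        # weighted by cluster size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # within-cluster scatter: spread of the members around their centroid
        within += np.sum((members - centroid) ** 2)
    return between / within

# toy example: two well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(separation_ratio(X, labels))   # large ratio => good separation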
When we are dealing with discrete data, the natural choice is the multinomial distribution, and Dirichlet priors can be added for more control over the behavior of the model. Document clustering approaches in the discrete domain merely use EM or Gibbs sampling to fit the data optimally to the discrete model (especially the latent mixture models). There is no underlying objective function that says this discrete (latent) model should also optimally separate one topic from another. Thus one could end up with two topics generating the exact same set of words.
This is definitely something that is lacking in the EM algorithms for fitting discrete data...
Maybe there has been work done on this already, but I have yet to discover it.
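To make the missing piece concrete, here is a minimal sketch in Python/NumPy (again purely illustrative): it measures how distinct two fitted topic-word distributions actually are, using the Jensen-Shannon divergence as one possible yardstick. Nothing in a likelihood-only EM or Gibbs fit penalizes this value being near zero, which is exactly the "two topics generating the same words" problem.

import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic-word distributions.
    0 means the topics generate words identically; log(2) is the maximum."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# two "topics" over a 5-word vocabulary that are essentially the same
topic_a = [0.40, 0.30, 0.15, 0.10, 0.05]
topic_b = [0.38, 0.32, 0.14, 0.11, 0.05]
print(js_divergence(topic_a, topic_b))  # ~0: the fitted model never notices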
Friday, January 29, 2010
TIME corpus... blah!!!
The TIME corpus is the most useless and "wrong" corpus there can ever be.
- A little analysis reveals that the corpus is built around two main topics, and the rest of the topics are under-represented.
- The number of words is really high compared to the number of documents, which is why the representation itself is very sparse.
- Because of this high ratio of (num. of words) to (num. of documents), it is impossible to train a topic model on this corpus and expect it to represent the corpus well.
- The qrels (relevance judgements) are INCORRECT. A visual exploration comparing words reveals no similarity (at least on a term-term basis) between the queries and the documents marked relevant for those queries.
Friday, January 8, 2010
The trouble with testing IRS
Testing an IRS can get so freaking hard. The vocabulary sizes can run up to a few tens of thousands, which means you are dealing with a very, very high-dimensional space. The term weights or term probabilities are extremely small; so small that you are dealing with values in log scale only. The relevance judgements for some queries are not reliable: the queries dig up documents that might seem relevant to you (as a programmer/user), but the standardized judgements have not tagged those documents. This can get increasingly frustrating, because you might be led to believe that there is a bug in your code. Changes made to the query or the document in response to relevance feedback are hard to interpret and understand when dealing with tens of thousands of words and tens of thousands of documents.
To circumvent this problem, it is essential to first create your own teeny tiny corpora of a few tens of documents and a few tens of words. If you wish to build topic models over your corpora, make sure that your document-term matrix is tall and thin (not sparse). If you are using the basic unigram and VSM kind of approaches, a short and stout (sparse) matrix might do the magic.
The relevance judgements should be created by you, by manual inspection. To start out, build a few topics with a few words that are very, very distinct even to a human :). Use these topics to generate documents using the LDA document generation model, as in the sketch below.
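Here is a sketch of that generation step in Python/NumPy (the two topics, the vocabulary, and the corpus sizes below are made-up toy values): draw a per-document topic mixture from a Dirichlet, pick a topic for each word position, then draw the word from that topic.

import numpy as np

rng = np.random.default_rng(42)

# two hand-built topics over a tiny, human-readable vocabulary
vocab = ["ball", "goal", "team", "stock", "market", "price"]
topics = np.array([
    [0.45, 0.35, 0.18, 0.005, 0.01, 0.005],   # a "sports" topic
    [0.005, 0.01, 0.005, 0.40, 0.30, 0.28],   # a "finance" topic
])
alpha = [0.5, 0.5]          # Dirichlet prior over per-document topic mixtures
n_docs, doc_len = 30, 20    # tall-and-thin: more documents than vocabulary terms

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                          # topic proportions for this document
    z = rng.choice(len(topics), size=doc_len, p=theta)    # topic assignment for each word
    words = [vocab[rng.choice(len(vocab), p=topics[k])] for k in z]
    docs.append(" ".join(words))

print(docs[0])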
This has proven useful to my experimentation and explorations and I hope the reader of this blog will find it useful too.
Monday, December 21, 2009
Multiplication of matrices in log domain
Given two matrices A and B whose entries are specified in the log domain, i.e., the entry in the i-th row and j-th column is log(a_ij). If you take the exponent of these values, the numbers may just disappear (exp(-200) is approximately 0) or explode (exp(200) is approximately infinity).
So multiplying them by first taking the exponent of their entries is out of the question. How does one efficiently perform matrix multiplication in such cases? There are two pieces to the jigsaw puzzle:
log(a*b) = log(a) + log(b)
log(a+b) ≈ max(log(a), log(b)) + k * exp(-|log(a) - log(b)|), with k ≈ 0.77
The second formula is called "Chad's approximation formula".
How to use these? Crank at your computer a little
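Here is a minimal sketch of that cranking, in Python/NumPy (my choice purely for illustration). Each entry of the product is a sum of products, so the product part uses the first identity and the sum part is a log-sum-exp over k. This sketch uses the exact log-sum-exp (pull out the max, then take the log of the remaining sum) rather than the k = 0.77 approximation above; swapping the approximation in is a one-line change.

import numpy as np

def log_matmul(logA, logB):
    """Compute log(A @ B) given log(A) and log(B), without exponentiating
    the raw entries. Entry (i, j) is log(sum_k a_ik * b_kj), which in log
    space is a log-sum-exp over log(a_ik) + log(b_kj)."""
    # pairwise sums log(a_ik) + log(b_kj), shaped (rows of A, K, cols of B)
    pair = logA[:, :, None] + logB[None, :, :]
    # exact log-sum-exp over k, done stably by pulling out the maximum term
    m = pair.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(pair - m).sum(axis=1, keepdims=True))).squeeze(axis=1)

# sanity check with entries far too small to exponentiate directly
logA = np.full((3, 4), -500.0)
logB = np.full((4, 2), -500.0)
print(log_matmul(logA, logB))    # about -1000 + log(4), with no underflow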
Friday, December 4, 2009
The reality about academia
http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2009_11_13/caredit.a0900141
Monday, November 23, 2009
First Acceptance of failure
I feel really let down. I feel like a wastrel. I have delayed my efforts. I am full of self-doubt.
I spent almost 8-9 months on two papers (for which I was the second author) and on coding in C/C++ for even a proof of idea. The result was that I ended up being a programmer :(
I could have achieved so much more. I can at least confess that the past 6 months have been a super unproductive time. I feel like I was a short-sighted amateur researcher who could not have known better.
Hopefully the next 6 weeks, before New Year's, will at least be useful and productive in terms of experimentation.
Here goes my nights of sleep and peace of mind.
I won't be able to rest till I see the light at the end of this tunnel.
Will I be able to meet the SIGIR deadline? God only knows.