My Research Diaries: TIME corpus... blah!!!

Friday, January 29, 2010

The TIME corpus is about the most useless and "wrong" corpus there can be.

A little analysis of the corpus reveals that it is built around two main topics, and the rest of the topics are under-represented.
The number of words is very high compared to the number of documents, which is why the representation itself is very sparse.
Because of the high ratio of (number of words) to (number of documents), it is practically impossible to train a topic model on this corpus and expect it to represent the corpus well.
The qrels (relevance judgements) are INCORRECT. A visual comparison of the words reveals no similarity (at least on a term-by-term basis) between the queries and the documents marked relevant to those queries.
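
For anyone who wants to repeat this kind of check, here is a rough sketch in Python. It is not the analysis I actually ran, just one way the points above could be measured. It assumes that "number of words" means the number of distinct terms, that the corpus has already been loaded into plain dictionaries, and it uses Jaccard overlap as a stand-in for "term-to-term similarity". The function names (`corpus_stats`, `qrel_overlap`) and the toy data are made up for illustration, and no stop-word removal or stemming is done.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic terms."""
    return set(re.findall(r"[a-z]+", text.lower()))


def corpus_stats(docs):
    """docs: dict mapping doc id -> raw text (hypothetical layout)."""
    doc_terms = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    vocab = set()
    for terms in doc_terms.values():
        vocab |= terms
    n_docs = len(doc_terms)
    # Non-zero cells of a binary document-term matrix.
    nonzero = sum(len(terms) for terms in doc_terms.values())
    cells = n_docs * len(vocab)
    return {
        "n_docs": n_docs,
        "vocab_size": len(vocab),
        "words_per_doc_ratio": len(vocab) / n_docs if n_docs else 0.0,
        "sparsity": 1.0 - nonzero / cells if cells else 0.0,
    }


def qrel_overlap(queries, docs, qrels):
    """Jaccard term overlap for every (query, judged-relevant doc) pair.

    queries: dict query id -> text
    qrels:   dict query id -> list of doc ids marked relevant
    """
    overlap = {}
    for query_id, doc_ids in qrels.items():
        q_terms = tokenize(queries[query_id])
        for doc_id in doc_ids:
            d_terms = tokenize(docs[doc_id])
            union = q_terms | d_terms
            shared = q_terms & d_terms
            overlap[(query_id, doc_id)] = len(shared) / len(union) if union else 0.0
    return overlap


# Toy data, NOT taken from the TIME corpus.
docs = {
    "d1": "the president met the senate",
    "d2": "rain fell on the coast",
}
queries = {"q1": "president senate vote"}
qrels = {"q1": ["d1", "d2"]}

stats = corpus_stats(docs)
overlaps = qrel_overlap(queries, docs, qrels)
print(stats)
print(overlaps)
```

A judged-relevant pair with an overlap of zero shares no terms at all with its query, and those pairs are the ones worth inspecting by hand. Keep in mind that low overlap alone does not prove a judgement is wrong, since relevant documents can use different vocabulary than the query.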

So if you want to use the TIME corpus, use it at your own RISK.
