- A little analysis of the corpus reveals that it is built around two main topics; the rest of the topics are under-represented.
- The number of words is really high compared to the number of documents, which is why the representation itself is very sparse.
- Due to the high ratio of (number of words) to (number of documents), it is impossible to train a topic model on this corpus and expect it to represent the corpus well.
- The qrels (relevance judgements) are INCORRECT. A visual exploration comparing words reveals no similarity (at least on a term-by-term basis) between the queries and the documents marked relevant to those queries.
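
The checks behind the points above can be roughed out in a few lines of Python. This is only a minimal sketch, not the exact analysis that was run: the function names (`tokenize`, `corpus_stats`, `qrel_overlap`), the lowercase-letters-only tokenizer, and the toy documents, query and qrels are all assumptions made up for illustration, and none of the data comes from the TIME collection itself. It computes the words-to-documents ratio, the sparsity of the binary term-document matrix, and the Jaccard term overlap between each query and the documents its qrels mark as relevant.

```python
import re


def tokenize(text):
    # Assumption: lowercase alphabetic tokens only, no stemming or stopword removal.
    return re.findall(r"[a-z]+", text.lower())


def corpus_stats(docs):
    """docs maps doc_id -> raw text. Returns size, ratio and sparsity figures."""
    doc_terms = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    vocab = set().union(*doc_terms.values())
    n_docs, n_terms = len(docs), len(vocab)
    # Non-zero cells of a binary term-document matrix.
    nonzero = sum(len(terms) for terms in doc_terms.values())
    cells = n_docs * n_terms
    return {
        "n_docs": n_docs,
        "n_terms": n_terms,
        "terms_per_doc": n_terms / n_docs if n_docs else 0.0,
        "sparsity": 1.0 - nonzero / cells if cells else 0.0,
    }


def qrel_overlap(queries, docs, qrels):
    """Jaccard overlap between each query and every document marked relevant to it."""
    overlap = {}
    for query_id, relevant_ids in qrels.items():
        query_terms = set(tokenize(queries[query_id]))
        for doc_id in relevant_ids:
            doc_terms = set(tokenize(docs[doc_id]))
            union = query_terms | doc_terms
            shared = query_terms & doc_terms
            overlap[(query_id, doc_id)] = len(shared) / len(union) if union else 0.0
    return overlap


# Made-up toy data, purely to show the call pattern.
toy_docs = {"d1": "the cat sat", "d2": "dogs bark loudly"}
toy_queries = {"q1": "cat sat"}
toy_qrels = {"q1": ["d1", "d2"]}

stats = corpus_stats(toy_docs)
overlaps = qrel_overlap(toy_queries, toy_docs, toy_qrels)
print(stats)
print(overlaps)
```

An overlap of zero for a pair that the qrels call relevant is the kind of mismatch described above, although exact term matching is a crude test and will miss synonyms and inflected forms.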
Friday, January 29, 2010
TIME corpus... blah!!!
The TIME corpus is the most useless and "wrong" corpus there could ever be.