Friday, January 29, 2010

TIME corpus... blah!!!

The TIME corpus is the most useless and "wrong" corpus there could ever be.

  • A little analysis of the corpus reveals that it is built around two main topics, and the rest of the topics are under-represented.
  • The number of words is really high compared to the number of documents, which is why the representation itself is very sparse.
  • Due to the high ratio of (number of words) to (number of documents), it is impossible to train a topic model on this corpus and expect it to represent the corpus well.
  • The qrels (relevance judgements) are INCORRECT. A visual exploration comparing words reveals no similarity (at least on a term-to-term basis) between the queries and the documents marked relevant to them.
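
For anyone who wants to repeat these checks, here is a rough sketch of two of them: the words-to-documents ratio and the term overlap between each query and the documents its qrels mark as relevant. This is not the exact analysis behind the points above. The tokenizer, the function names, and the toy data are all assumptions made up for illustration, and loading the actual TIME files is left out.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of letters (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def vocab_to_doc_ratio(docs):
    """Number of distinct terms divided by the number of documents."""
    vocab = set()
    for text in docs.values():
        vocab.update(tokenize(text))
    return len(vocab) / len(docs)


def query_term_overlap(query, doc):
    """Fraction of distinct query terms that also occur in the document."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(doc))) / len(q_terms)


def overlap_for_qrels(queries, docs, qrels):
    """Map each (query_id, doc_id) pair marked relevant to its term overlap."""
    return {
        (qid, did): query_term_overlap(queries[qid], docs[did])
        for qid, rel_ids in qrels.items()
        for did in rel_ids
    }


# Toy data, invented for illustration; NOT taken from the TIME corpus.
docs = {
    "d1": "the president met the senate",
    "d2": "rain fell on the coast",
}
queries = {"q1": "president senate vote"}
qrels = {"q1": ["d1", "d2"]}

print(vocab_to_doc_ratio(docs))
print(overlap_for_qrels(queries, docs, qrels))
```

A pair marked relevant with an overlap of zero, like ("q1", "d2") in the toy data, is the kind of judgement worth inspecting by hand. Plain term overlap ignores stemming and synonyms, so a low score is only a hint, not proof that a judgement is wrong.
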
So if you want to use the TIME corpus, use it at your own RISK.
