Confusion Track: Studies the impact of corruption on the retrieval of known items. The corpus contains 55,600 documents in 3 different versions. The first version is the true text, and the second and third versions are produced by 5% and 20% degradation of the original documents. There are 49 queries for which retrieval is measured. The task is to perform data cleaning so that the MAP (mean average precision) comes as close as possible to that obtained with the first version of the corpus.
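To make the evaluation target concrete, the following is a minimal Python sketch of how MAP could be computed for a run and compared between the clean and a degraded version of the corpus. The function names, document IDs, and rankings are hypothetical illustrations, not part of the track's official tooling.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against its relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(average_precision(run.get(qid, []), rel)
                for qid, rel in qrels.items())
    return total / len(qrels)


# Hypothetical known-item judgements: one target document per query.
qrels = {"q1": {"d3"}, "q2": {"d7"}}

# Hypothetical rankings from the clean and from a degraded corpus version.
clean_run = {"q1": ["d3", "d1"], "q2": ["d2", "d7"]}
degraded_run = {"q1": ["d1", "d3"], "q2": ["d2", "d5", "d7"]}

map_clean = mean_average_precision(clean_run, qrels)
map_degraded = mean_average_precision(degraded_run, qrels)
# The gap that data cleaning is meant to close.
map_gap = map_clean - map_degraded
```

With a single relevant document per query, as in known-item search, average precision reduces to the reciprocal of the rank at which the target is found.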
There are multiple tasks and sub-tasks in this track of TREC. These include blog distillation and opinion polarity. There are about 100,000 blogs in this dataset and 50 queries for which opinion polarity was provided as ground truth. These opinions are categorized as relevant, not relevant, negative, positive, or mixed.
The goal of this track is to study interactions within an organization, mainly through email discussions, intranet pages and document repositories. W3C mailing lists were mined to study two main tasks: email discussion search and expert search. In the email discussion search task, the email discussions are mined for opinion. The focus was on topics for which different people had conflicting opinions. The second task (expert search) returns a ranking of users as candidate experts on a topic. There were 198K email discussions mined, and 50 queries were used to evaluate the expert-search task.
Entity Track: The task here is to retrieve not just documents but entities related to the query. The target entity is specified by the user as a part of the query. The dataset used is a subset of the ClueWeb09 dataset containing 50 million pages. In the first year of this track, 20 queries were assessed. In subsequent years, 50 more queries were added.
DNA and RNA sequences (genomes and proteomes) are indexed by the NCBI (National Center for Biotechnology Information), and each of these gene functions is linked to other publicly available medical datasets (listed below) by means of the LocusLink database:
- Medline (Documenting the first discovery of that gene function)
- GenBank (containing nucleotide sequence)
- Online Mendelian Inheritance in Man (OMIM) (diseases these gene functions may cause)
LocusLink also contains GeneRIF (Gene Reference Into Function). This links the gene function with an article in Medline along with a short textual description. These serve as pseudo-relevance judgements for an ad-hoc IR task:
- The query is the short textual description for that gene function
- The linked Medline documents are the pseudo-relevant set for that gene function
The dataset consists of 525K documents. Of these queries, 50 were used for training and 50 for testing.
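The construction of queries and pseudo-relevance judgements from GeneRIF entries can be sketched as follows. This is only an illustration under stated assumptions: the `GeneRIFRecord` type, its field names, and the sample identifiers are hypothetical and do not reflect the actual LocusLink file format, and keying queries by gene ID is a simplification chosen here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class GeneRIFRecord(NamedTuple):
    """Hypothetical, simplified view of one GeneRIF entry."""
    gene_id: str       # identifier of the gene function (assumed field)
    pmid: str          # identifier of the linked Medline article (assumed field)
    description: str   # short textual description of the gene function


def build_pseudo_qrels(
    records: Iterable[GeneRIFRecord],
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Turn GeneRIF entries into queries and pseudo-relevant document sets."""
    queries: Dict[str, str] = {}
    qrels: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        # Assumption: the first description seen for a gene is its query text.
        queries.setdefault(rec.gene_id, rec.description)
        # Every linked Medline article joins the pseudo-relevant set.
        qrels[rec.gene_id].add(rec.pmid)
    return queries, dict(qrels)


# Hypothetical sample entries, for illustration only.
records = [
    GeneRIFRecord("gene-1", "pmid-100", "hypothetical description A"),
    GeneRIFRecord("gene-1", "pmid-101", "hypothetical description A"),
    GeneRIFRecord("gene-2", "pmid-200", "hypothetical description B"),
]
queries, qrels = build_pseudo_qrels(records)
```

The resulting `queries` and `qrels` mappings have the shape that a standard ad-hoc evaluation expects, so no manual relevance assessment is needed.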