Sunday, November 25, 2012

TREC tracks and their meanings: A high level overview


Confusion Track:

Study of the impact of corruption on the retrieval of known items. The corpus contains 55,600 documents in three versions: the first is the true text, while the second and third were produced by 5% and 20% degradation of the original documents. There are 49 queries for which retrieval is measured. The task is to perform data cleaning so that MAP comes as close as possible to that obtained with the first (clean) version of the corpus.

Blog Track:


There are multiple tasks and sub-tasks in this track, including blog distillation and opinion polarity. The dataset contains about 100,000 blogs, with 50 queries for which opinion polarity is provided as ground truth. Opinions are categorized as relevant, not relevant, negative, positive, or mixed.

Enterprise Track:


The goal of this track is to study interactions within an organization, mainly through email discussions, intranet pages, and document repositories. W3C mailing lists were mined to study two main tasks: email discussion search and expert search. In the email discussion search task, email threads are mined for opinions, with a focus on topics about which different people held conflicting views. The second task (expert search) returns a ranking of users as candidate experts on a topic. In all, 198K email discussions were mined, and 50 queries were used to evaluate the expert-search task.

Entity Track:

The task here is to retrieve not just documents but entities related to the query; the target entity is specified by the user as part of the query. The dataset used is a subset of the ClueWeb09 collection containing 50 million pages. In the very first year of this track, 20 queries were assessed; in subsequent years, 50 more queries were added.

Genomics Track:


DNA and RNA sequences (genomes and proteomes) are indexed by the NCBI (National Center for Biotechnology Information), and each of these gene functions is linked to other publicly available medical datasets (listed below) by means of the LocusLink database:


  • Medline (Documenting the first discovery of that gene function)
  • GenBank (containing nucleotide sequence)
  • Online Mendelian Inheritance in Man (OMIM) (diseases these gene functions may cause)

LocusLink also contains GeneRIF (Gene Reference Into Function) entries. Each entry links a gene function with an article in Medline along with a short textual description. These serve as pseudo relevance judgments for an ad-hoc IR task:

  • The query is the short textual description for that gene function
  • The Medline documents linked are the pseudo-relevant set for that gene function

The dataset consists of 525K documents; 50 of these queries were used for training and 50 for testing. A small sketch of this query/judgment construction is given below.
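To make that mapping concrete, here is a minimal Python sketch with made-up GeneRIF records and identifiers (the real TREC Genomics construction is of course more involved):

# Hypothetical GeneRIF records: (gene_id, short_description, pubmed_id)
generifs = [
    ("GENE1", "regulates apoptosis in neurons", "PMID100"),
    ("GENE1", "regulates apoptosis in neurons", "PMID101"),
    ("GENE2", "involved in lipid transport", "PMID200"),
]

queries = {}   # gene_id -> query text (the short GeneRIF description)
qrels = {}     # gene_id -> set of pseudo-relevant Medline documents

for gene_id, description, pmid in generifs:
    queries[gene_id] = description
    qrels.setdefault(gene_id, set()).add(pmid)

print(queries["GENE1"])   # regulates apoptosis in neurons
print(qrels["GENE1"])     # {'PMID100', 'PMID101'}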

Legal Track

This track was first started in 2006 by researchers at the Complex Document Information Processing (CDIP) group at IIT Chicago, who called the collection IIT CDIP version 1.0. The mining was carried out on the only publicly available legal documents, released as part of the Master Settlement Agreement; these documents came from lawsuits filed against tobacco companies and other health-related litigation. The documents were scanned and then OCRed (Optical Character Recognition) by a team of researchers at the University of Southern California, creating a large (1.5TB) dataset. The IIT Chicago researchers extracted documents from this set amounting to about 7 million documents. A team of lawyers working for the Sedona Conference created hypothetical complaints falling into the following five categories: 1) investigation into a campaign, 2) consumer protection lawsuit, 3) product liability, 4) insider trading, and 5) antitrust lawsuits. In all, 43 queries were created, for which the relevance judgments were populated by pooling initial results from 6 research teams.

More to come..

Monday, November 5, 2012

In much need of inspiration

Nora Denzel's talk at Grace Hopper 2012  was very inspiring. Here are the key ideas from her talk.

I remember this every day during this last trying year of my PhD.

Thanks, Nora. You are an inspiration and I owe my PhD to you.
 

Thursday, November 1, 2012

Quick-Reference: How to combine multiple pdfs in Ubuntu using command-line

sudo apt-get install gs pdftk

Use

gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=combinedpdf.pdf -dBATCH 1.pdf 2.pdf 3.pdf

I must admit this is not an original finding. Please refer to This Post for the original instruction list; I am basically putting it down here for my quick reference.
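Since pdftk is installed above anyway, the same merge can also be done with its standard cat syntax (this alternative is my addition, not from that post):

pdftk 1.pdf 2.pdf 3.pdf cat output combinedpdf.pdf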

Wednesday, October 17, 2012

Adjusting floatfraction

 
If you need to re-adjust the float fractions of your LaTeX documents, the snippet below does the trick.
I got this code snippet from this webpage and am documenting it here to save time for future reference.
 
% Alter some LaTeX defaults for better treatment of figures:
    % See p.105 of "TeX Unbound" for suggested values.
    % See pp. 199-200 of Lamport's "LaTeX" book for details.
    %   General parameters, for ALL pages:
    \renewcommand{\topfraction}{0.9} % max fraction of floats at top
    \renewcommand{\bottomfraction}{0.8} % max fraction of floats at bottom
    %   Parameters for TEXT pages (not float pages):
    \setcounter{topnumber}{2}
    \setcounter{bottomnumber}{2}
    \setcounter{totalnumber}{4}     % 2 may work better
    \setcounter{dbltopnumber}{2}    % for 2-column pages
    \renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text
    \renewcommand{\textfraction}{0.07} % allow minimal text w. figs
    %   Parameters for FLOAT pages (not text pages):
    \renewcommand{\floatpagefraction}{0.7} % require fuller float pages
 % N.B.: floatpagefraction MUST be less than topfraction !!
    \renewcommand{\dblfloatpagefraction}{0.7} % require fuller float pages

 % remember to use [htp] or [htpb] for placement

Saturday, September 29, 2012

Poster creation quick tip

So I have been struggling with creating posters for a conference I am attending next week. Beamer is definitely the way to go, but creating a poster in beamer is painful. So here is a workaround.

  • Create the slides in beamer. For an A0 poster, roughly 16 slides or fewer will do.
  • Create a PDF of the slides.
  • Convert the PDF to high-resolution JPEG images:

 gs -dNOPAUSE -dBATCH -sDEVICE=jpeg -r2100 -sOutputFile='page-%d.jpg' Flyer.pdf

  • Import these images into Open Office and assemble the gigantic poster there.

Thursday, September 6, 2012

Randomized algorithm for Verification of Matrix Multiplication

Suppose you want to verify whether AB = C, where A is m by n and B is n by k. Computing AB directly and comparing it with C takes O(mnk + mk) = O(mnk) time.

What if instead you had a random vector x and checked whether ABx = Cx? Here x is of dimensionality k by 1 and is created by sampling its entries from a Gaussian distribution.

Now the time taken to compute Bx is O(nk); the result is of dimensionality n by 1, so computing A(Bx) takes a further O(mn). Similarly, the time taken to compute Cx is O(mk).

So the total time taken to verify is O(mn + nk + mk), not O(mnk). If AB ≠ C, then (AB - C)x is nonzero for almost every random x, so a passing check means AB = C with high probability; repeating with a few independent random vectors makes the check even more reliable.
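Here is a minimal numpy sketch of this check (function and variable names are my own, and the Gaussian sampling follows the description above; this is an illustration, not a library routine):

import numpy as np

def verify_product(A, B, C, trials=3):
    # Randomized check that A.dot(B) equals C without ever forming A.dot(B).
    # Each trial costs O(mn + nk + mk) instead of the O(mnk) full product.
    k = B.shape[1]
    for _ in range(trials):
        x = np.random.randn(k)                          # length-k Gaussian random vector
        if not np.allclose(A.dot(B.dot(x)), C.dot(x)):  # compare A(Bx) with Cx
            return False                                # products disagree: AB != C
    return True                                         # AB == C with high probability

m, n, k = 200, 300, 150
A = np.random.rand(m, n)
B = np.random.rand(n, k)
C_good = A.dot(B)
C_bad = C_good.copy()
C_bad[0, 0] += 1.0                                      # corrupt a single entry
print(verify_product(A, B, C_good))                     # True
print(verify_product(A, B, C_bad))                      # False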

Thursday, July 26, 2012

The hashing trick for a dynamically changing vocabulary

I recently learned about the "hashing" trick in machine learning. It is typically used to handle a dynamically changing vocabulary in large-scale machine learning algorithms. With the hashing trick, we always have a "fixed" vocabulary; we just don't know in advance which words it will contain. Words are hashed into an M-slot table, and no matter how many new words come in, they all map into those M slots.

It's amazing that Yahoo Research is the company that came up with the idea and implemented a widely used open-source package called Vowpal Wabbit. Researchers who propose new online algorithms with a dynamically changing feature set often implement their algorithms in Vowpal Wabbit. I think there is also an effort to implement Vowpal Wabbit on top of Hadoop.
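For concreteness, here is a minimal Python sketch of the trick (the MD5 hash and the table size M are my own choices for illustration; Vowpal Wabbit uses its own much faster hashing internally):

import hashlib
from collections import defaultdict

def hashed_features(tokens, M=2**20):
    # Map tokens to a sparse M-dimensional count vector.
    # The vocabulary never has to be known in advance: any token,
    # old or new, lands in one of the M fixed slots via its hash.
    vec = defaultdict(float)
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % M
        vec[idx] += 1.0
    return dict(vec)

# Two documents, including previously unseen words, share the same fixed feature space.
print(hashed_features("the quick brown fox".split()))
print(hashed_features("some completely new words zzzqx".split()))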

It is indeed sad that a company like Yahoo, which has been an innovator of so many cool ideas, is under the weather. Hopefully Marissa Mayer will turn things around.

For more information on Vowpal Wabbit visit https://github.com/JohnLangford/vowpal_wabbit/wiki/Examples