Thursday, December 27, 2012

Trying to revive the C3M algorithm

C3M is so much unlike other hierarchical  or k-means category of algorithms for clustering. The C3M (Cover-Coefficient Clustering Method) is a single-pass clustering algorithm built by researchers in IR (Information Retrieval). Hence, it is primarily meant for documents. Because IR researchers built this algorithm, they take advantage of the fact that term-document matrices are usually sparse and can be represented well by an inverted index. The main advantage of the algorithm is that it provides a way to estimate/compute the number of clusters given a document-term matrix.

 Here are some of the claims made by their paper " Concept and Effectiveness of the Cover Coefficient Based Clustering Methodology for Text Databases by Fazli Can and Esen OZkarahan".
a) Clusters are stable
b) Algorithm is independent of order of documents, hence, we will always a unique clusters
c) The memory overhead is really low
d) The algorithm distributes the documents evenly among cluster. In other words, it does not create fat clusters or singletons.

The algorithm starts of by examining the document-term matrix D of a collection of size m by n. From this it computes C ( m by m) and C' (n by n), where C[i,j] indicates the extent to which i covers j. The higher c[i,j] the higher the similarity between the two documents. Similarly, C' records how much one terms covers the other. From C and C' matrices it is possible to compute a document's seed power. The higher the seed power of a document, the higher the chance that it being selected to be a seed of a cluster.Once the cluster seeds are selected. For each document (not assigned yet) we find the closest cluster by simply examining c[i,j], where i corresponds to the unassigned document and j corresponds to each of the cluster seeds.

If m denotes the total number of documents, xd denotes the average number of words in each document and, tg denotes the average of documents in which a term appears, then the complexity of this clustering approch is O(xd X m X tg)

Creating the C matrix

initialize C[m,m] <- 0="0" p="p">for each document i
      for each term t_i in that document
                 Compute P(t_i) = D(i,t)/ sum_t {D(i,t)}
                 Get the inverted index for term t_i .. Lets call it inv(t_i)
                  compute p(d_j|inv(t_i)) for each document in the inverted list
                  for each document j in inv(t_i)
                               c[i,j] = c[i,j] + p(t_i)*p(d_j|inv(t_i))

Similarly one can compute c'

Compute number of clusters n_c = \sum_i c[i,i]

Computing Cluster seed power
for each document i
     P(i) = c[i,i]*(1-c[i,i])* sum_j  { d[i,j] c'[j,j] (1-c'[j,j])}

Pick the top n_c documents that have highest P(i) and set them as cluster seeds  {D_cluster_Seeds}

Assigning documents to cluster seeds
for each of the document i (that is not a cluster seed)
   assigned = argmax(j) {c[i,j] where j < {D_cluster_Seeds}}

Cluster Validity

This paper presents a unique way of gauging the cluster structure as it measures this in terms of the query. Given two cluster structures -one random assignment and second the cluster structure under test, a set of queries with the corresponding set of relevant documents, the cluster validity is computed as follows. Suppose a query has k relevant documents, then we compute the value $n_t$ and $n_tr$ corresponding to the test cluster structure and random clustering. The authors use $n_t$ or $n_tr$ to denote the number of target clusters. The target cluster is a cluster that contains at least 1 relevant document to the query. If the clustering is any good, then $n_t$ should be less than $n_tr$. The average of $n_t$ is computed with $n_tr$ over all the queries and may be additionally subject to significance testing.

Now let me explain how $n_t$ is computed. For a given query, the authors first find the size of the relevant set (from ground truth). Lets denote this by k. Given $n_c$ number of clusters, for each cluster, $P_i$ is computed. $P_i$ computes the probability that a cluster $C_i$ will be selected given that randomly k documents from the set of m documents were picked. There is a direct formula in the paper that computes this. Computing $P_1+P_2+...P_n_c$ yields $n_t$ for the test cluster. The procedure is repeated for the cluster structure obtained by random clustering.

Although it makes mathematical sense, I wonder what would be the actual number of target clusters for the queries based on the actual location of the relevant documents. I personally feel that this is a far better and more accurate representation of the reality. Of course, we can compare the number of target clusters with a random cluster structure or with another clustering algorithm.


Querying with different term weighting has been extensively covered in this paper and I will not be able to cover all of it here.

I am going to implement this algorithm and update this post with some results.

Monday, December 10, 2012

Quick Reference: Extract metadata information of a document

To Extract ID of a document in a corpus called myCorpus


I quickly strip the leading and trailing white spaces using this command below:


Sunday, November 25, 2012

TREC tracks and their meanings: A high level overview

Confusion Track:

Study Impact of corruption on retrieval of known items. The corpus contains 55,600 documents in 3 different versions. First version is the true text and these second and third version caused by 5% and 20% degradation of the original documents. There are 49 queries for which retrievals are measured. The task to perform data cleaning in order to get MAP as close to that obtain with first version of the corpus.

Blog Track:

There are multiple tasks and sub-tasks in this track of TREC. These include blog distillation and then opinion polarity. There are about 100,000 blogs in this dataset and 50 queries for which opinion polarity was provided as ground truth. These opinions are categorized into (relevant, not relevant, negative, positive, mixed)

Enterprise Track:

The goal of this track is to study interactions within an organization mainly through email discussions, intranet pages and document repositories. W3C mailing lists were mined to study two main tasks: Email discussion search and Expert Search. In the email discussion search task, the email discussions are mined for opinion. The focus was topics for which different people had conflicting opinions. The second task (expert search) returns a ranking of users as candidates for experts on a topic. There were 198K emails discussions mined and 50 queries were used to evaluate the expert-search task. 

Entity Track:

The task here is to extract not just documents but documents related to the query. The target entity is specified by the user as a part of the query. The dataset used is a subset of the ClueWeb09 dataset containing 50 million pages. The very first year of this track, 20 queries were accessed. In subsequent years 50 more queries were added.

Genomics Track:

DNA and RNA sequences (genomes and proteomes) are indexed by the NCBI (National Center for BioTechnology Information) and each of these gene functions are linked to other publicly available medical datasets (as mentioned below) by means of locuslink database:

  • Medline (Documenting the first discovery of that gene function)
  • GenBank (containing nucleotide sequence)
  • Online Mednelian Impact on Man (OMIM) (diseases these gene functions may cause). 

Locuslink also contains GeneRIF (Gene Reference Into Function). This links the gene function with an article in Medline along with a short textual description. These serve as psuedo relevance judgements for ad-hoc IR task

  • The query is the short textual description for that gene function
  • The medline documents linked is the pseudo- relevant set for that gene function

The dataset consists of 525K docs, 50 of these queries were used for training and 50 for testing.

Legal Track

This track was first started in 2006 by researchers at the Complex Document Information Processing at IIT Chicago and they called the track the IIT CDP version 1.0. The mining was carried out on the only publicly available legal documents released as a part of the Master Settlement Agreement. These legal documents contained lawsuits filed against tobacco companies and other health-related issues. These legal track documents were scanned and then OCRed (Optical Character Recognition) by team of researchers at Univ of Southern California creating a large (1.5TB) dataset . The IIT Chicago team of researchers extracted documents from this set amount to 7 million documents. A team of lawyers working for Sedona Conference created hypothetical complaints falling in the following 5 categories 1) investigation into a campaign 2) Consumer protection lawsuit 3) Product liability 4) insider trading 5) anti trust lawsuits. There are in all 43 queries created, for which the relevance judgments were populated by pooling initial results from 6 research teams.

More to come..

Monday, November 5, 2012

In much need of inspiration

Nora Denzel's talk at Grace Hopper 2012  was very inspiring. Here are the key ideas from her talk.

I remember this everyday during this last trying year of my PhD.

Thanks Norah. You are an inspiration and I owe my PhD to you.

Thursday, November 1, 2012

Quick-Reference: How to combine multiple pdfs in Ubuntu using command-line

sudo apt-get install gs pdftk


gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combinedpdf.pdf -dBATCH 1.pdf 2.pdf 3.pdf

I must admit this is not an original finding. Please refer to to This Post  for the original instruction list. I am basically putting it down here for my quick-reference

Wednesday, October 17, 2012

Adjusting floatfraction

If you need to re-adjust the float fraction of your latex documents.
I got this code snippet from this webpage
Just documented it here in order to save time for future references
% Alter some LaTeX defaults for better treatment of figures:
    % See p.105 of "TeX Unbound" for suggested values.
    % See pp. 199-200 of Lamport's "LaTeX" book for details.
    %   General parameters, for ALL pages:
    \renewcommand{\topfraction}{0.9} % max fraction of floats at top
    \renewcommand{\bottomfraction}{0.8} % max fraction of floats at bottom
    %   Parameters for TEXT pages (not float pages):
    \setcounter{totalnumber}{4}     % 2 may work better
    \setcounter{dbltopnumber}{2}    % for 2-column pages
    \renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text
    \renewcommand{\textfraction}{0.07} % allow minimal text w. figs
    %   Parameters for FLOAT pages (not text pages):
    \renewcommand{\floatpagefraction}{0.7} % require fuller float pages
 % N.B.: floatpagefraction MUST be less than topfraction !!
    \renewcommand{\dblfloatpagefraction}{0.7} % require fuller float pages

 % remember to use [htp] or [htpb] for placement

Saturday, September 29, 2012

Poster creation quick tip

So I have been struggling with creating posters for a conference I am attending next week. The beamer is definitely the way to go but creating a poster in beamer is painful. So here is a work-around.

Create the slides in beamer.  If you have an A0 poster then roughly 16 slides or less
Create a pdf of the slides
Convert pdf to high resolution jpeg images
 gs -dNOPAUSE -dBATCH -sDEVICE=jpeg -r2100 -sOutputFile='page-d.jpg' Flyer.pdf

Import these images in Open Office and create a gigantic poster in Open Office.

Thursday, September 6, 2012

Randomized algorithm for Verification of Matrix Multiplication

Suppose you want to verify if AB = C and A is m by n and B is n by k then, the time taken for this is O(mnk + nk) = O(mnk)

What if you had a vector x, then ABx = Cx...  here x is a random vector.
x  is of dimensionality k by 1  and created by randomly sampling from a gaussian distribution.

Now the time taken to compute Bx is O(nk)  and is of dimentionaity n by 1 and time taken to compute ABx is O(mn). Similarly, time taken to compute Cx is O(mk)...

So the total time taken to verify is O(mn+nk+mk) and not O(mnk)

Thursday, July 26, 2012

The hashing trick for a dynamically changing vocabulary

I recently learned about the "hashing" trick in Machine Learning. It is typically used to handle dynamically changing vocabulary in large-scale machine learning algorithms. With the hashing trick, we always have a "fixed" vocabulary. Only thing is what this vocabulary is, we dont know. The words are hashed into a M- length hash table, and no matter how many new words come in the hash contains all of these tables. Its amazing that Yahoo research is the company that came up with the idea and implemented a widely used open source software called the vowpal wabbit. Researchers who propose new online algorithms with a dynamically changing feature set implement their algorithm in vowpal Wabbit. I think there is also an effort to implement vowpal wabbit on top of hadoop.

It is indeed sad that a company like Yahoo which has been an innovator of so many cool ideas is under the weather. Hopefully Marissa Mayer will turn things around.

For more information on Vowpal Wabbit visit

Tuesday, July 3, 2012

PdfLatex wont render images in the pdf output

If you are writing a conference paper and using latex to create your paper, then this post might be useful to you.

Typically each document has a header of the following sort.
 If you have chosen the "draft" option as in "\documentclass[10pt,a4paper,draft]{article}", then a simple code as given below, will not work.

\caption{Pipeline: Fetch Instruction}
 The figure wont render in the output pdf if the "draft" option is selected in the \document command.

Make sure you say its final, or just leave that option alone, like this:  \documentclass[10pt,a4paper]{article}.

Sunday, June 24, 2012

Measuring Query Difficulty

The trail of this reading began with Sonia Haiduc's paper "On the Effect of the Query in IR-based Concept Location". A lot of the paper is a review of the recent research focus in IR "predicting query difficulty". I was surprised to see that this paper was accepted. There were no results, not even an explanation of how evaluating query difficulty is different for concept location compared to traditional IR. Nevertheless, there was a good amount of literature review that guided me to several other interesting papers. There are "predictors" as well as "estimators" of query quality. Predictors usually examine the content of the query alone and generate a score that represents the query difficulty. The estimators use the ranked list to generate a score. The evaluation technique for all the algorithms in the umbrella of query difficulty prediction goes as follows:
The predictor or the estimator gives out score for each query that is expected to have correlation with the Average Precision for that query. The query quality score is correlated with the AP over all the queries in the entire test collection and in order to be doubly sure, a statistical significance test is conducted to illustrate that the two distributions originated from the same original system (demonstrate high p-value).

In order to get a better idea of how the specificities of the predictor or estimator algorithm goes I referred to the highly cited paper " Predicting Query Performance, SIGIR 2002". This paper is output of the Bruce Croft group and lives up to its reputation. The paper discuses an clarity based score, which measures the separateness of the query from the collection model. The following formula measures the divergence  = \sum_w p(w|q) log_2(p(w|q)/p(w|coll)). The numerator measures the entropy of the query while the denominator the divergence from the collection model. the p(w|q) is the query model. The experimentation is extensive spanning multiple test collections and setting up quantitative proof for all the thresholds they use in their experiments.
"Query Difficulty, Robustness, and Selective Application of Query Expansion" published in Advances in Information Retrieval in 2004, was one of the first to come up with a "use" of measuring query difficulty. They employed a number of retrieval models and test collections for their study. Not only did they correlate query difficulty with the P@10, AP and so on, they used the "relevance judgements"  also to understand its correlation with query difficulty.

There has been some work in the area of "traceability link recovery" for dealing with difficult queries. The method employed is simple. For queries that show poor performance, one uses an web based search to mine documents that could be used for query reformulation. I wonder if this comes under the umbrella of multi-modal retrieval.

I think it is worthwhile to spend some using these metrics on my own datasets and isolating "difficult" queries ahead of time to aid analysis.

Wednesday, April 18, 2012

Handling duplicates in ranking documents for Retrieval Algorithms

I never paid attention to the ranking step of a retrieval algorithm. All I had to do was sort the scores. These scores are computed by using a similarity (or divergence) function to compare a query with all the documents in the collection. What I had overlooked was the possibility of scores being equal. How do you rank documents with similar scores? Does it have a great impact on retrieval performance? Turns out, the answer to the second question is YES. To give a simple example, consider 6 documents with the following scores,

doc no   1      2      3       4    5        6
scores 0.87 0.97 0.97 0.85 0.49 0.87

Simple ranking would give out the following output

doc no   1      2      3    4    5   6
rank       3     1      2    5    6   4

If doc 3 is the relevant document,  it is still considered to be at rank 2 as opposed to rank 1(where it actually should be since it scores as much as doc2). The Precision goes from 1.00 to 0.5 just because of this sort of ranking.

This wikipedia article summarizes on ways to handle this. It seems like a trivial concept but it has tremendously improved bug localization results on my iBUGS dataset.

For 291 bugs, I compute the retrieval performance by using the "standard competition" ranking (1 2 2 4) and the retrieval performance shot up from 0.14 to 0.21 with a p-value of 0. Thus the improvement is statistically significant.

Monday, February 27, 2012

Quick Reference: cvs2cl

CVS is a primitive and unfriendly tool for version control (at least in my experience).

Gladly there is a tool called cvs2cl that makes it easier to manage the log files.

Here are the key concepts

1)Install cvs and cvs2cl (both available through sudo apt-get install commands)
2)Checkout the version you want into a directory
3) cd  
4) in this directory execute cvs2cl command

cvs2cl -T  gives out the changes in chronologically reverse order with locations where tags first appeared. But a word of caution, the tag here means, where a tag was first encountered as it appears in the logfile.

I tried
cvs2cl -T --chrono . But it gives out  inaccurate or even incorrect information here. Since, all it is doing is reversing the log file, and not really checking out information in reverse order

The option --xml outputs in xml format.

The overall command to track changes in trunk only with tag information when the tag first occurred (reverse chronological order) use this

cvs2cl --xml --no-wrap --summary --T -S -r --follow-only trunk

Each file in cvs is in a particular state, that is encoded in tag in the xml file. Meaning of file state is "Exp" most of the time and stands for experimental. Since we are tracking most of the files will be exp state, but in a release branch you might find "stab" for stable version of the file.

Hope this helps

Monday, January 23, 2012

Quick Reference: Student t's paired t-test for comparing retrieval algorithms

Given a set of queries Q of size n, two retrieval algorithms need to compared
calculate AP (Average precision) for each query using both the algorithms, call one AP1 and the second one AP2

To calculate if they are statistically significant, here are the steps

a) calculate D = AP1-AP2
b) t-score = mean(D)/(standard_deviation(D)/sqrt(n))
c) Feed it into a p-value calculator with df(degrees of freedom) = n-1 Click here

Pretty neat and quick