Wednesday, April 18, 2012

Handling duplicates in ranking documents for Retrieval Algorithms

I never paid much attention to the ranking step of a retrieval algorithm. All I had to do was sort the scores, which are computed by comparing the query against every document in the collection using a similarity (or divergence) function. What I had overlooked was the possibility of tied scores. How do you rank documents with equal scores? Does it have a big impact on retrieval performance? It turns out the answer to the second question is YES. As a simple example, consider 6 documents with the following scores:

doc no     1     2     3     4     5     6
score    0.87  0.97  0.97  0.85  0.49  0.87

Simple (ordinal) ranking would produce the following output:

doc no     1     2     3     4     5     6
rank       3     1     2     5     6     4
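The simple ranking above can be sketched in a few lines. This is a minimal illustration (the doc ids and scores are just the toy example from this post), where ties end up broken arbitrarily by document order, which is exactly the problem:

```python
# Simple (ordinal) ranking: sort by score descending and number the
# positions 1..n. Tied documents get distinct ranks depending only on
# the (arbitrary) order the sort happens to leave them in.
scores = {1: 0.87, 2: 0.97, 3: 0.97, 4: 0.85, 5: 0.49, 6: 0.87}

order = sorted(scores, key=lambda d: -scores[d])  # doc ids, best first
ordinal_rank = {doc: i + 1 for i, doc in enumerate(order)}
print(ordinal_rank)  # {2: 1, 3: 2, 1: 3, 6: 4, 4: 5, 5: 6}
```

Note that doc 3 lands at rank 2 even though its score is identical to doc 2's.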

If doc 3 is the relevant document, it is still placed at rank 2 as opposed to rank 1, where it actually belongs since it scores as much as doc 2. Precision drops from 1.00 to 0.50 purely because of how the tie was broken.

This Wikipedia article summarizes ways to handle this. It seems like a trivial concept, but it has tremendously improved my bug localization results on the iBUGS dataset.

For 291 bugs, I computed the retrieval performance using "standard competition" ranking (1 2 2 4), and it shot up from 0.14 to 0.21 with a p-value of effectively 0. The improvement is statistically significant.
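Standard competition ranking is easy to bolt onto the same toy example. A minimal sketch, assuming exact floating-point equality counts as a tie (real score functions may need a small tolerance instead):

```python
# Standard competition ranking: tied documents share the best rank of
# their group, and the rank after a group of k ties skips k-1 values
# (the "1 2 2 4" scheme).
scores = {1: 0.87, 2: 0.97, 3: 0.97, 4: 0.85, 5: 0.49, 6: 0.87}

order = sorted(scores, key=lambda d: -scores[d])  # doc ids, best first
comp_rank = {}
for i, doc in enumerate(order):
    if i > 0 and scores[doc] == scores[order[i - 1]]:
        # Tied with the previous document: inherit its rank.
        comp_rank[doc] = comp_rank[order[i - 1]]
    else:
        # Not tied: rank is the 1-based position, leaving gaps after ties.
        comp_rank[doc] = i + 1
print(comp_rank)  # {2: 1, 3: 1, 1: 3, 6: 3, 4: 5, 5: 6}
```

With this scheme doc 3 shares rank 1 with doc 2, so a relevant doc 3 is credited at rank 1 rather than rank 2. (For real experiments, `scipy.stats.rankdata` with `method='min'` computes the same ranking.)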