Sunday, June 24, 2012

Measuring Query Difficulty

The trail of this reading began with Sonia Haiduc's paper "On the Effect of the Query in IR-based Concept Location". Much of the paper reviews a recent research focus in IR: predicting query difficulty. I was surprised that this paper was accepted: it reports no results, and it does not even explain how evaluating query difficulty for concept location differs from traditional IR. Nevertheless, the literature review is substantial and guided me to several other interesting papers. The field distinguishes "predictors" from "estimators" of query quality. Predictors usually examine the content of the query alone and generate a score that represents the query difficulty, while estimators use the ranked result list to generate the score. The evaluation technique for all the algorithms under the umbrella of query difficulty prediction goes as follows:
The predictor or the estimator produces a score for each query that is expected to correlate with the Average Precision (AP) for that query. The query quality scores are correlated with AP over all the queries in the test collection, and, to be doubly sure, a statistical significance test is conducted to show that the correlation is unlikely to have arisen by chance (a low p-value).
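The evaluation protocol above can be sketched in a few lines. The scores and AP values below are made-up illustrative numbers, and the helper is a hand-rolled Pearson correlation rather than anything from the papers discussed:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy per-query data (assumed, for illustration):
# one difficulty score from the predictor/estimator, one AP per query.
difficulty_scores = [0.8, 0.3, 0.5, 0.9, 0.2, 0.7]
average_precision = [0.6, 0.2, 0.4, 0.7, 0.1, 0.5]

# A good predictor yields a correlation close to 1 (or -1, depending on
# whether the score measures quality or difficulty); a significance test
# on this coefficient would then confirm it is not a chance artifact.
print(pearson(difficulty_scores, average_precision))
```

In practice one would use a library routine (e.g. `scipy.stats.pearsonr` or `spearmanr`) that also returns the p-value for the significance test.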

In order to get a better idea of the specifics of a predictor or estimator algorithm, I referred to the highly cited paper "Predicting Query Performance" (SIGIR 2002). This paper is output of the Bruce Croft group and lives up to its reputation. The paper discusses a clarity-based score, which measures how far the query's language model diverges from the collection model: clarity(q) = \sum_w p(w|q) log_2( p(w|q) / p(w|coll) ). Here p(w|q) is the query language model and p(w|coll) the collection model; the score is the KL divergence of the former from the latter, so a query whose term distribution looks like the collection at large scores low. The experimentation is extensive, spanning multiple test collections and setting up quantitative proof for all the thresholds they use in their experiments.
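A minimal sketch of the clarity computation, using tiny made-up language models (the vocabulary and probabilities are assumptions, not from the paper):

```python
import math

def clarity(query_model, collection_model):
    """KL divergence: sum_w p(w|q) * log2( p(w|q) / p(w|coll) )."""
    return sum(pq * math.log2(pq / collection_model[w])
               for w, pq in query_model.items() if pq > 0)

# Toy distributions over a tiny vocabulary (illustrative only).
collection_model = {"java": 0.02, "memory": 0.03, "leak": 0.01, "the": 0.94}
specific_query = {"java": 0.4, "memory": 0.3, "leak": 0.3}  # focused query
vague_query = {"the": 0.8, "java": 0.2}                     # generic query

# A focused query diverges more from the collection model, so it
# gets a higher clarity score than a vague one.
print(clarity(specific_query, collection_model))
print(clarity(vague_query, collection_model))
```

In the real algorithm p(w|q) is itself estimated from the top-ranked documents for the query, not taken directly from the query terms; this sketch only shows the divergence step.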

"Query Difficulty, Robustness, and Selective Application of Query Expansion", published in Advances in Information Retrieval in 2004, was one of the first papers to come up with a "use" for measuring query difficulty. The authors employed a number of retrieval models and test collections for their study. Not only did they correlate query difficulty with P@10, AP, and so on, they also used the relevance judgements themselves to understand their correlation with query difficulty.
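The "selective application" idea reduces to gating expansion on the predicted difficulty. The threshold value and helper names below are assumptions made for illustration, not the paper's actual implementation:

```python
# Assumed tuning parameter: expand only queries predicted to be
# easy enough that expansion is unlikely to hurt them.
EXPANSION_THRESHOLD = 0.4

def retrieve(query, predicted_quality, expand_fn, search_fn):
    """Apply query expansion selectively, based on a quality prediction."""
    if predicted_quality >= EXPANSION_THRESHOLD:
        query = expand_fn(query)
    return search_fn(query)

# Toy stand-ins for the expansion and search components.
expand = lambda q: q + " synonyms"
search = lambda q: "results for: " + q

print(retrieve("memory leak", 0.8, expand, search))   # expanded
print(retrieve("stuff broken", 0.1, expand, search))  # left as-is
```

The design point is that the difficulty score is consumed upstream of retrieval, as a routing decision, rather than only reported as an evaluation number.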

There has been some work in the area of traceability link recovery on dealing with difficult queries. The method employed is simple: for queries that show poor performance, a web-based search is used to mine documents that can feed query reformulation. I wonder if this comes under the umbrella of multi-modal retrieval.

I think it is worthwhile to spend some time applying these metrics to my own datasets and isolating "difficult" queries ahead of time to aid analysis.