Given a set Q of n queries, two retrieval algorithms need to be compared.
Calculate AP (average precision) for each query using both algorithms; call the results AP1 and AP2.
To test whether the difference between them is statistically significant (a paired t-test), here are the steps:
a) Calculate the per-query differences D = AP1 - AP2
b) t-score = mean(D) / (standard_deviation(D) / sqrt(n)), where standard_deviation is the sample standard deviation
c) Feed the t-score into a p-value calculator with df (degrees of freedom) = n - 1
Pretty neat and quick
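The steps above can be sketched in Python roughly as follows. This is a minimal sketch using only the standard library, not a reference implementation: the function names (paired_t_score, t_pdf, two_sided_p) and the sample AP values are made up for illustration, and the p-value is approximated by numerically integrating the Student's t density with Simpson's rule instead of calling an online calculator.

```python
import math
import statistics


def paired_t_score(ap1, ap2):
    """Return (t, df) for per-query AP scores of two systems on the same queries."""
    if len(ap1) != len(ap2):
        raise ValueError("both systems must be scored on the same queries")
    d = [a - b for a, b in zip(ap1, ap2)]   # step a: per-query differences
    n = len(d)
    sd = statistics.stdev(d)                # sample standard deviation (divides by n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t-score is undefined")
    t = statistics.mean(d) / (sd / math.sqrt(n))  # step b
    return t, n - 1                         # df = n - 1


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=10000):
    """Step c: approximate two-sided p-value (steps must be even for Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                    # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


# Hypothetical AP values for five queries, for illustration only.
ap1 = [0.50, 0.62, 0.41, 0.70, 0.55]
ap2 = [0.45, 0.60, 0.43, 0.61, 0.50]
t, df = paired_t_score(ap1, ap2)
print(f"t = {t:.3f}, df = {df}, p = {two_sided_p(t, df):.4f}")
```

If SciPy is available, scipy.stats.ttest_rel(ap1, ap2) should give the same t-score and two-sided p-value in one call; the hand-rolled version is only meant to mirror steps a) to c).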