I have been researching how restricting the vocabulary of a dataset affects retrieval results and I am seeing some really counter-intuitive results
This the format of the experiment:
I create a vocabulary using some percentage of the documents not all. The percentages I used was (10,20,30,40,50,60,70,80,90) and then performed retrieval and calculated MAP
There were two key observations that I found interesting.
1) For software corpora, I found that even with 10% of documents, 45% of the original vocabulary was built. This means that software vocabulary tends to have a more uniform distribution of the terms/identifiers/variable-names across source files
2) With the Vector Space Model with tf-idf weighting, I found that with 10-30% of documents used to create the vocabulary (or dictionary) I got better performance as compared to original vocabulary. The only explanation I have for this result is that, some words are more important for retrieval than others. I do need to mention that the original vocabulary itself is a pruned version of the original raw vocabulary obtained by removing sparse terms in the document-term Matrix.