Testing an IRS can get so freaking hard. Vocabulary sizes can run up to a few tens of thousands of terms, which means you are dealing with a very high dimensional space. The term weights or term probabilities are extremely small, so small that you end up working in log scale only. The relevance judgements for some queries are not reliable: a query may dig up documents that seem relevant to you (as a programmer/user) even though the standardized judgements have not tagged them as relevant. This gets increasingly frustrating because you might be led to believe that there is a bug in your code. And changes made to the query or the documents in response to relevance feedback are hard to interpret and understand when you are dealing with tens of thousands of words and tens of thousands of documents.
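To see why log scale is unavoidable, here is a minimal sketch (the probability value 1e-5 and the count of 100 terms are made up for illustration): multiplying many tiny term probabilities directly underflows a float to zero, while summing their logs stays perfectly representable.

```python
import math

# Hypothetical scoring scenario: 100 query/document terms,
# each with probability 1e-5 under some language model.
probs = [1e-5] * 100

# Naive product underflows: 1e-5 ** 100 = 1e-500, far below
# the smallest positive float64 (~1e-308).
direct = 1.0
for p in probs:
    direct *= p
print(direct)       # 0.0 -- silent underflow

# Summing logs keeps the score in a comfortable range.
log_score = sum(math.log(p) for p in probs)
print(log_score)    # about -1151.29
```

This is why IR scoring code compares documents by log-probability sums rather than raw probability products.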
To circumvent this problem, it is essential to first create your own teeny tiny corpus of a few tens of documents over a few tens of words. If you wish to build topic models over your corpus, make sure that your document-term matrix is tall and thin (many documents, few terms, mostly dense rather than sparse). If you are using the basic unigram and VSM kind of approaches, a short and stout (sparse) matrix might do the magic.
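As a rough sketch of what "tall and thin" means here (the sizes and the Poisson count model are arbitrary choices of mine, not anything prescribed), a toy document-term matrix of 50 documents over 10 terms is small enough to eyeball entry by entry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tall and thin" matrix: 50 documents x 10 terms.
# Poisson counts with a mean of 3 give mostly nonzero entries,
# i.e. a dense matrix you can inspect by hand.
n_docs, n_terms = 50, 10
counts = rng.poisson(lam=3.0, size=(n_docs, n_terms))

print(counts.shape)                 # (50, 10)
density = np.count_nonzero(counts) / counts.size
print(round(density, 2))            # mostly nonzero, ~0.95
```

At this scale you can print the whole matrix, trace a single term weight through your pipeline, and actually see what a model update did to it.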
The relevance judgements should be created by you, by manual inspection. To start out, build a few topics with a few words each that are clearly distinct even to a human :). Then use these topics to generate documents using the LDA document generation model.
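The LDA generative process is simple enough to write down directly: draw a per-document topic mixture from a Dirichlet, then for each token draw a topic and then a word from that topic. Here is a minimal sketch; the vocabulary, the three block-structured topics, and all the hyperparameter values are my own toy choices, picked only so that the topics are obvious to a human.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy vocabulary: 3 human-distinguishable topics, 4 words each.
vocab = ["apple", "banana", "cherry", "grape",    # topic 0: fruit
         "dog", "cat", "horse", "mouse",          # topic 1: animals
         "car", "bus", "train", "plane"]          # topic 2: vehicles
n_topics, n_words = 3, len(vocab)

# Each topic puts almost all its mass on its own 4-word block.
topics = np.full((n_topics, n_words), 0.01)
for k in range(n_topics):
    topics[k, 4 * k: 4 * k + 4] = 1.0
topics /= topics.sum(axis=1, keepdims=True)

def generate_doc(doc_len=20, alpha=0.1):
    """LDA generative model: theta ~ Dirichlet(alpha), then
    for each token draw a topic z ~ theta and a word w ~ topics[z]."""
    theta = rng.dirichlet([alpha] * n_topics)      # document's topic mixture
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)          # topic for this token
        w = rng.choice(n_words, p=topics[z])       # word from that topic
        words.append(vocab[w])
    return words

docs = [generate_doc() for _ in range(30)]
print(docs[0])
```

A small alpha (here 0.1) makes each document concentrate on one or two topics, so you can judge relevance for a query like "fruit words" at a glance, which is exactly what you want when hand-building judgements.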
This approach has proven useful in my own experimentation and exploration, and I hope the readers of this blog will find it useful too.