Wednesday, October 27, 2010

Hadoop streaming and Amazon EMR

I have been attempting to use Hadoop streaming on Amazon EMR to do a simple word count over a bunch of text files. To get a handle on Hadoop streaming and on Amazon's EMR, I also took a very simplified data set: each text file had only one line of text in it (and the line could contain an arbitrarily large number of words).

The mapper is an R script that splits the line into words and spits them back to the stream.


I decided to use the LongValueSum aggregate reducer for adding the counts together, so I had to prefix my mapper output with "LongValueSum:" and specify the reducer to be "aggregate".
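The mapper script itself did not survive the formatting; here is a minimal sketch in Python of what such a mapper does (the original was an R script; the "LongValueSum:" prefix is what the "aggregate" reducer keys on):

```python
def map_line(line):
    """Emit one 'LongValueSum:<word>\t1' record per word on the line."""
    for word in line.strip().split():
        yield "LongValueSum:" + word + "\t1"

# In the actual streaming job the mapper loops over stdin:
#   import sys
#   for line in sys.stdin:
#       for record in map_line(line):
#           print(record)
```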

The questions I have now are the following:

1) The intermediate stage between mapper and reducer just sorts the stream; it does not really combine by key. Am I right? I ask this because if I do not use "LongValueSum" as a prefix on the words output by the mapper, the reducer just receives the stream sorted by key, but not aggregated. That is, I receive records ordered by K, as opposed to (K, list(values)), at the reducer. Do I need to specify a combiner in my command?
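To make question 1 concrete: my understanding is that a streaming reducer sees one sorted "key&lt;TAB&gt;value" line per record on stdin, not (K, list(values)), so grouping by key is the reducer's own job. A hypothetical sum-reducer sketch in Python:

```python
import itertools

def reduce_sorted(lines):
    """Group sorted 'key\tvalue' lines by key and sum the values."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)

# Sorted mapper output -> aggregated counts, one (key, total) per key.
```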

2) How are the other aggregate reducers used? I see a lot of other reducers/aggregators/combiners available.

How are these combiners and reducers specified in an Amazon EMR setup?

I believe an issue of this kind has been filed and fixed in Hadoop streaming, but I am not sure which version Amazon EMR is hosting, or in which version this fix became available.

3) How about custom input formats, record readers and record writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
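For question 3, my understanding (to be verified) is that the streaming jar does take fully-qualified Java class names for these options. A sketch of the invocation; the bucket paths and the choice of input format class here are just illustrations, not from any real job:

```shell
hadoop jar hadoop-streaming.jar \
  -input s3://mybucket/input \
  -output s3://mybucket/output \
  -mapper mapper.R \
  -reducer aggregate \
  -inputformat org.apache.hadoop.mapred.TextInputFormat
```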

Friday, October 22, 2010

R, Elastic MapReduce, Hive and the Distributed Text Mining Toolbox.... it's a whirlpool

After learning about MapReduce while solving a computer vision problem during my recent internship at Google, I felt like an enlightened person. I had already been using R for text mining. I wanted to use R for large-scale text processing and for experiments with topic modeling. I started reading tutorials and working through some of them. I will now put down my understanding of each of the tools:

1) R text mining toolbox: meant for local (client-side) text processing; it uses the XML library
2) Hive: Hadoop-interactive; provides the framework to call map/reduce and also provides the DFS interface for storing files on the DFS
3) RHIPE: R and Hadoop Integrated Programming Environment
4) Elastic MapReduce with R: a MapReduce framework for those who do not have their own clusters
5) Distributed Text Mining with R: an attempt to make the move from local to server-side processing seamless, from R-tm to R-distributed-tm

I have the following questions and confusions about the above packages:

1) Hive, RHIPE, and the distributed text mining toolbox require you to have your own cluster. Right?

2) If I have just one computer, how would DFS work in the case of Hive?

3) Are we facing the problem of duplication of effort across the above packages?

I am hoping to get insights into the above questions in the next few days.

Friday, September 17, 2010

Wireless troubles on Ubuntu

I had been struggling with wireless issues on Ubuntu 10.04 on my new Dell Studio laptop. I tried rfkill, but all in vain: the wireless was disabled and hard-blocked, and I was unable to enable it. I have now finally found a fix, thanks to this website.

The fix is to use this command:
rmmod dell_laptop

Wednesday, September 15, 2010

Parsing custom XML files using R

I have been trying to learn R, especially because I need to start using the text mining package in R to do loads of analysis on text data. A couple of datasets are already indexed in R, but I wanted to index my own datasets, written in custom XML formats. The tutorial is very useful, and I was able to get some success by writing my own XML readers and XML sources. The end goal is to have a corpus object (say myCorpus) that holds "PlainTextDocument"s, so that you get the following:

> class(myCorpus[[1]])
[1] "PlainTextDocument" "TextDocument" "character"

The XMLSource is something that reads the whole file and extracts a list of XML nodes, which is then processed by the XML reader (which works on each XMLNode in the list) to return a list of "TextDocument"s.

Here is the R code:

mySource <- function(x, encoding = "UTF-8")
    XMLSource(x,
              function(tree) xmlChildren(tree[[1]]$children$corpus),
              myXMLReader, encoding)
myXMLReader <- readXML(
    spec = list(Content = list("node", "/DOC"),
                id = list("node", "/DOC/DOCNO")),
    doc = PlainTextDocument())
myCorpus <- Corpus(mySource("/home/shivani/research/mycode/R/query.xml"))

class(myCorpus[[1]])

The thing I did not like about this method of creating an XMLSource is that it is extremely tailor-made; see xmlChildren(tree[[1]]$children$corpus). That line was arrived at by trial and error, trying different functions until one returned an XMLNode list. Also, even after all this monkeying around, I have not been able to get rid of the doc id in my parsed documents, so there is a mix of data in the finally extracted documents.

If anybody could suggest a way to extract the text in a document without also extracting the document number, that would be great.
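For comparison, outside of tm this separation is straightforward. A Python sketch on a made-up snippet with the same /DOC and /DOC/DOCNO structure (in ElementTree, the text that follows the closing DOCNO tag is the "tail" of the DOCNO element):

```python
import xml.etree.ElementTree as ET

sample = """<corpus>
  <DOC><DOCNO>1</DOCNO>emission obama dioxide</DOC>
  <DOC><DOCNO>2</DOCNO>wall street market stock dollar</DOC>
</corpus>"""

def split_docs(xml_text):
    """Return (docno, body) pairs, keeping the doc id out of the body."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("DOC"):
        docno = doc.find("DOCNO")
        yield docno.text, (docno.tail or "").strip()
```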

And here is the custom XML file (the blog software appears to have stripped the markup, leaving only the text content):

<?xml version="1.0"?>
emission obama dioxide
wall street market stock dollar
global greenhouse pollutants greenhouse dioxide
soldier field combat field
student student community
emission obama battle obama afghanistan field obama soldier soldier volumes emission
afghanistan traders field dollar iraq dollar iraq
emission obama soldier warming greenhouse obama carbon
dollar market dollar america
global dioxide

Thursday, April 29, 2010

Dreams of summer

So I got accepted to Google Inc., Mountain View, CA for a summer internship. Not only am I excited, but I am also encouraged. My last claim to fame happened in Aug 2008, and the long dry spell was killing me. Although this means a lot of work, it also means I am worth it :)

I was also thinking about the things I need to occupy my summer with, and here they are:

1. Google Internship: Do a good job at it
2. Develop C/C++ version of the Gibbs Sampler and look at implementation issues in a large-large database
3. Look into usability of MapReduce for Unigram/VSM/LDA-combined representation to come up with a really neat representation
4. Push out the IEEE-T-SE paper on the comparative study ... May 18
5. Push out the IEEE-T-KDE paper on Unigram + LDA ... May 18
6. Look at my Lemur-toolkit extensions and get back in touch with that code
7. Develop a GUI for retrieval engine and document existing matlab/c/c++ code

Can't wait for life to begin!!!

Thursday, March 18, 2010

Perl: Detecting two words concatenated into one, separated by an uppercase character

If you are given a word like 'textArea' and want to separate it into 'text Area', here is Perl code to help you achieve it:

my $item = 'textAreaBuffer';
if ($item =~ m/[a-z][A-Z]/) {
    # insert a space at every lowercase-to-uppercase boundary
    (my $split = $item) =~ s/([a-z])([A-Z])/$1 $2/g;
    print "$split\n";    # prints "text Area Buffer"
}

Wednesday, March 17, 2010

Vocabulary handling for software reuse

Vocabulary mismatch is an even more pronounced problem when dealing with a corpus made of source code. Several tools have been used in the past, and here are some of them that I will need to research and learn about:

1) Soundex: captures phonetic variations of words
2) Lexical affinity: a sliding-window (grep-like) tool that calculates how close two identifier names are
3) Separating words on "_": counter_activity -> counter activity
4) Separating words on case changes: LoadBuffer -> Load Buffer

One could additionally use WordNet, spelling correction, or missing-character handling to improve upon the existing techniques.
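Techniques 3 and 4 above are easy to prototype together; a small Python sketch (the function name is mine):

```python
import re

def split_identifier(name):
    """Split on underscores, then on lowercase-to-uppercase boundaries."""
    words = []
    for chunk in name.split("_"):
        words.extend(re.sub(r"([a-z])([A-Z])", r"\1 \2", chunk).split())
    return words

# split_identifier("counter_activity") -> ["counter", "activity"]
# split_identifier("LoadBuffer")       -> ["Load", "Buffer"]
```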

Sunday, February 28, 2010

Combining discriminative and generative training

This post is a continuation of the previous post, Targeting discrimination as opposed to similarity. In model learning, given labeled data, there is always a choice to make: discriminative or generative. There are pros and cons to either method, and usually a two-stage approach is used to combine the power of the two. I found a very recent paper, Multi-Conditional Learning, published in AAAI 2006, on combining the two at learning time. The likelihood function is now composed of the generative likelihood and the discriminative likelihood. This is possible only because we have labeled data. The discriminative-generative pair used for illustration is primarily CRF-MRF. The MRF is used to capture topic co-occurrences, while the word co-occurrences are captured by the latent topic model. By using an MCL regularizer (the discriminative component) in the learning algorithm, the model being learnt is actually a combination of MRF and CRF, thereby combining the power of the two. The authors claim huge performance gains in document retrieval on the 20 Newsgroups dataset.

Saturday, February 27, 2010

Review: IR application into software libraries: Concept graphs

I bumped into this paper recently: Source Code Retrieval using Conceptual Similarity, published in 2004 at RIAO. The main idea of the paper is to introduce concept graphs that represent software code at two levels of granularity, structured and unstructured. Concept graphs enforce relationships between the variable names (the unstructured information) based on the constructs of the programming language. Once the concept graphs are created, graph-matching algorithms are used to find similar code segments. The baseline is the Okapi IRS run on the unstructured information in the code. They also present a combined approach where the scores obtained from Okapi and from the concept graphs are used together to rank documents against queries (both are software segments). A corpus of size 2392 was used, with a subset of 25 documents separated out as queries. In order to show that the structure of the code plays an important role in matching code segments, these 25 documents were modified by performing transformations on the variable names etc. The results were reported using two measures: MRR and P@n, with n on the order of 5 for this corpus.

Tuesday, February 9, 2010

Targeting discrimination as opposed to similarity

Clustering mechanisms for data that is expected to follow continuous distributions use an objective function to measure how good a clustering of the documents is. Such an objective function is usually a ratio of two quantities: inter-cluster distance to intra-cluster distance. This ratio measures the power of the clustering algorithm to bring similar data samples together and to separate dissimilar ones. Fortunately, for Gaussian models this works out to a nice clean equation. No wonder people LOVE the Gaussian model.
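As a toy illustration of that objective, here is the inter/intra distance ratio on one-dimensional clusters (entirely my own sketch, not from any paper discussed here):

```python
def cluster_quality(clusters):
    """Inter-cluster over intra-cluster variance for 1-D clusters."""
    means = [sum(c) / len(c) for c in clusters]
    grand = sum(means) / len(means)
    inter = sum((m - grand) ** 2 for m in means) / len(means)
    intra = sum(sum((x - m) ** 2 for x in c) / len(c)
                for c, m in zip(clusters, means)) / len(clusters)
    return inter / intra

# Well-separated clusters score much higher than overlapping ones.
```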

When we are dealing with discrete data, the most likely distribution is the multinomial distribution. To add more control over the behavior of the model, one could use Dirichlet priors. The document clustering approaches in the discrete domain merely use EM or Gibbs sampling to fit the data optimally to the discrete model (especially a latent mixture model). There is no underlying objective function that says this discrete (latent) model should also be able to optimally separate one topic from another. Thus one could end up with two topics generating the exact same set of words.

This is definitely something that is lacking in the EM algorithms for fitting discrete data. Maybe there has been work done on this already, but I am yet to discover it.

Friday, January 29, 2010

TIME corpus... blah!!!

The TIME corpus is the most useless and "wrong" corpus that can ever be.

  • A little analysis of the corpus reveals that it is built off two main topics, and the rest of the topics are under-represented.
  • The number of words is really high compared to the number of documents; that's why the representation itself is very sparse.
  • Due to the high ratio of (num of words) to (num of documents), it is impossible to train a topic model on this corpus and expect it to represent the corpus well.
  • The qrels (relevance judgements) are INCORRECT. A visual exploration comparing words reveals no similarity (at least on a term-term basis) between the queries and the documents marked relevant vis-a-vis those queries.
So if you want to use the TIME corpus, use it at your own RISK.

Friday, January 8, 2010

The trouble with testing IRS

Testing an IRS can get so freaking hard. The vocabulary sizes can run up to a few tens of thousands, which means you are dealing with a very, very high-dimensional space. The term weights or term probabilities are extremely small; so small that you are dealing with values in log scale only. The relevance judgements for some queries are not reliable: the queries dig up documents that might seem relevant to you (as a programmer/user), but that the standardized judgements have not tagged. This can get increasingly frustrating, because you might be led to believe that there is a bug in your code. Changes made to the query or the documents in response to relevance feedback are hard to interpret and understand when dealing with tens of thousands of words and tens of thousands of documents.
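Working in log scale is manageable as long as sums of probabilities go through the log-sum-exp trick; a quick sketch:

```python
import math

def log_sum_exp(log_vals):
    """log(sum(exp(v))) computed without underflowing tiny probabilities."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Adding two probabilities of 1e-300 naively underflows toward 0.0 in
# linear space; in log space the sum stays accurate.
```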

To circumvent this problem, it is essential to first create your own teeny-tiny corpus of a few tens of documents and a few tens of words. If you wish to build topic models over your corpus, make sure that your document-term matrix is tall and thin (not sparse). If you are using basic Unigram and VSM kinds of approaches, a short and stout (sparse) matrix might do the magic.
The relevance judgements should be created by you, by manual inspection. To start out, build a few topics from a few words that are very distinct even to a human :). Use these topics to generate documents using the LDA document generation model.
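The generation step can be sketched quickly; this is a simplified version of the LDA generative process, and the topic vocabularies, probabilities and hyperparameters below are all made up for illustration:

```python
import random

def lda_generate(n_words, topics, alpha, rng):
    """Generate one document: draw topic proportions from Dirichlet(alpha)
    via normalized gammas, then draw each word topic-first."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in topics]
    theta = [g / sum(gammas) for g in gammas]
    doc = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]
        words, probs = zip(*sorted(topics[k].items()))
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

topics = [{"soldier": 0.5, "combat": 0.5},   # hypothetical "war" topic
          {"dollar": 0.5, "market": 0.5}]    # hypothetical "finance" topic
doc = lda_generate(20, topics, alpha=0.5, rng=random.Random(7))
```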

This has proven useful in my experimentation and explorations, and I hope readers of this blog will find it useful too.