Wednesday, October 27, 2010

Hadoop streaming and Amazon EMR

I have been attempting to use Hadoop streaming on Amazon EMR to do a simple word count over a bunch of text files. To get a handle on Hadoop streaming and on Amazon's EMR, I used a very simplified data set: each text file had only one line of text in it (the line could contain an arbitrarily large number of words).

The mapper is an R script that splits the line into words and writes them back to the stream, one per line:

cat(wordList[i], "\t1\n", sep = "")
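For completeness, here is a minimal self-contained sketch of such a mapper (the stdin loop and the whitespace-splitting rule are just one reasonable way to do it):

#! /usr/bin/env Rscript
# Sketch of a streaming mapper: read one line at a time from stdin,
# split it into words, and emit one "<word>\t1" record per word.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  wordList <- unlist(strsplit(line, "[[:space:]]+"))
  wordList <- wordList[nzchar(wordList)]   # drop empty tokens
  for (i in seq_along(wordList)) {
    cat(wordList[i], "\t1\n", sep = "")
  }
}
close(con)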

I decided to use the LongValueSum aggregate reducer to add the counts together, so I had to prefix my mapper output with LongValueSum:

cat("LongValueSum:",wordList[i],"\t1\n")

and specify the reducer to be "aggregate".
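A step of this shape can be submitted with Amazon's elastic-mapreduce command-line client with something like the following (the bucket and script paths are placeholders, and the exact flags may differ by client version):

elastic-mapreduce --create --stream \
  --input   s3n://my-bucket/input \
  --output  s3n://my-bucket/output \
  --mapper  s3n://my-bucket/scripts/mapper.R \
  --reducer aggregate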

The questions I have now are the following:

1) The intermediate stage between mapper and reducer just sorts the stream. It does not really combine by key. Am I right? I ask this because if I do not use "LongValueSum" as a prefix on the words output by the mapper, the reducer just receives the stream sorted by key, but not aggregated. That is, the reducer receives records ordered by K, as opposed to (K, list(Values)). Do I need to specify a combiner in my command?
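To make the question concrete, this is the kind of reducer I would have to write by hand if the stream is only sorted (my own sketch): because the input is ordered by key but not grouped, the reducer itself has to detect where one key ends and the next begins.

#! /usr/bin/env Rscript
# Sketch of a hand-rolled streaming reducer: input lines look like
# "<word>\t<count>", sorted by word but NOT grouped, so we sum counts
# until the key changes and then flush the finished key.
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- unlist(strsplit(line, "\t"))
  key <- parts[1]; count <- as.integer(parts[2])
  if (!is.null(current) && key != current) {
    cat(current, "\t", total, "\n", sep = "")   # flush the finished key
    total <- 0
  }
  current <- key
  total <- total + count
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)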

2) How are the other aggregate reducers used? I see a lot of other reducers/aggregators/combiners available at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

How are these combiners and reducers specified in an Amazon EMR setup?
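From the aggregate package docs linked above, the prefix before the ":" is what selects the aggregator, so as far as I can tell the other functions would be emitted from the mapper the same way as LongValueSum. A sketch (key, x, n, and v below are placeholders of my own):

# Placeholder key/values, for illustration only
key <- "someKey"; x <- 3.14; n <- 42L; v <- "someValue"

cat("DoubleValueSum:", key, "\t", x, "\n", sep = "")  # sum of double values
cat("LongValueMax:",   key, "\t", n, "\n", sep = "")  # running maximum
cat("LongValueMin:",   key, "\t", n, "\n", sep = "")  # running minimum
cat("UniqValueCount:", key, "\t", v, "\n", sep = "")  # count of distinct values
cat("ValueHistogram:", key, "\t", v, "\n", sep = "")  # histogram over values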

I believe an issue of this kind has been filed and fixed in Hadoop streaming, but I am not sure which version Amazon EMR is hosting, or in which version the fix is available:
https://issues.apache.org/jira/browse/HADOOP-4842

3) What about custom input formats, record readers, and record writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
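From what I have read (and I may be wrong), streaming does accept fully-qualified Java class names through flags such as -inputformat, with any custom jars shipped along with the job. For example, with a stock Hadoop class (the streaming jar path varies by distribution):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input /data/input \
  -output /data/output \
  -mapper mapper.R \
  -reducer aggregate \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -file mapper.R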

Friday, October 22, 2010

R, Elastic MapReduce, Hive, and the Distributed Text Mining Toolbox... it's a whirlpool

After having learned about MapReduce while solving a computer vision problem during my recent internship at Google, I felt like an enlightened person. I had already been using R for text mining, and I wanted to use R for large-scale text processing and for experiments with topic modeling. I started reading tutorials and working through some of them. Here is my understanding of each of the tools:

1) R text mining toolbox: meant for local (client-side) text processing; it uses the XML library
2) Hive: an interactive layer on top of Hadoop; it provides the framework to call map/reduce and also provides the interface for storing files on the DFS
3) RHIPE: R and Hadoop Integrated Programming Environment
4) Elastic MapReduce with R: a MapReduce framework for those who do not have their own clusters
5) Distributed Text Mining with R: an attempt to make the move from local to server-side processing seamless, from R-tm to R-distributed-tm

I have the following questions and confusions about the above packages:

1) Hive, RHIPE, and the distributed text mining toolbox require you to have your own cluster. Right?

2) If I have just one computer, how would DFS work in the case of Hive? (See the sketch after this list.)

3) Are we facing the problem of duplication of effort across the above packages?
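On question 2, my current reading is that Hadoop can run HDFS in "pseudo-distributed" mode on a single machine, and Hive then uses that local one-node DFS as if it were a real cluster. The relevant core-site.xml property would be something like this (the port is a conventional choice, not a requirement):

<!-- core-site.xml: point the default filesystem at a local single-node HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>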

I am hoping to get insights into the above questions in the next few days.