Wednesday, October 27, 2010

Hadoop streaming and Amazon EMR

I have been attempting to use Hadoop streaming on Amazon EMR to do a simple word count over a bunch of text files. To get a handle on Hadoop streaming and on Amazon's EMR, I also used a very simplified data set: each text file has only one line of text in it (the line can contain an arbitrarily large number of words).

The mapper is an R script that splits the line into words and writes each word back to the stream with a count of 1:

cat(wordList[i], "\t1\n", sep = "")  # sep = "" keeps cat() from adding a space before the tab

I decided to use the LongValueSum aggregate reducer to add the counts together, so I had to prefix each key in my mapper output with "LongValueSum:"

cat("LongValueSum:", wordList[i], "\t1\n", sep = "")  # sep = "" keeps spaces out of the key

and specify the reducer as "aggregate".
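Put together, a mapper along these lines might look like the sketch below. It is a minimal sketch, not the exact script: the file name mapper.R, the helper name emit_counts, and the rule of splitting on runs of whitespace are all assumptions made for illustration.

```r
#!/usr/bin/env Rscript
# mapper.R (hypothetical file name): reads lines from stdin and emits one
# "LongValueSum:<word>\t1" record per word for the aggregate reducer.

# emit_counts is an illustrative helper, not part of any Hadoop API.
emit_counts <- function(line) {
  # Split on runs of whitespace (an assumption) and drop empty tokens.
  wordList <- unlist(strsplit(line, "[[:space:]]+"))
  wordList <- wordList[nzchar(wordList)]
  # sep = "" keeps cat() from inserting spaces into the key.
  for (i in seq_along(wordList)) {
    cat("LongValueSum:", wordList[i], "\t1\n", sep = "")
  }
}

# Process stdin one line at a time until end of input.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  emit_counts(line)
}
close(con)
```

The work sits in a small function so it can be tried locally on a single string, for example emit_counts("the quick the"), before the script is handed to a streaming job.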

The questions I have now are the following:

1) The intermediate stage between the mapper and the reducer just sorts the stream; it does not really combine values by key. Am I right? I ask because if I do not use "LongValueSum" as a prefix on the words output by the mapper, the reducer just receives the stream sorted by key, not aggregated. That is, I receive records ordered by K, as opposed to (K, list(Values)). Do I need to specify a combiner in my command?
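To make question 1 concrete: as far as I can tell, a script used as a streaming reducer is handed plain key<TAB>value lines on stdin, sorted by key, and has to notice the key boundaries itself. A minimal sketch of such a reducer in R follows, assuming the mapper emits word<TAB>1 with no prefix. The helper name sum_sorted is made up for illustration, and reading all of stdin at once is a simplification.

```r
#!/usr/bin/env Rscript
# Hypothetical streaming reducer: input lines are "key\tvalue", sorted by key.

# sum_sorted is an illustrative helper, not a Hadoop API.
# It takes a character vector of sorted lines and returns "key\ttotal" lines.
sum_sorted <- function(lines) {
  lines <- lines[nzchar(lines)]  # ignore blank lines
  out <- character(0)
  current <- NULL
  total <- 0
  for (ln in lines) {
    parts <- strsplit(ln, "\t", fixed = TRUE)[[1]]
    key <- parts[1]
    value <- as.numeric(parts[2])
    if (!is.null(current) && key != current) {
      # Key changed: the run for the previous key is complete.
      out <- c(out, sprintf("%s\t%.0f", current, total))
      total <- 0
    }
    current <- key
    total <- total + value
  }
  # Flush the last run, if there was any input at all.
  if (!is.null(current)) out <- c(out, sprintf("%s\t%.0f", current, total))
  out
}

# Reads all of stdin for brevity; a real reducer would go line by line.
lines <- readLines(file("stdin"), warn = FALSE)
writeLines(sum_sorted(lines))
```

The grouping here relies entirely on the input being sorted, so that equal keys arrive next to each other.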

2) How are the other aggregate reducers used? I see a lot of other reducers/aggregators/combiners available at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

How are these combiners and reducers specified in an Amazon EMR setup?

I believe an issue of this kind has been filed and fixed in Hadoop streaming, but I am not sure which version Amazon EMR is hosting, or in which version the fix is available:
https://issues.apache.org/jira/browse/HADOOP-4842
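Coming back to the first part of question 2: judging by the aggregate package, the other aggregators seem to be selected the same way as LongValueSum, by a type prefix on the mapper's output key. The sketch below shows what that might look like, untested on EMR. The prefixes LongValueSum, LongValueMax and UniqValueCount come from that package; the helper names and the ids totalWords, longestWord and distinctWords are made up for illustration.

```r
# aggregate_record is an illustrative helper, not a Hadoop API. It formats
# mapper output lines as "<type>:<id>\t<value>".
aggregate_record <- function(type, id, value) {
  paste0(type, ":", id, "\t", value)
}

# emit_stats is also hypothetical: for one input line it builds records for
# a running word total, the longest word length, and a distinct-word count.
emit_stats <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  c(
    # Summed across all mapper output by the reducer.
    aggregate_record("LongValueSum", "totalWords", sprintf("%d", length(words))),
    # The reducer keeps only the largest value seen for this id.
    aggregate_record("LongValueMax", "longestWord", sprintf("%d", max(nchar(words)))),
    # One record per word; the reducer counts the unique values.
    aggregate_record("UniqValueCount", "distinctWords", words)
  )
}

writeLines(emit_stats("the quick brown the"))
```

If this reading is right, the command itself would not change between aggregators; only the prefix written by the mapper would.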

3) What about custom input formats, record readers and record writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
