Friday, October 22, 2010

R Elastic MapReduce Hive and Distributed Text Mining Toolbox.... its a whirlpool

After having learned about MapReduce for solving a computer vision problem for my recent internship at Google, I felt like an enlightened person. I had been using R for text mining already. I wanted to use R for large scale text processing and for experiments with topic modeling. I started reading tutorials and working on some of those. I will now put down my understanding of each of the tools:

1) R text mining toolbox: Meant for local (client side) text processing and it uses the XML library
2) Hive: Hadoop interative, provides the framework to call map/reduce and also provides the DFS interface for storing files on the DFS.
3) RHIPE: R Hadoop integrated environment
4) Elastic MapReduce with R: a MapReduce framework for those who do not have their own clusters
5) Distributed Text Mining with R: An attempt to make seamless move form local to server side processing, from R-tm to R-distributed-tm

I have the following questions and confusions about the above packages

1) Hive and RHIPE and the distributed text mining toolbox need you to have your own clusters. Right?

2) If I have just one computer how would DFS work in case of HIVE

3) Are we facing with the problem of duplication of effort with the above packages?

I am hoping to get insights on the above questions in the next few days

No comments:

Post a Comment