Thursday, May 1, 2014

Building Apache Spark Jars

Apache Spark seems to have taken over the Big Data world. Spark is an in-memory compute engine that can read its data from HDFS. Ever since its first release, Spark has received a great deal of attention from the Big Data community and now has a huge following among academics, researchers and industry practitioners. Unfortunately, the documentation has not kept pace with the code development (in my opinion). There is one issue in particular that took me a long time to figure out, and I am hoping this post will cut short the time others spend solving the same problem.

The Goal: I already have Spark set up on a set of machines. Let me call this the "lab" cluster. This lab cluster came pre-installed with Hadoop. I want to run Spark jobs (written in Scala) on this cluster. There are two ways to do it:
a) Run sbt run from the root of the sbt project that contains the Scala code.
b) Run the fat jar created by sbt assembly as follows:

    java -cp path-to-fat-jar MainClassName
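
For concreteness, here is a minimal sketch of the kind of Scala job I mean. The class name, master URL and HDFS paths below are placeholders of my own, and it assumes the Spark 0.9.x API:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // brings in reduceByKey on pair RDDs

    object WordCount {
      def main(args: Array[String]) {
        // Connect to the cluster's standalone master (placeholder URL)
        val sc = new SparkContext("spark://master:7077", "WordCount")
        // Classic word count over a file on HDFS (placeholder paths)
        sc.textFile("hdfs:///data/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///data/output")
        sc.stop()
      }
    }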

The problem: According to the guidelines in Spark's documentation, the sbt build needs to specify library dependencies on Spark and on the relevant Hadoop version as follows:

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version"
The issue with this is that the fat jar thus compiled contains libraries that are already present on my cluster. If there is any version difference between the Hadoop dependency in your build and the Hadoop installed on your cluster, you will run into the common client/server version mismatch error when the job tries to talk to HDFS.
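
A suggestion on my part (not something the Spark docs call out): rather than guessing, ask the cluster itself which Hadoop it runs, and pin the hadoop-client dependency to exactly that version:

    hadoop version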

The solution:    I finally found this great resource. There they specifically mention that you need to exclude the Spark as well as the Hadoop dependencies, so that the fat jar contains nothing but the code for your application. This can be done simply by appending the "provided" configuration to each of these library dependencies in your build.sbt or Build.scala (see the snippet below). With this change, watch how sbt assembly excludes all of the Spark, Akka and Hadoop dependencies. The final jar can be launched using the above-mentioned command without any issues.
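
For reference, here is what the dependency section looks like after the change, a sketch based on the same versions as above (substitute your cluster's actual Hadoop version):

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version" % "provided"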

Next Up:    While the Cloudera page (and the selflessness of a colleague at work) saved my day, I have now run into another problem. I would like to run this jar using Oozie (Hadoop's job scheduler), but then again, there is very limited documentation (except for maybe one hint). Hopefully I will figure out the puzzle and write another blog post to help folks out there who are struggling like me.