Friday, August 1, 2014

Adding code to your blogger blog

A big shout-out to the folks who wrote this awesome tool, which takes in code and formats it as HTML so that you can paste it into your blog:

http://codeformatter.blogspot.com/

Running Apache Spark Unit Tests Sequentially with Scala Specs2

Apache Spark has a growing community in the Machine Learning and Analytics world. One of the things that often comes up when developing with Spark is writing unit tests for functions that take in an RDD and return an RDD. There is the well-known Quantified blog post on Spark testing with FunSuite, which gives a great way to design the trait and then use it in your test classes. But it is a little outdated (written for Spark 0.6): for example, System.clearProperty("spark.master.port") refers to a property that no longer exists in Spark 1.0.1. Thankfully, the Spark Summit 2014 talk on "Spark Testing: Best Practices" is based on the latest version of Spark and names the right properties to clear, namely spark.driver.port and spark.hostPort. We also use the specs2 (Scala Specifications) and Mockito libraries for testing, so our trait looks a little different.

 import org.specs2.Specification
 import org.specs2.mock.Mockito
 import org.apache.spark.SparkContext

 trait SparkTests extends Specification {
   var sc: SparkContext = _

   def runTest[A](name: String)(body: => A): A = {
     // Clear the ports left behind by any previous test so the new
     // SparkContext can bind to fresh ones.
     System.clearProperty("spark.driver.port")
     System.clearProperty("spark.hostPort")
     sc = new SparkContext("local[4]", name)
     try {
       println("Running test " + name)
       body
     }
     finally {
       // Always stop the context and clear the ports again so the
       // next test starts from a clean slate.
       sc.stop()
       System.clearProperty("spark.driver.port")
       System.clearProperty("spark.hostPort")
       sc = null
     }
   }
 }

Your actual test class will extend this trait and contain the "sequential" keyword, so that the examples run one at a time:

 class LogPreprocessorSpec extends Specification with Mockito with ScalaCheck with SparkTests {  
  sequential  
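
The exact specs2 syntax for declaring examples depends on your specs2 version and spec style; below is a rough sketch in the acceptance style, where the class name, the word-count logic and the expectation are hypothetical and only meant to show how runTest and the shared sc fit together.

 import org.specs2.Specification
 import org.apache.spark.SparkContext._

 // Hypothetical example spec: the word-count logic is for illustration only.
 class WordCountSpec extends Specification with SparkTests {

   // "sequential" makes the examples run one after another; each one gets
   // its own SparkContext via runTest.
   def is = sequential ^ s2"""
     word count should count repeated words $countsRepeats
     """

   def countsRepeats = runTest("count repeated words") {
     val counts = sc.parallelize(Seq("a", "b", "a"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .collectAsMap()
     counts.get("a") must beSome(2)
   }
 }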

Last but not least, your build.sbt should contain the following:

 testOptions in Test += Tests.Argument("sequential")  
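
For reference, the test-related parts of a build.sbt for such a project might look roughly like the sketch below. The library versions here are assumptions from the Spark 1.0.1 era; match them to whatever your project actually uses.

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.0.1",
   "org.specs2" %% "specs2" % "2.3.12" % "test",
   "org.mockito" % "mockito-all" % "1.9.5" % "test"
 )

 // Tell the specs2 runner to execute specifications sequentially.
 testOptions in Test += Tests.Argument("sequential")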

Thursday, May 1, 2014

Building Apache Spark Jars

Apache Spark seems to have taken over the Big Data world. It is an in-memory computing framework that can read its data from HDFS. Ever since its first release, Spark has received much attention from the Big Data community and now has a huge following among academics, researchers and industry. Unfortunately, the documentation has not been keeping up with the pace of code development (in my opinion). There is one issue in particular that took me a long time to figure out, and I am hoping this post will cut short the time spent by others trying to solve the same problem.

The Goal: I already have Spark set up on a cluster of machines. Let me call this the "lab" cluster. This lab cluster came pre-installed with Hadoop. I want to run Spark jobs (written in Scala) on this cluster. There are two ways to do it:
a) Run sbt run from the root of the sbt project that contains the Scala code.
b) Run the fat jar created by sbt assembly as follows:

    java -cp path-to-fat-jar MainClassName
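
To produce that fat jar in the first place you need the sbt-assembly plugin; a minimal project/plugins.sbt would contain something like the line below (the version number is an assumption, pick the one that matches your sbt version).

 addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Running sbt assembly then writes the fat jar under your project's target directory.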

The problem: According to the guidelines in Spark's documentation, the sbt build needs to specify library dependencies on Spark and on the relevant Hadoop version as follows:

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version"
The issue with this is that the fat jar thus compiled contains libraries that are already present on my cluster. If there is any version difference between your Hadoop dependency and the Hadoop installed on your cluster, you will run into the common client version mismatch error.

The solution:    I finally found this great resource. There they specifically mention that you need to eliminate the Spark as well as the Hadoop dependencies, so that the fat jar contains nothing but the code for your application. This can be done simply by marking each of the library dependencies as "provided" in your build.sbt or Build.scala. With this change, watch how sbt assembly excludes all the Spark, Akka and Hadoop dependencies. The final jar can be launched using the above-mentioned command without any issues.
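
Concretely, the two dependency lines shown earlier would change to something like this (same versions as before, just marked as provided):

 libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"

 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version" % "provided"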

Next up:    While the Cloudera page (and the selflessness of a colleague at work) saved the day, I have now run into another problem. I would like to run this jar using Oozie (Hadoop's job scheduler), but once again there is very limited documentation (except for maybe one hint). Hopefully I will figure out the puzzle and write another post to help folks out there who are struggling like me.


Wednesday, March 26, 2014

Hive: Lessons learned

For the past few days I have been playing with Hive for some data analysis, and I wanted to put down what I learned.

a) Exporting data from Hive to CSV

If you are using Hue, it provides a convenient way to export results to CSV or Excel format. If not, you can use the following preamble before the SELECT statement (note that the path names a local directory that Hive writes into, not a single file):

INSERT OVERWRITE LOCAL DIRECTORY '/path/out.csv' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT file_id, file_type FROM file_metadatas

b) Hive does not allow "select" statements (subqueries) in the "where" clause. For example,

SELECT file_id, file_type
FROM file_metadatas
WHERE file_type IN (select types from allowed_file_types)

WILL NOT WORK in Hive, but will work in MySQL

Instead, one can use a join:

SELECT a.file_id, a.file_type
FROM file_metadatas a
JOIN allowed_file_types b
ON b.types = a.file_type

c) Date functions: Computing a date that is ~6 months behind the current date

DATE_SUB(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), 180)

d) Hive regex: Regular expressions can be used to extract parts of a text field. Here is an example:

Extracting the file extension: regexp_extract(file_name, '\\.[0-9a-zA-Z]+$', 0) extracts group 0 (i.e. the entire match) of the pattern against file_name.

e) String manipulation support: Strings can be manipulated in Hive. One such example is the "lower" function, which converts all characters to lowercase, e.g. lower(file_name).

Thursday, July 18, 2013

Matlab Tip: Changing the line width of all lines from the command line

Get the handle to the line using:
hline = findobj(gcf, 'type', 'line');
Then you can change some property for all the line objects:
set(hline,'LineWidth',3)
or just for some of them:
set(hline(1),'LineWidth',3)
set(hline(2:3),'LineStyle',':')
idx = [4 5];
set(hline(idx),'Marker','*')
This is not an original tip of mine; here is the original post.

Thursday, April 11, 2013

Surviving the PhD program

As I near the end of the longest phase of my academic life, I wanted to put together the resources that were useful in keeping me going during the frustrating times of a PhD program. I truly believe that keeping a positive emotional state and channeling our energies in the right direction is the key to getting through the program. I am grateful to have good support from friends and family, but the internet is also a great place to seek support in times of trouble. Here are some of the most useful links that I found.

  • Dissertation Writers Toolkit: A great resource for those who are procrastinating on writing. This webpage also contains tons of other helpful material, like a balanced-life chart, tools for staying organized, positive affirmations on writing, and so on.
  • Life is easier when you can laugh at yourself. Here are some daily affirmations for doctoral students. But I stayed away from PhD comics as much as I could.
  • The Thesis Whisperer is another useful blog that helped me realize I was not alone in some of my struggles; in fact, my struggles are perfectly normal for a PhD student.
  • Here are some productivity tricks for graduate students that I found useful. In fact, following one of the suggestions from this page, I purchased multiple chargers for my laptop so I could save time on getting started with my day.
  • The 3-month thesis is also a good resource for thesis writing.

Wednesday, April 3, 2013

Trailing Slash in rsync command

Just making a quick reference to the rsync manual on how to synchronize directories.

The command used to copy folders is as follows:
 
rsync -avz foo:src/bar /data/tmp
 
This command copies the directory bar into /data/tmp, meaning that at the end of this command you will have a /data/tmp/bar folder.
 
 
If you just want to sync the contents of the folders, then use a trailing slash, like this:
 
rsync -avz foo:src/bar/ /data/tmp 

Now only the contents of bar will be copied into the /data/tmp folder; you will not find a folder called bar in /data/tmp.