Friday, August 1, 2014

Running Apache Spark Unit Tests Sequentially with Scala Specs2

Apache Spark has a growing community in the machine learning and analytics world. One thing that often comes up when developing with Spark is how to unit test functions that take in an RDD and return an RDD. There is the well-known Quantifind blog post on Spark testing with FunSuite, which gives a great way to design a reusable trait and then mix it into test classes. But it is a little outdated (written for Spark 0.6): System.clearProperty("spark.master.port") is no longer a property that exists in Spark 1.0.1. Thankfully, the Spark Summit 2014 talk "Spark Testing: Best Practices" is based on the latest version of Spark and names the right properties to clear, namely spark.driver.port and spark.hostPort. We also use the Specs2 (Scala specifications) and Mockito libraries for testing, so our trait looks a little different.

 import org.specs2.Specification
 import org.specs2.mock.Mockito
 import org.apache.spark.SparkContext

 trait SparkTests extends Specification {
   var sc: SparkContext = _

   // Runs a test body against a fresh local SparkContext and
   // guarantees the context is torn down afterwards.
   def runTest[A](name: String)(body: => A): A = {
     // Clear any ports left behind by a previous context.
     System.clearProperty("spark.driver.port")
     System.clearProperty("spark.hostPort")
     sc = new SparkContext("local[4]", name)
     try {
       println("Running test " + name)
       body
     } finally {
       // Stop the context and clear the ports even if the test fails.
       sc.stop()
       System.clearProperty("spark.driver.port")
       System.clearProperty("spark.hostPort")
       sc = null
     }
   }
 }

Your actual test class will extend this trait and call sequential so that specs2 runs its examples one at a time:

 import org.specs2.ScalaCheck

 class LogPreprocessorSpec extends Specification with Mockito with ScalaCheck with SparkTests {
   sequential
   // ... test examples go here ...
 }
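
As an illustration of how runTest is used inside such a spec, here is a minimal sketch; the dedupe function, its input data, and the expected output are all hypothetical, standing in for whatever RDD-to-RDD function you are actually testing:

 import org.apache.spark.rdd.RDD

 // Hypothetical RDD-in/RDD-out function under test.
 def dedupe(input: RDD[String]): RDD[String] = input.distinct()

 // runTest hands the body a fresh local SparkContext via sc.
 def dedupeRemovesDuplicates = runTest("dedupe removes duplicates") {
   val input = sc.parallelize(Seq("a", "b", "a"))
   dedupe(input).collect().toSet must_== Set("a", "b")
 }

Since runTest returns whatever the body returns, the MatchResult produced by must_== flows straight back to specs2.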

Last but not least, your build.sbt will contain the following, which passes the sequential argument through to specs2:

 testOptions in Test += Tests.Argument("sequential")  
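
The setting above only tells specs2 to run the examples within a spec sequentially. If you also want sbt itself to stop running test classes in parallel, a common companion setting (not required by the setup above) is:

 parallelExecution in Test := false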
