Friday, October 24, 2014

Difference between Machine Learning and Data Science: Student Opportunity Lab #GHC14

I was invited (by Sally, a young aspiring data scientist) to run a Student Opportunity Lab (SOL) at the Grace Hopper Celebration (GHC) of Women in Computing (WiC) conference in Phoenix, AZ. SOLs are meant for networking between attendees and presenters. Each SOL session involves a large round table with two facilitators (who are believed to be experts or enthusiasts of the topic the session covers) and 10 attendees, and lasts 20 minutes. The attendees at the table can ask the facilitators any question related to the topic of discussion. After a session is over, there is a 10-minute break during which attendees can network with the facilitators or with each other and find another table (another topic of discussion). There are about 30-40 tables reserved for various topics, ranging from security and data science to career-related topics.

The topic of my table was "From Classroom to Industry: Learning to be a data scientist". Although my role at Box is that of a Software Engineer in Machine Learning, I identify myself as a data scientist who loves coding and implementing things. My excitement comes from applying well-known machine learning algorithms and using data analysis to gather insights from data in various domains. I completed my PhD in Applied Machine Learning a year ago (Dec 2013), and I have learned that there are vast differences between how academia views the field and how industry views it. My main goal for the SOL was to bridge this gap.

One of the most common questions I was asked was "What is the difference between Machine Learning and Data Science?" It seems like a silly question, but a year ago, when I was finishing my PhD, I would not have known the difference. Now, a year later, I see it. This is such a common question that it has a Quora page.

Machine Learning is just one of the tools used by data scientists to analyze data. Data Science is an applied interdisciplinary field, where there is so much more happening than just machine learning. First, applying machine learning to a problem in industry is very different from doing so in academia. Here are the differences.

In academia, when solving a machine learning problem, the datasets are mostly publicly available and define the problem well. So the goal is to solve the problem using various machine learning tools, or to improve upon existing tools for that problem. In industry, there are no publicly available datasets. If one is lucky, data of interest can be mined from logs or relational databases. Even then, most of the data is unlabeled. So even for a data classification problem, at best one can use an unsupervised or a semi-supervised algorithm. Unlike in academia, the domain of the problem in industry may be completely new and unknown to the research community, thereby leaving no prior literature to start from.

Second, the scale of problems in industry is at least 10 times that of problems in academia. This means infrastructure needs to be built around the machine learning problem. Big data is often stored in HDFS and accessed through SQL-like interfaces such as Hive (for a more detailed article on the infrastructure required to handle big data, see (insert link)). So a Data Scientist needs to know how to access this data from HDFS.

Third, the data we mine in industry is essentially unlabeled unless there are specific ways to log (record) the click behavior of users. While this feedback can be noisy, it's the best bet to gauge whether a model is making the right recommendations or predictions. In problems like churn analysis or lead generation, the labels come at the heavy cost of losing a customer or losing a lead. Thus Data Scientists also need to be able to mine users' click behavior or other metrics that gauge the success of a model. Data Scientists also need to be able to

Last but not least, Data Scientists may do a lot more when presenting their findings. In pure Machine Learning, there are predefined metrics on which one either shows improved performance or does not. In industry problems, these metrics may not be the be-all and end-all of the story one wishes to tell with data.

This Quora page has a good discussion on the topic. I had a great time meeting people from all walks of life and sharing my experiences with them. I wanted to use this session as a way to give back (or rather pay forward) to the GHC community by mentoring students, by being honest about my struggles with the transition from academia to industry, and by helping students watch out for pitfalls when applying for jobs or accepting job offers in the data analytics field. A secondary and unexpected outcome of this was that I learned about my own journey as a Machine Learning person. Although I am not there yet, I am an aspiring data scientist too, and I have learned and grown so much since I started my journey in this field in 2008.



Friday, October 10, 2014

Data Science Lightning Talks: #GHC14

The lightning talks at GHC's Data Science track were super fun and covered a wide range of topics. Although data scientists and machine learning folks already know most of these concepts, it's great to get a refresher. As a plus, the passion of the speakers was contagious.

Trusting User Annotations on the Web for Cultural Heritage Domains Presenter: Archana Nottamkandath (Vrije Universiteit Amsterdam)
Whenever labels or annotations for unlabeled data are acquired through crowdsourcing, one needs techniques to compute the quality of the annotations. The talk highlighted a very relevant application of the problem of 'gauging the quality of user-defined annotations'. Cultural heritage institutions are digitizing their painting collections and collecting annotations via crowdsourcing. In order to detect annotations of poor quality or malicious intent, features like the timestamp, the age of the user and geolocation were used. Some form of ground truth was generated using domain experts.
Only Trust your Data: Experiment, Measure and Ship! Presenter: Audrey Colle (Microsoft) Audreyc@microsoft.com
This was a really fun talk. No concepts here were new, but there were several takeaways from an experienced person involved in A/B testing. The first takeaway is the One Factor At a Time (OFAT) approach, where one tweaks only one parameter of the model at a time so that any change in the metric can be attributed to it. The second useful piece of advice was to run A/A tests along with A/B tests. An A/A test compares two groups that receive the same treatment, so it should show no real difference and helps sanity-check the A/B pipeline. The third piece of advice was to use more than one metric and to design metrics keeping the end user in mind. Last but not least, the speaker advised using a t-test to compare the metrics coming from A and B.
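As a rough illustration of the t-test advice, here is a minimal sketch in plain Scala. It uses Welch's variant of the t statistic, which is my choice and not necessarily what the speaker used, and the object name AbTestSketch is hypothetical. It only computes the statistic; turning it into a p-value would additionally need the t-distribution, which is left out here.

```scala
object AbTestSketch {
  // Sample mean.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Unbiased sample variance (divides by n - 1).
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  // Welch's t statistic for two independent samples with possibly unequal variances.
  def welchT(a: Seq[Double], b: Seq[Double]): Double = {
    val standardError = math.sqrt(variance(a) / a.size + variance(b) / b.size)
    (mean(a) - mean(b)) / standardError
  }
}
```

For an A/A test the two samples come from the same treatment, so the statistic should stay close to zero; a large absolute value there hints at a problem in the pipeline rather than a real effect.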
Making Sense of Nonsense: Graph Mining and Fraud Detection Presenter: Sarah Powers (Oak Ridge National Laboratory) 
In this talk I learned how a health care network can be represented as a graph and how fraud detection can be done on this network. In a health care network, nodes are insurance companies, hospitals and providers, and links indicate connections between them. The task is to discover patterns and connected components and figure out whether these indicate fraud. The speaker acknowledged challenges related to scaling the infrastructure, scaling the algorithms, assumptions and pattern discovery as the network data grows into Big Data.
Power of Polka Dots – Using Data to Tell Stories Presenter: Mridula (Mita) Mahadevan (Intuit Inc.) 
Another talk about the power of networks, this time more focused on the implementation details. The speaker talked about the back-end data pipeline (Web app ==> Kafka ==> ETL ==> Aggregate) used to create aggregates, and the REST API for real-time access to the model parameters.
Recipes of Your Favorite Restaurant Presenter: Yichi Zhang (University of Michigan - Ann Arbor) 
This was a fun end-to-end data science project developed by a student at Michigan. She used the Yelp review search API and the BigOven recipe API to collect data. She talked about her challenges in data collection and how she overcame them. Her insights were more entertaining than useful.

Big Data Anomaly Detection Using Machine Learning Algorithms Presenter: Yi Wang (Microsoft) 
A talk on anomaly detection in an industry setting. The approaches were developed and used by a services team to detect anomalies in their servers and services and to generate alerts. They analyzed tons of traffic data, aggregated in real time and fed into an anomaly detection system to create alerts.
Breaking Language Barriers: Recent Successes and Challenges in Machine Translation Presenter: Marine Carpuat (National Research Council Canada)
This talk was more of a primer on Machine Translation. The speaker progressed from using unigram approaches to create the translation dictionary and then went on to explain how bigram approaches would help in narrowing the search space by specifying the context.

Application of Advanced Machine Learning Techniques in Home Valuation Presenter: Sowmya Undapalli (Intel Corporation) 
This was yet another fun Kaggle-style project, in which the speaker used machine learning to build a DIY home valuation system. She demonstrated her code using IPython and the scikit-learn toolbox, and shared the various classification algorithms she had used. She had two data sets: a teeny-tiny one containing 300 data points from a single zip code in Tempe, AZ, and a bigger one with 24300 data points and 115 features, reduced to 20 variables. All was good, except for her findings. The fact that square footage has the biggest impact on the selling price of a house is a little too much of an anticlimax in my opinion. Still, a fun talk and a fun project to learn about.

The Power of Context in Real-World Data Science Applications #GHC14

The Data Science in Practical Applications session at the GHC Data Science track ranged from AI to fraud detection and much more. You can access the notes of the session here. I elaborate a little further on these talks and add my 2 cents.

"AI: Return to Meaning" by David  from Bridgewater Associates. 

AI has been a long-forgotten term, and it had gained a rather bad reputation before the advent of statistical Machine Learning. So this talk was refreshing in that it highlighted the contribution AI can make given the advancements in the fields of machine learning, Big Data and Data Science.

David started out by differentiating between AI and machine learning using a simple decision problem as an example. AI is traditionally a theory driven process, allowing people to make deductions or create rules. AI has been traditionally used as an interactive approach to answer questions. Statistical Machine Learning on the other hand creates "latent" rules by analyzing large amount of data using correlation analysis and optimization techniques.

David then went on to talk about how Statistical Machine learning has limitations. For one, when machine learning is used to make decisions, one loses the ability to interpret findings. In other words, it gets harder to figure out what specific variables help in decision making. Secondly, given that we trained our machine learning algorithm to solve a specific decision problem, the context is assumed in the feature extraction stage of the process. When the context changes a new machine learning algorithm needs to be trained and learned in order to answer the new decision problem.

David comes from a background at IBM, where he worked on the Watson architecture that won the Jeopardy! challenge. In order to play Jeopardy!, you not only need a bunch of decision-making algorithms at hand, you also need to be able to figure out the context in which the question was posed in order to select the specific machine learning problem. An interactive session with the computer based on AI approaches can help with narrowing down the context.

David then explained how winning Jeopardy! was a 4-year effort in developing NLP techniques and Machine Learning algorithms, using tons of data and lots of researchers and developers. This talk was a refresher on AI concepts and where AI continues to be relevant amidst all the advancements in Data Science.

SiftScience: Fraud Detection with Machine Learning, Katherine Loh

There are some data science or machine learning problems that are just evergreen in the industry. One such problem is fraud detection. According to Katherine Loh, fraud detection is not limited to detecting credit card fraud; it also covers detecting spamming users, fake sellers and abuse of promo programs. She seemed to have great knowledge of the work they did to build this end-to-end system for fraud detection at Sift Science.

Since fraud detection is a well-defined and extensively researched data science problem, the various stages of the solution (data collection, feature extraction, data modeling and accuracy reporting) have been laid out already. For example, since fraud detection is a binary classification problem, Naive Bayes and Logistic Regression are the go-to ML approaches to solving it. That's why Katherine talked about feature extraction and feature engineering as their 'secret sauce', as opposed to complex machine learning algorithms.

Since the goal is to detect fraud, no assumptions were made about the transactions that were not labeled as fraud. Customers mainly provided feedback on transactions by labeling fraudulent ones as "bad". They engineered over 1000 features from various sources and talked about how they provide a test and development environment that allows for continuous mining and engineering of new and useful features, so that the model can learn and grow with time and support custom features for different enterprises.

One of the things I loved about her talk was the detailed information she gave on the kind of features they mined and how they eyeballed results to gain a deeper understanding of the correlation between variables. She talked about how similar IP addresses within the same time period, or the same credit card used from geographically distributed locations, are indicators of fraud. The talk also impressed upon the audience that it is more beneficial to track a user's behavior as a temporal sequence than to treat it as a bag of isolated events. It was refreshing to see how well domain knowledge was incorporated into their algorithm, rather than blind feature extraction.
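To make the "same credit card from distant locations" idea concrete, here is a small sketch in plain Scala. The Event shape, the field names and the windowing rule are all hypothetical; the talk did not describe Sift Science's actual schema or feature code. It looks at each card's events in time order and reports the largest number of distinct cities seen inside any time window.

```scala
object FraudFeatureSketch {
  // Hypothetical event shape, invented for this illustration only.
  final case class Event(cardId: String, city: String, epochSeconds: Long)

  // For each card: the largest number of distinct cities seen inside any window of
  // `windowSeconds` that starts at one of the card's own events (scanned in time order).
  def maxCitiesPerWindow(events: Seq[Event], windowSeconds: Long): Map[String, Int] =
    events.groupBy(_.cardId).map { case (card, cardEvents) =>
      val sorted = cardEvents.sortBy(_.epochSeconds)
      val counts = sorted.map { start =>
        sorted
          .filter { e =>
            e.epochSeconds >= start.epochSeconds &&
            e.epochSeconds - start.epochSeconds <= windowSeconds
          }
          .map(_.city)
          .distinct
          .size
      }
      card -> counts.max
    }
}
```

A card that shows up in two cities within the same hour gets a value of 2 or more, which could then be fed to a classifier as one feature among many. Because the events are ordered by time per card, the feature reflects the sequence of behavior rather than an unordered bag of events.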

The overall architecture of their ML algorithm highlighted HBase as the data store.  SiftScience boasted a < 200ms (real-time) response time to test-data (new transactions). They have also developed a dashboard to help fraud analysts interpret results.

Overall a great talk. However, I had hoped that some time would be spent on dealing with common challenges like data corruption due to noise, mislabeled data, duplicate labels, wrong data and missing fields. I also wish more time had been spent on the size of their collection and the language used to prototype and implement new ideas. Katherine briefly mentioned online learning but provided no details of the approach.

New Techniques in Road Network Comparison

This is a topic I knew little about, and it was also the only presentation from academia. Road networks have been used by researchers to study the migration of birds, and by GPS devices to correct road maps based on the trajectories taken by cars. Comparing road networks helps in detecting changes in road trajectories and in comparing two competing reconstructed road networks against the ground truth. From there on, the presentation took an extremely sharp turn towards various metrics for comparing road networks and their strengths and weaknesses. The talk got so mathematical so quickly that most people (including me) were lost. Overall a great talk, but it could have been more general and educational.
The talk then gets quickly technical, and very research oriented, which is great, but one has to understand the audience at GHC. People are here to understand the overall idea of the

Perspectives

Attending the 3 talks on varied topics made me realize how Research and Development is about "standing on the shoulders of giants". Not all approaches will work on every single sub-problem; one needs a combination of approaches to build an effective ML solution. This is what differentiates academia and industry. In academia we are trying to prove that an algorithm is better than the one proposed by our peers who researched the topic before us. In industry, the goal is to use these algorithms as tools to solve a real-world ML problem, by first understanding the context (domain) and then using the relevant features and algorithms for that domain.


Friday, August 1, 2014

Adding code to your blogger blog

A big shout out to those who wrote this awesome tool that takes in code and formats it in html so that you can paste it in your blog

http://codeformatter.blogspot.com/

Running Apache Spark Unit Tests Sequentially with Scala Specs2

Apache Spark has a growing community in the Machine Learning and Analytics world. One of the things that often comes up when developing with Spark is unit testing functions that take in an RDD and return an RDD. There is the famous Quantified Blog post on Spark testing with FunSuite, which gives a great way to design the trait and then use it in our test classes. But it is a little outdated (written for Spark 0.6). In other words, the spark.master.port property that it clears with System.clearProperty("spark.master.port") no longer exists in Spark 1.0.1. Thankfully, the Spark Summit 2014 talk on "Spark Testing: Best Practices" is based on the latest version of Spark and has the right properties to set, namely spark.driver.port and spark.hostPort. We also used the specs2 (org.specs2, Scala specifications) and Mockito libraries for testing, so our trait looks a little different.

 import org.specs2.Specification
 import org.specs2.mock.Mockito
 import org.apache.spark.SparkContext

 // Mix this trait into a spec to get a fresh local SparkContext for each test.
 trait SparkTests extends Specification {
  var sc: SparkContext = _

  def runTest[A](name: String)(body: => A): A = {
   // Clear ports left over from a previous context so the new one starts clean.
   System.clearProperty("spark.driver.port")
   System.clearProperty("spark.hostPort")
   sc = new SparkContext("local[4]", name)
   try {
    println("Running test " + name)
    body
   } finally {
    // Always stop the context, even if the test body throws.
    sc.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
    sc = null
   }
  }
 }

Your actual test will extend this trait and contain the "sequential" keyword

 class LogPreprocessorSpec extends Specification with Mockito with ScalaCheck with SparkTests {
  sequential  // run the examples one after another
  // ... examples that call runTest go here ...
 }

Last but not least, your build.sbt will contain the following:

 testOptions in Test += Tests.Argument("sequential")  
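
The set-up and tear-down shape of runTest above can be tried without Spark. The sketch below is a stand-in only: FakeContext and runWithContext are hypothetical names invented for illustration, not Spark or specs2 classes. It shows why the finally block matters: the context is stopped even when a test body throws, so the next sequential test starts from a clean state.

```scala
object LoanPatternSketch {
  // Stand-in for SparkContext, purely for illustration; not a Spark class.
  final class FakeContext(val name: String) {
    var stopped = false
    def stop(): Unit = { stopped = true }
  }

  // Records every context ever created so the behaviour can be inspected afterwards.
  val created = scala.collection.mutable.ArrayBuffer.empty[FakeContext]

  // Same structure as runTest: create the resource, run the body, always clean up.
  def runWithContext[A](name: String)(body: FakeContext => A): A = {
    val ctx = new FakeContext(name)
    created += ctx
    try body(ctx)
    finally ctx.stop() // runs even when the body throws
  }
}
```

Unlike the trait above, this version hands the context to the body as an argument instead of storing it in a var; that is a design choice of the sketch, not something the original code does.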

Thursday, May 1, 2014

Building Apache Spark Jars

Apache Spark seems to have taken over the Big Data world. Apache Spark is an in-memory solution that reads data from HDFS. Ever since its first release, Spark has received much attention from the Big Data community, and it now has a huge following among academicians, researchers and the industry. Unfortunately, the documentation has not been keeping up with the code development (in my opinion). There is one issue in particular that took me a long time to figure out, and I am hoping this post will cut short the time spent by others trying to solve the same problem.

The Goal: I already have a Spark cluster set up on a set of machines. Let me call this the "lab" cluster. This lab cluster came pre-installed with Hadoop. I want to run Spark jobs (written in Scala) on this cluster. There are two ways to do it:
a) Run sbt run from the root of the sbt project that contains the Scala code.
b) Run the fat jar created by sbt assembly as follows:

    java -cp path-to-fat-jar MainClassName

The problem: According to the guidelines in Spark's documentation, the sbt build needs to declare library dependencies on Spark and on the relevant Hadoop version as follows:

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version"
The issue with this is that the fat jar thus built contains libraries that are already present on my cluster. If there is any version difference between your Hadoop dependency and the Hadoop installed on your cluster, you will run into a common client mismatch error.

The solution: I finally found this great resource. It specifically mentions that you need to eliminate the Spark as well as the Hadoop dependencies, and that the fat jar should contain nothing but the code for your application. This can be done simply by adding the "provided" scope at the end of each library dependency in your build.sbt or build.scala. With this change, watch how sbt excludes all Spark, Akka and Hadoop dependencies. The final jar can be launched using the above-mentioned command without any issues.
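
As a sketch of what that looks like, the two dependency lines from the problem statement would carry an extra "provided" at the end. The versions are the same placeholders as above; "your-hadoop-version" still has to be replaced with whatever your cluster runs.

```scala
// build.sbt (sketch): "provided" tells sbt-assembly to leave these jars out of the fat jar,
// because the cluster is expected to supply them at run time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "your-hadoop-version" % "provided"
```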

Next Up: While the Cloudera page (and the selflessness of a colleague at work) saved my day, I have now run into another problem. I would like to run this jar using Oozie (Hadoop's job scheduler), but then again, there is very limited documentation (except for maybe one hint). Hopefully I will figure out the puzzle and write another post to help folks out there who are struggling like me.


Wednesday, March 26, 2014

Hive: Lessons learned

The past few days I have been playing with Hive for some data analysis and I wanted to put down what I learned

a) Exporting data from hive to csv

If you are using hue, then it provides a convenient way to export to csv or excel format. But if not then you can use the following preamble before the select statement

INSERT OVERWRITE LOCAL DIRECTORY '/path/out.csv' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

b) Hive does not allow "select" statements in the "where" clause. For example,

SELECT file_id, file_type
FROM file_metadatas
WHERE file_type IN (select types from allowed_file_types)

WILL NOT WORK in Hive, but will work in MySQL

Instead one can use a join 

SELECT a.file_id, a.file_type
FROM file_metadatas a
JOIN allowed_file_types b
ON b.types = a.file_type

c) Date functions: Computing a date that is ~6 months behind the current date

DATE_SUB(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP( ))),180)

d) hive regex: Regular expressions can be used to extract parts of a text field. Here is an example
 
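Extracting the file extension: regexp_extract(file_name,'\\.[0-9a-zA-Z]+$',0) returns the part of file_name matched by the pattern. The last argument selects what is returned: 0 means the entire match, while 1 and above refer to the corresponding capture group in the pattern.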

e) String manipulation support: Strings can be manipulated in Hive. One such example is the "lower" function, which converts all characters to lowercase.
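
The regular expression from d) and the lowercasing from e) can be tried outside Hive. To my knowledge Hive's regexp_extract is backed by Java regular expressions, so the same pattern should behave the same way in plain Scala; treat that as an assumption and check against your Hive version. The object and method names below are hypothetical, and returning an empty string when nothing matches is a choice of this sketch.

```scala
object FileExtensionSketch {
  // Same pattern as the Hive example; in Hive the backslash is doubled inside the string literal.
  private val Extension = """\.[0-9a-zA-Z]+$""".r

  // Whole match (like index 0 in regexp_extract), or "" when the name has no extension.
  def extension(fileName: String): String =
    Extension.findFirstIn(fileName).getOrElse("")

  // Combines d) and e): extract the extension, then lowercase it.
  def normalizedExtension(fileName: String): String =
    extension(fileName).toLowerCase
}
```

Lowercasing the extension before grouping or joining avoids treating ".PDF" and ".pdf" as two different file types.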