Sunday, December 4, 2016

Scala idiosyncrasies

Scala is full of idiosyncrasies, and by writing some of them out here I am hoping to save someone else the time. I will keep updating this post as I find new ones:

Scala idiosyncrasy # 1:  Iterator.size is not reliable. 

Consider the following lines

  val a: Iterator[Int] = Iterator(1, 2, 3)  
  val b: List[Int] = List(1,2,3)  

A test kept failing because I tried to use Iterator.size in my code instead of List.size. Then I learned that Iterator.size counts the entries remaining by walking the iterator to its end, which leaves it exhausted, so a second call returns 0. A better way is to convert the Iterator to a List and then compute its size. Here is an example:

  object Solution extends App {
    // Iterator: size walks to the end, so the second call finds nothing left
    val a: Iterator[Int] = Iterator(1, 2, 3)
    println(" size of iterator a " + a.size)
    println(" size of iterator a " + a.size)
    // List: size can be asked any number of times
    val b: List[Int] = List(1, 2, 3)
    println(" size of list b " + b.size)
    println(" size of list b " + b.size)
  }

Running it prints:

  size of iterator a 3
  size of iterator a 0
  size of list b 3
  size of list b 3
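
Here is a minimal sketch of two ways around this, assuming the data fits in memory. The object and method names (IteratorSizeSketch, sizesViaList, countAndKeep) are made up for illustration; toList and duplicate are standard Iterator methods.

```scala
object IteratorSizeSketch {
  // Materialize the iterator once; a List can be measured any number of times.
  def sizesViaList(it: Iterator[Int]): (Int, Int) = {
    val xs = it.toList
    (xs.size, xs.size)
  }

  // duplicate returns two independent iterators over the same elements.
  // Counting one leaves the other intact. The original `it` must not be
  // touched again after this call.
  def countAndKeep(it: Iterator[Int]): (Int, List[Int]) = {
    val (forCounting, forUse) = it.duplicate
    (forCounting.size, forUse.toList)
  }
}
```

Converting to a List is the simpler choice when the elements are needed again anyway; duplicate is only worth it when the second pass should stay lazy.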

So there, I learned something about Scala today 

To be continued...

Monday, March 7, 2016

Foraying into the field of Machine Learning for Ed-Tech

It has been about 3 months since I joined LinkedIn. At LinkedIn, I have been working on relevance algorithms for Lynda. Lynda is an online-learning tool used by campuses and by small and large businesses to help students and employees learn software, creative, and business skills and achieve their personal and professional goals.

Online education has really taken off in the past few years, and several people predict that campuses may no longer be necessary. While I don't yet know enough to take a stance on these predictions, I am really impressed by the amount of research that has already gone into studying learner engagement, retention and the factors that affect them. LinkedIn is uniquely placed in this area as it has so much more structured information than other online-education providers like Coursera, Udemy, edX and SkillSoft. But I am not even worried about LinkedIn's role in this field yet. I am very curious and excited to learn about all the research that universities have done on understanding learners. Sadly, the work is widely scattered across statistics journals, machine learning conferences and education-research venues. The goal of this blog post is to list the venues where such work is published. Another useful exercise (saved for another blog post) would be to create a taxonomy of research done in this field. In fact, it might already be done. I found this great article about where research on MOOCs is headed. I was amazed to see a bunch of papers listed in this article that use LDA, HMMs and other machine learning models and analytics tools to make sense of learner engagement data.

Learning at Scale 
Learning Analytics and Knowledge 
Special Interest Group on Computer Science Education 

Among the universities doing research on MOOCs, most papers come from Stanford, UPenn, Harvard, UW-Madison, Simon Fraser University and the Open University of the Netherlands.

In the following blog posts, I will write more about research in this field.

Saturday, June 27, 2015

Friday night date with SparkSQL and Airpal: A stab at the Frecency algorithm

At my current company we have a pretty advanced analytics infrastructure and an operations group that stays on top of things. We have Kafka, Elasticsearch, HBase, Redshift, Spark (latest version) and Unravel. And yet data analysis is really hard because of the extremely slow MapReduce jobs that are launched every time we want to run even rudimentary queries.

A couple of weeks ago, I attended the Airbnb conference where they announced that they were open-sourcing Airpal. To my surprise, I found out just yesterday that my company has now installed Airpal on top of our HDFS/Hive cluster. Queries came back dramatically faster, and within 10 minutes I had exported all the data I needed for my analysis into CSV files.

Anyway, here goes the problem statement.

Problem: Users log in to your cloud storage service, and the service logs every time they upload, download, preview, or comment on a file. Each type of event is logged in a separate table, so there is a different table per event type, with each row looking like this:
 (fileid, userid, date,
Given this, the goal is to come up with a 'frequent and recent' (frecent) set of files accessed by the user.

Solution: The first step is to consolidate all this data into one feature table (let's call it the Access Table) that has the following fields

(fileid, userid, date, number of uploads, number of downloads, number of times commented, number of times previewed)

From this access table, we can compute the frecent files using two things: (a) the number of times the file was accessed in the past x days, and (b) a recency weight that is lowest for accesses that are x days old and highest for today's accesses. One can also specify the relative importance of each activity (upload, download, etc.).
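
The scoring described above can be sketched in plain Scala collections (no Spark) as follows. This is a sketch under my own assumptions and not the code from the GitHub repository: the row stores daysAgo instead of a date, the recency weight is a simple linear decay, and all names (FrecencySketch, Access, Weights, recencyWeight, frecentFiles) are hypothetical.

```scala
object FrecencySketch {
  // One row of the access table. daysAgo stands in for the date column
  // (0 = today); field names are illustrative.
  final case class Access(fileId: String, userId: String, daysAgo: Int,
                          uploads: Int, downloads: Int,
                          comments: Int, previews: Int)

  // Hypothetical relative importance of each activity.
  final case class Weights(upload: Double = 1.0, download: Double = 1.0,
                           comment: Double = 1.0, preview: Double = 1.0)

  // Linear decay: 1.0 for today, 1 / (x + 1) for an access x days old,
  // and 0.0 for anything outside the window.
  def recencyWeight(daysAgo: Int, windowDays: Int): Double =
    if (daysAgo < 0 || daysAgo > windowDays) 0.0
    else (windowDays - daysAgo + 1).toDouble / (windowDays + 1)

  // Sum the weighted activity per file for one user, highest score first.
  def frecentFiles(rows: Seq[Access], userId: String, windowDays: Int,
                   w: Weights = Weights(), topN: Int = 10): Seq[(String, Double)] =
    rows
      .filter(_.userId == userId)
      .groupBy(_.fileId)
      .map { case (fileId, rs) =>
        val score = rs.map { r =>
          val activity = w.upload * r.uploads + w.download * r.downloads +
            w.comment * r.comments + w.preview * r.previews
          activity * recencyWeight(r.daysAgo, windowDays)
        }.sum
        fileId -> score
      }
      .toSeq
      .filter { case (_, score) => score > 0.0 }
      .sortBy { case (_, score) => -score }
      .take(topN)
}
```

For example, with a 7-day window a single upload today scores 1.0, while three downloads from 7 days ago score 3 * 1/8 = 0.375, so the file touched today ranks first. An exponential decay would be an equally reasonable choice for the recency weight.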

While this algorithm is pretty easy to implement in SQL, I wanted to see how SparkSQL would do. Since I have a basic understanding of Spark and Scala, I decided to take a shot at Spark SQL to play with the data and to explore data frames. If you want the code and the data, you can find them on my GitHub page (Disclaimer: this is my first attempt at keeping track of my analysis using GitHub, so readability will not be perfect). Also, this code uses only the uploads and downloads data in the calculation.

PS: I just realized that it has been almost 9 months since my last post. Shame on me. I am sure I was never out of inspiration for the majority of this year and yet I did not put out any fun posts. Shame on me. 

Friday, October 24, 2014

Difference between Machine Learning and Data Science: Student Opportunity Lab #GHC14

I was invited (by a young aspiring data scientist, Sally) to run a Student Opportunity Lab (SOL) at the Grace Hopper Celebration (GHC) of Women in Computing (WiC) conference in Phoenix, AZ. SOLs are meant for networking between attendees and presenters. Each SOL session involves a large round table with two facilitators (who are believed to be experts on, or enthusiasts of, the session's topic) and 10 attendees, and lasts 20 minutes. The attendees at the table can ask the facilitators any question related to the topic of discussion. After a session is over, there is a 10-minute break during which attendees can network with facilitators or with each other and find another table (another topic of discussion). There are about 30-40 tables reserved for topics ranging from security and data science to career-related topics.

The topic of my table was "From Classroom to Industry: Learning to be a data scientist". Although my role at Box is that of a Software Engineer in Machine Learning, I identify myself as a data scientist who loves coding and implementing things. My excitement comes from applying well-known machine learning algorithms and using data analysis to gather insights from data in various domains. I completed my PhD in Applied Machine Learning a year ago (Dec 2013), and I have learned that there are vast differences between how academia views the field and how industry views it. My main goal for the SOL was to bridge this gap.

One of the most common questions I was asked was "What is the difference between Machine Learning and Data Science?". It seems like a silly question, but a year ago, when I was finishing my PhD, I would not have known the difference. Now, a year later, I see it. This is such a common question that it has a Quora page.

Machine Learning is just one of the tools data scientists use to analyze data. Data Science is an applied interdisciplinary field where there is much more happening than just machine learning. First, applying machine learning to a problem in industry is very different from doing so in academia. Here are the differences.

In academia, when solving a machine learning problem, the datasets are mostly publicly available and define the problem well. So the goal is to solve the problem using various machine learning tools or to improve upon existing tools. In industry, there are no publicly available datasets. If one is lucky, data of interest can be mined from logs or relational databases. Even then, most of the data is unlabeled, so even for a classification problem one can at best use an unsupervised or semi-supervised algorithm. Unlike in academia, the domain of the problem in industry may be completely new and unknown to the research community, thereby leaving no prior literature to start from.

Second, the scale of problems in industry is at least 10x that of problems in academia. This means infrastructure needs to be built around the machine learning problem. Big data is often stored in HDFS and accessed through SQL-like interfaces such as Hive (for a more detailed article on the infrastructure required to handle big data, see (insert link)). So a Data Scientist needs to know how to access this data from HDFS.

Third, the data we mine in industry is essentially unlabeled unless there are specific ways to log (record) the click behavior of users. While this feedback can be noisy, it's the best bet for gauging whether a model is making the right recommendations or predictions. In problems like churn analysis or lead generation, the labels come at the heavy cost of losing a customer or losing a lead. Thus Data Scientists also need to be able to mine users' click behavior or other metrics that gauge the success of a model. Data Scientists also need to be able to

Last but not least, Data Scientists may spend a lot more effort on presenting their findings. In pure Machine Learning, there are predefined metrics on which one either shows improved performance or does not. In industry problems, these metrics may not be the be-all and end-all of the story one wishes to tell with data.

This Quora page has a good discussion of the topic. I had a great time meeting people from all walks of life and sharing my experiences with them. I wanted to use this session as a way to give back (or rather pay forward) to the GHC community by mentoring students, by being honest about my struggles with the transition from academia to industry, and by helping students watch out for pitfalls when applying for jobs or accepting job offers in the data analytics field. A secondary and unexpected outcome was that I learned about my own journey as a Machine Learning person. Although I am not there yet, I am an aspiring data scientist too, and I have learned and grown so much since I started my journey in this field in 2008.

Friday, October 10, 2014

Data Science Lightning Talks: #GHC14

The lightning talks at GHC's Data Science track were super fun and covered a wide range of topics. Although data scientists and machine learning folks already know most of these concepts, it's great to get a refresher. As a plus, the passion of the speakers was contagious.

Trusting User Annotations on the Web for Cultural Heritage Domains Presenter: Archana Nottamkandath (Vrije Universiteit Amsterdam) 
Whenever labels or annotations for unlabeled data are acquired through crowdsourcing, one needs techniques to measure the quality of the annotations. The talk highlighted a very relevant application of the problem of 'gauging the quality of user-defined annotations'. Cultural heritage institutions are digitizing their painting collections and collecting annotations via crowdsourcing. To detect annotations of poor quality or malicious intent, features like timestamp, age of the user, and geolocation were used. Some form of ground truth was generated using domain experts.
Only Trust your Data: Experiment, Measure and Ship! Presenter: Audrey Colle (Microsoft)
This was a really fun talk. No concepts here were new, but there were several takeaways from an experienced person involved in A/B testing. The first takeaway is the One Factor At a Time (OFAT) approach, where one tweaks only one parameter of the model at a time. The second useful piece of advice was to run A/A tests along with A/B tests. An A/A test should show no significant difference between its two arms, which helps sanity-check the A/B pipeline. The third piece of advice was to use more than one metric and to design metrics with the end user in mind. Last but not least, the speaker advised using a t-test to compare the metrics coming from A and B.
Making Sense of Nonsense: Graph Mining and Fraud Detection Presenter: Sarah Powers (Oak Ridge National Laboratory) 
In this talk I learned how a health-care network can be represented as a graph and how fraud detection can be done on this network. In a health-care network, nodes are insurance companies, hospitals and providers, and links indicate connections between them. The task is to discover patterns and connected components and to figure out whether these indicate fraud or not. The speaker acknowledged challenges related to scaling the infrastructure, scaling the algorithms, assumptions and pattern discovery as the network data grows to become big data.
Power of Polka Dots – Using Data to Tell Stories Presenter: Mridula (Mita) Mahadevan (Intuit Inc.) 
Another talk about the power of networks, this time more focused on implementation details. The speaker described the back-end data pipeline (web app ==> Kafka ==> ETL ==> aggregation) used to create aggregates, and the REST API for real-time access to the model parameters.
Recipes of Your Favorite Restaurant Presenter: Yichi Zhang (University of Michigan - Ann Arbor) 
This was a fun end-to-end data science project developed by a student at Michigan. She used the Yelp review search API and the BigOven recipe API to collect data. She talked about her challenges in data collection and how she overcame them. Her insights were more entertaining than useful.
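
As a small aside on the A/B testing talk above, here is a rough sketch of the kind of t statistic one could compute to compare a metric between two arms. The talk did not say which variant of the t-test to use; this sketch assumes Welch's unequal-variance version and stops at the statistic itself (no p-value). The names ABTestSketch and welchT are my own.

```scala
object ABTestSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Unbiased sample variance (divides by n - 1).
  private def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  // Welch's t statistic for the difference in means of two samples.
  // Assumes at least two observations per arm and that the two arms
  // are not both constant (otherwise the standard error is zero).
  def welchT(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size > 1 && b.size > 1, "need at least two observations per arm")
    val standardError = math.sqrt(variance(a) / a.size + variance(b) / b.size)
    (mean(a) - mean(b)) / standardError
  }
}
```

In an A/A test both samples come from the same variant, so the statistic should hover around 0; a large absolute value there suggests a problem in the pipeline rather than a real effect.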

Big Data Anomaly Detection Using Machine Learning Algorithms Presenter: Yi Wang (Microsoft) 
A talk on anomaly detection in an industry setting. As is obvious, the approaches were developed and used by a services team to detect anomalies and generate alerts for their servers and services. They analyzed tons of traffic data, aggregated it in real time and fed it into an anomaly detector to create alerts.
Breaking Language Barriers: Recent Successes and Challenges in Machine Translation Presenter: Marine Carpuat (National Research Council Canada) 
This talk was more of a primer on Machine Translation. The speaker progressed from using unigram approaches to create the translation dictionary and then went on to explain how bigram approaches would help in narrowing the search space by specifying the context.

Application of Advanced Machine Learning Techniques in Home Valuation Presenter: Sowmya Undapalli (Intel Corporation) 
This was yet another fun Kaggle-style project where the speaker used machine learning to build a DIY home valuation system. She demonstrated her code using IPython and the scikit-learn toolbox and shared the various classification algorithms she had used. She had two data sets: a teeny-tiny one containing 300 data points from a single zip code in Tempe, AZ, and a bigger one with 24300 data points and 115 features, reduced to 20 variables. All was good, except for her findings. The fact that square footage has the biggest impact on the selling price of a house is a little too much of an anticlimax in my opinion. Still, a fun talk and a fun project to learn about.

The Power of Context in Real-World Data Science Applications #GHC14

The Data Science in Practical Applications session at the GHC Data Science track ranged from AI to fraud detection and much more. You can access the notes of the session here. I elaborate a little further on these talks and add my 2 cents.

"AI: Return to Meaning" by David  from Bridgewater Associates. 

AI has long been a forgotten term, and it had gained a significantly bad reputation before the advent of statistical Machine Learning. So this talk was refreshing in that it highlighted the contribution AI can make given the advancements in machine learning, big data and data science.

David started out by differentiating between AI and machine learning using a simple decision problem as an example. AI is traditionally a theory-driven process that allows people to make deductions or create rules, and it has traditionally been used as an interactive approach to answering questions. Statistical Machine Learning, on the other hand, creates "latent" rules by analyzing large amounts of data using correlation analysis and optimization techniques.

David then went on to talk about how Statistical Machine learning has limitations. For one, when machine learning is used to make decisions, one loses the ability to interpret findings. In other words, it gets harder to figure out what specific variables help in decision making. Secondly, given that we trained our machine learning algorithm to solve a specific decision problem, the context is assumed in the feature extraction stage of the process. When the context changes a new machine learning algorithm needs to be trained and learned in order to answer the new decision problem.

David comes from a background at IBM, where he worked on the architecture of Watson, which won the Jeopardy challenge. To play Jeopardy, you not only need a bunch of decision-making algorithms at hand, you also need to figure out the context in which the question was posed in order to select the specific machine learning problem. An interactive session with the computer based on AI approaches can help narrow down the context.

David then explained how winning Jeopardy was a 4-year effort of developing NLP techniques and machine learning algorithms, using tons of data and lots of researchers and developers. This talk was a refresher on AI concepts and on where AI continues to be relevant amidst all the advancement in Data Science.

SiftScience: Fraud Detection with Machine Learning, Katherine Loh

There are some data-science or machine-learning problems that are just evergreen in the industry. One such problem is fraud detection. According to Katherine Loh, fraud detection is not limited to detecting credit card fraud; it also covers detecting spamming users, fake sellers and abuse of promo programs. She seemed to have great knowledge of the work they did to build this end-to-end fraud detection system at Sift Science.

Since fraud detection is a well-defined and extensively researched data science problem, the various stages of the solution (data collection, feature extraction, data modeling and accuracy reporting) have been laid out already. For example, since fraud detection is a binary classification problem, Naive Bayes and Logistic Regression are the go-to ML approaches. That's why Katherine talked about feature extraction and feature engineering, rather than complex machine learning algorithms, as their 'secret sauce'.

Since fraud detection means detecting fraud, no assumptions were made about the transactions that were not labeled as fraud. Customers mainly provided feedback on transactions by labeling fraudulent ones as "bad". They engineered more than 1000 features from various sources and talked about how they provide a test and development environment for continuous mining and engineering of new and useful features, which lets the model learn and grow with time and allows custom features for different enterprises.

One of the things I loved about her talk was the detailed information she gave on the kinds of features they mined and how they eyeballed results to gain a deeper understanding of the correlation between variables. She talked about how similar IP addresses appearing during the same time period, or the same credit card being used from geographically distributed locations, are indicators of fraud. The talk also impressed upon the audience that it is more beneficial to track a user's behavior as a temporal sequence than to treat it as a bag of isolated events. It was refreshing to see domain knowledge incorporated so well into their algorithm, rather than blind feature extraction.
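
To make the two ideas above concrete, here is a rough sketch of what such features might look like. This is my own illustration and not Sift Science's implementation: the Event shape, the use of country as the notion of location, and the names FraudFeatureSketch, maxCountriesPerCard and timeline are all assumptions.

```scala
object FraudFeatureSketch {
  // Illustrative event shape; not Sift Science's actual schema.
  final case class Event(userId: String, cardId: String, ip: String,
                         country: String, epochSeconds: Long)

  // For each card, the largest number of distinct countries seen within
  // any window of windowSeconds that starts at one of its events.
  // Quadratic per card, which is fine for a sketch.
  def maxCountriesPerCard(events: Seq[Event],
                          windowSeconds: Long): Map[String, Int] =
    events.groupBy(_.cardId).map { case (card, es) =>
      val sorted = es.sortBy(_.epochSeconds)
      val best = sorted.map { start =>
        sorted
          .filter(e => e.epochSeconds >= start.epochSeconds &&
            e.epochSeconds - start.epochSeconds <= windowSeconds)
          .map(_.country)
          .distinct
          .size
      }.max
      card -> best
    }

  // One user's events ordered in time, so downstream features can look at
  // the sequence rather than at a bag of isolated events.
  def timeline(events: Seq[Event], userId: String): Seq[Event] =
    events.filter(_.userId == userId).sortBy(_.epochSeconds)
}
```

A card that shows up in two countries within the same hour would get a value of 2 here, which is the kind of signal a fraud analyst might want to eyeball before trusting it as a feature.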

The overall architecture of their ML algorithm highlighted HBase as the data store.  SiftScience boasted a < 200ms (real-time) response time to test-data (new transactions). They have also developed a dashboard to help fraud analysts interpret results.

Overall a great talk. However, I had hoped that some time would be spent on dealing with common challenges like data corruption due to noise, mislabeled data, duplicate labels, wrong data and missing fields. I also wish more time had been spent on the size of their collection and the language used to prototype and implement new ideas. Katherine briefly mentioned online learning but provided no details of the approach.

New Techniques in Road Network Comparison

This is a topic I knew little about, and it was also the only presentation from academia. Road networks have been used by researchers to study the migration of birds, and by GPS devices to correct road maps based on the trajectories taken by cars. Comparing road networks helps in detecting changes in road trajectories and in comparing two competing reconstructed road networks against the ground truth. From there on, the presentation took an extremely sharp turn towards various metrics for comparing road networks and their strengths and weaknesses. The talk got so mathematical so quickly that most people (including me) were lost. Overall a great talk, but it could have been more general and educational.
The talk then gets quickly technical, and very research oriented, which is great, but one has to understand the audience at GHC. People are here to understand the overall idea of the


Attending the 3 talks on varied topics made me realize how research and development is about "standing on the shoulders of giants". Not every approach will work on every single sub-problem; one needs a combination of approaches to build an effective ML solution. This is what differentiates academia and industry. In academia we are trying to prove that an algorithm is better than the one proposed by our peers who researched the topic before us. In industry, the goal is to use these algorithms as tools to solve a real-world ML problem, by first understanding the context (domain) and then using the features and algorithms relevant to that domain.

Friday, August 1, 2014

Adding code to your blogger blog

A big shout-out to those who wrote this awesome tool that takes in code and formats it as HTML so that you can paste it into your blog.