Friday, October 24, 2014

Difference between Machine Learning and Data Science: Student Opportunity Lab #GHC14

I was invited (by Sally, a young aspiring data scientist) to run a Student Opportunity Lab (SOL) at the Grace Hopper Celebration (GHC) of Women in Computing (WiC) conference in Phoenix, AZ. SOLs are meant for networking between attendees and presenters. Each SOL session involves a large round table with two facilitators (who are believed to be experts in, or enthusiasts of, the topic the session covers) and 10 attendees, and lasts 20 minutes. Attendees at the table can ask the facilitators any question related to the topic of discussion. After a session is over, there is a 10-minute break during which attendees can network with the facilitators or with each other and find another table (another topic of discussion). There are about 30-40 tables reserved for topics ranging from security and data science to career-related subjects.

The topic of my table was "From Classroom to Industry: Learning to be a data scientist". Although my role at Box is that of a Software Engineer in Machine Learning, I identify myself as a data scientist who loves coding and implementing things. My excitement comes from applying well-known machine learning algorithms and using data analysis to gather insights from data in various domains. I completed my PhD in Applied Machine Learning a year ago (Dec 2013), and I have learned that there are vast differences between how academia views the field and how industry views it. My main goal for the SOL was to bridge this gap.

One of the most common questions I was asked was "What is the difference between Machine Learning and Data Science?" It seems like a silly question, but a year ago, when I was finishing my PhD, I would not have known the difference. Now, a year later, I see it. This is such a common question that it has a Quora page.

Machine Learning is just one of the tools data scientists use to analyze data. Data Science is an applied, interdisciplinary field in which much more is happening than just machine learning. First, applying machine learning to a problem in industry is very different from doing so in academia. Here are the differences.

In academia, when solving a machine learning problem, the datasets are mostly publicly available and define the problem well. So the goal is to solve the problem using various machine learning tools, or to improve upon existing tools to solve that problem. In industry, there are typically no publicly available datasets. If one is lucky, the data of interest can be mined from logs or relational databases. Even then, most of the data is unlabeled. So even for a classification problem, at best one can use an unsupervised or a semi-supervised algorithm. Unlike in academia, the domain of the problem in industry may be completely new and unknown to the research community, thereby leaving no prior literature to start from.
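
To make the semi-supervised idea concrete, here is a minimal sketch of self-training with a nearest-centroid rule: a handful of labeled points seed the class centroids, and unlabeled points are adopted only when they sit clearly closer to one centroid than to the other. Everything here is an illustrative assumption rather than something from a particular project: the function names (`self_train`, `centroids`, `sq_dist`), the `margin` threshold, and the use of 2-D points are all hypothetical choices.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def centroids(labeled):
    """Average the labeled points of each class into one centroid per class."""
    by_class = {}
    for point, label in labeled:
        by_class.setdefault(label, []).append(point)
    return {
        label: (mean(p[0] for p in pts), mean(p[1] for p in pts))
        for label, pts in by_class.items()
    }


def self_train(labeled, unlabeled, rounds=5, margin=2.0):
    """Grow a small labeled set by pseudo-labeling confident unlabeled points.

    labeled:   list of ((x, y), label) pairs; assumes at least two classes.
    unlabeled: list of (x, y) points.
    margin:    hypothetical confidence threshold; a point is adopted only if
               the runner-up centroid is at least `margin` times farther away
               (in squared distance) than the nearest centroid.
    Returns (labeled_including_pseudo_labels, points_left_unlabeled).
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for p in pending:
            ranked = sorted(cents, key=lambda c: sq_dist(p, cents[c]))
            best, runner_up = ranked[0], ranked[1]
            if sq_dist(p, cents[runner_up]) >= margin * sq_dist(p, cents[best]):
                confident.append((p, best))
            else:
                rest.append(p)
        if not confident:
            break  # nothing new was adopted, so further rounds change nothing
        labeled.extend(confident)
        pending = rest
    return labeled, pending
```

With two seed labels such as `((0, 0), "a")` and `((10, 10), "b")`, points near either seed get pseudo-labeled, while a point halfway between them stays unlabeled. Leaving ambiguous points alone is the design choice that keeps a noisy guess from contaminating later rounds.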

Second, the scale of problems in industry is at least 10x that of problems in academia. This means infrastructure needs to be built around the machine learning problem. Big data is often stored in HDFS and accessed through SQL-like interfaces such as Hive (for a more detailed article on the infrastructure required to handle big data, see (insert link)). So a Data Scientist needs to know how to access this data from HDFS.
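
As a rough illustration of what "SQL-like access" looks like in practice, the sketch below runs an aggregate query over a small event log. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in, since a real Hive setup needs a cluster and a client library; the `events` table, its columns, and the `daily_event_counts` function are hypothetical names invented for this example. A query of similar shape (`GROUP BY` with `COUNT`) is the kind of thing one would write in Hive's query language.

```python
import sqlite3


def daily_event_counts(rows):
    """Summarize raw log rows into per-day, per-event counts using SQL.

    rows: iterable of (user_id, event_type, day) tuples (hypothetical schema).
    Returns a list of (day, event_type, n_events, n_users) tuples.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.execute(
        "CREATE TABLE events (user_id TEXT, event_type TEXT, day TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    query = """
        SELECT day,
               event_type,
               COUNT(*)                AS n_events,
               COUNT(DISTINCT user_id) AS n_users
        FROM events
        GROUP BY day, event_type
        ORDER BY day, event_type
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

The point of the sketch is only the workflow: the raw data stays in the store, and the data scientist pulls out a small aggregated table to analyze or model.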

Third, the data we mine in industry is essentially unlabeled unless there are specific ways to log (record) the click behavior of users. While this feedback can be noisy, it's the best bet for gauging whether a model is making the right recommendations or predictions. In problems like churn analysis or lead generation, the labels come at the heavy cost of losing a customer or losing a lead. Thus Data Scientists also need to be able to mine users' click behavior or other metrics that gauge the success of a model. Data Scientists also need to be able to

Last but not least, Data Scientists may do a lot more when it comes to presenting their findings. In pure Machine Learning, there are predefined metrics against which one can show whether performance has improved. In industry problems, these metrics may not be the be-all and end-all of the story one wishes to tell with data.

This Quora page has a good discussion on the topic. I had a great time meeting people from all walks of life and sharing my experiences with them. I wanted to use this session as a way to give back (or rather pay forward) to the GHC community by mentoring students, by being honest about my struggles with the transition from academia to industry, and by helping students watch out for pitfalls when applying for jobs or accepting job offers in the data analytics field. A secondary and unexpected outcome was that I learned about my own journey as a Machine Learning person. Although I am not there yet, I am an aspiring data scientist too, and I have learned and grown so much since I started my journey in this field in 2008.


