Friday, October 10, 2014

The Power of Context in Real-World Data Science Applications #GHC14

The Data Science in Practical Applications session in the GHC Data Science Track covered topics ranging from AI to fraud detection. You can access the notes of the session here. Below I elaborate a little further on these talks and add my two cents.

"AI: Return to Meaning" by David  from Bridgewater Associates. 

AI had become a long-forgotten term, and it had gained a significantly bad reputation before the advent of statistical machine learning. So this talk was refreshing in that it highlighted the contribution AI can make given the advancements in machine learning, big data, and data science.

David started out by differentiating between AI and machine learning, using a simple decision problem as an example. AI is traditionally a theory-driven process that lets people make deductions or create rules, and it has traditionally been used as an interactive approach to answering questions. Statistical machine learning, on the other hand, creates "latent" rules by analyzing large amounts of data using correlation analysis and optimization techniques.
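
To make the distinction concrete, here is a minimal sketch of my own (not from the talk). The loan-approval problem, the function names, and all numbers are hypothetical. The first function is a rule a person wrote down; the second derives a cutoff from labeled examples by minimizing misclassifications.

```python
# Hypothetical toy decision problem: should a loan application be approved?
# Nothing here comes from the talk; names and numbers are illustrative only.

def rule_based_decision(income, debt):
    """Theory-driven approach: a person states the rule explicitly."""
    return income >= 50_000 and debt / income <= 0.4


def learn_threshold(examples):
    """Data-driven approach: pick the income cutoff that best fits the data.

    `examples` is a list of (income, approved) pairs. The returned cutoff is
    a "latent" rule: nobody wrote it down, it falls out of an optimization
    (here, minimizing the number of misclassified examples).
    """
    candidates = sorted(income for income, _ in examples)
    best_cutoff, best_errors = None, len(examples) + 1
    for cutoff in candidates:
        # Predict "approved" whenever income is at or above the cutoff.
        errors = sum((income >= cutoff) != approved
                     for income, approved in examples)
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


history = [(20_000, False), (35_000, False), (48_000, True),
           (60_000, True), (90_000, True)]
learned_cutoff = learn_threshold(history)
```

The hand-written rule can be read and argued about directly, while the learned cutoff is only as good as the examples it was fit to.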

David then went on to talk about the limitations of statistical machine learning. For one, when machine learning is used to make decisions, one loses the ability to interpret the findings. In other words, it gets harder to figure out which specific variables drive a decision. Secondly, because a machine learning algorithm is trained to solve one specific decision problem, the context is baked in at the feature extraction stage. When the context changes, a new model has to be trained to answer the new decision problem.

David comes from a background at IBM, where he worked on the architecture of Watson, the system that won the Jeopardy! challenge. To play Jeopardy!, you not only need a bunch of decision-making algorithms at hand, you also need to figure out the context in which the question was posed in order to select the specific machine learning problem. An interactive session with the computer, based on AI approaches, can help narrow down the context.

David then explained that winning Jeopardy! was a four-year effort to develop NLP techniques and machine learning algorithms, using tons of data and the work of many researchers and developers. This talk was a refresher on AI concepts and on where AI continues to be relevant amidst all the advancements in data science.

SiftScience: Fraud Detection with Machine Learning, Katherine Loh

Some data science and machine learning problems are just evergreen in the industry. One such problem is fraud detection. According to Katherine Loh, fraud detection is not limited to detecting credit card fraud; it also covers detecting spamming users, fake sellers, and abuse of promo programs. She seemed to have great knowledge of the work they did to build this end-to-end system for fraud detection at Sift Science.

Since fraud detection is a well-defined and extensively researched data science problem, the various stages of the solution (data collection, feature extraction, data modeling, and accuracy reporting) have already been laid out. For example, since fraud detection is a binary classification problem, Naive Bayes and logistic regression are the go-to ML approaches. That's why Katherine talked about feature extraction and feature engineering, rather than complex machine learning algorithms, as their "secret sauce".
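
For readers who have not seen logistic regression used as a binary classifier, here is a minimal sketch in plain Python. It is not Sift Science's implementation: the two features (a scaled order rate and an IP/billing country mismatch flag), the toy data, and the function names are all assumptions of mine.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(weights, bias, features):
    """Probability that a transaction is fraudulent under the model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(rows, labels):
            error = fraud_probability(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average the gradient over the batch, then take one step downhill.
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Hypothetical features: [scaled orders in the last hour, IP/billing mismatch]
rows = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
        [0.8, 1.0], [0.9, 1.0], [0.7, 1.0]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = labeled "bad" (fraud)
weights, bias = train_logistic_regression(rows, labels)
```

The model itself is only a few lines, which fits the point of the talk: the hard part is deciding what goes into `rows`.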

Since the goal is to detect fraud, no assumptions were made about transactions that were not labeled as fraud. Customers mainly provided feedback on transactions by labeling the fraudulent ones as "bad". The team engineered more than 1,000 features from various sources. Katherine also talked about how they provide a test and development environment for continuously mining and engineering new and useful features, which lets the model learn and grow with time and allows custom features for different enterprises.

One of the things I loved about her talk was the detailed information she gave on the kinds of features they mined and how they eyeballed results to gain a deeper understanding of the correlation between variables. She talked about how similar IP addresses seen within the same time window, or the same credit card used from geographically distributed locations, are indicators of fraud. The talk also impressed upon the audience that it is more beneficial to track a user's behavior as a temporal sequence than to treat it as a bag of isolated events. It was refreshing to see how well domain knowledge was incorporated into their algorithm, rather than relying on blind feature extraction.
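
As an illustration of this kind of feature, here is a small sketch of my own that flags a card used from locations too far apart for the elapsed time. The event fields (`card`, `hour`, `lat`, `lon`), the speed threshold, and the sample events are hypothetical and were not part of the talk.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def impossible_travel_cards(events, max_speed_kmh=900.0):
    """Return cards whose consecutive uses imply an implausible travel speed.

    Events are grouped per card and ordered by time, so each card is treated
    as a temporal sequence rather than a bag of isolated events.
    """
    by_card = defaultdict(list)
    for event in events:
        by_card[event["card"]].append(event)

    flagged = set()
    for card, card_events in by_card.items():
        card_events.sort(key=lambda e: e["hour"])
        for prev, cur in zip(card_events, card_events[1:]):
            distance = haversine_km(prev["lat"], prev["lon"],
                                    cur["lat"], cur["lon"])
            elapsed = max(cur["hour"] - prev["hour"], 1e-6)  # avoid dividing by zero
            if distance / elapsed > max_speed_kmh:
                flagged.add(card)
    return flagged


events = [
    {"card": "A", "hour": 0.0, "lat": 40.71, "lon": -74.01},  # New York
    {"card": "A", "hour": 1.0, "lat": 51.51, "lon": -0.13},   # London
    {"card": "B", "hour": 0.0, "lat": 40.71, "lon": -74.01},  # New York
    {"card": "B", "hour": 5.0, "lat": 42.36, "lon": -71.06},  # Boston
]
flagged = impossible_travel_cards(events)
```

In a real system the output would more likely be a numeric feature (for example, the maximum implied speed) fed to the classifier rather than a hard flag.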

The overall architecture of their ML system highlighted HBase as the data store. Sift Science boasted a real-time response time of under 200 ms when scoring test data (new transactions). They have also developed a dashboard to help fraud analysts interpret results.

Overall, a great talk. However, I had hoped some time would be spent on common challenges such as data corruption due to noise, mislabeled data, duplicate labels, wrong data, and missing fields. I also wish more time had been spent on the size of their data collection and the languages used to prototype and implement new ideas. Katherine briefly mentioned online learning but provided no details of the approach.

New Techniques in Road Network Comparison

This is a topic I knew little about, and it was also the only presentation from academia. Road networks have been used by researchers to study the migration of birds, and by GPS devices to correct road maps based on the trajectories taken by cars. Comparing road networks helps in detecting changes in road trajectories and in comparing two competing reconstructed road networks against the ground truth. From there on, the presentation took an extremely sharp turn towards various metrics for comparing road networks and their strengths and weaknesses. The talk got so mathematical so quickly that most people (including me) were lost. Overall a great talk, but it could have been more general and educational.
The talk then got technical quickly and was very research oriented, which is great, but one has to understand the audience at GHC. People are here to understand the overall idea of the


Attending the three talks on varied topics made me realize how research and development is about "standing on the shoulders of giants". No single approach works on every sub-problem; one needs a combination of approaches to build an effective ML solution. This is what differentiates academia from industry. In academia, we try to prove that an algorithm is better than the ones proposed by peers who researched the topic before us. In industry, the goal is to use these algorithms as tools to solve a real-world ML problem, by first understanding the context (domain) and then using the features and algorithms relevant to that domain.
