
Friday, October 10, 2014

Data Science Lightning Talks: #GHC14

The lightning talks at GHC's Data Science track were super fun and covered a wide range of topics. Although data scientists and machine learning folks already know most of these concepts, it's great to get a refresher. As a plus, the passion of the speakers was contagious.

Trusting User Annotations on the Web for Cultural Heritage Domains Presenter: Archana Nottamkandath (Vrije Universiteit Amsterdam) 
Whenever labels/annotations for unlabeled data are acquired through crowdsourcing, one needs techniques to compute the quality of those annotations. The talk highlighted a very relevant application of this problem of gauging the quality of user-defined annotations: Cultural Heritage Institutions are digitizing their painting collections and collecting annotations via crowdsourcing. In order to detect annotations of poor quality or malicious intent, features like timestamp, age of the user, and geolocation were used. Some form of ground truth was generated using domain experts.
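As a rough illustration of this kind of quality scoring, here is a minimal sketch that trains a classifier on expert-labeled annotations using features like the ones mentioned; the feature names and data are invented, not from the talk.

```python
# Hypothetical sketch: scoring crowdsourced annotations with features like
# those mentioned in the talk (time spent, user account age, geolocation).
# The features and expert-labeled ground truth here are illustrative only.
from sklearn.linear_model import LogisticRegression

# Each row: [seconds_spent_on_annotation, account_age_days, distance_km_from_institution]
X_train = [
    [45.0, 700, 12.0],    # careful annotation from an established user
    [2.0,  1,   8000.0],  # suspiciously fast, brand-new account, far away
    [30.0, 365, 50.0],
    [1.5,  3,   6500.0],
]
y_train = [1, 0, 1, 0]  # 1 = expert judged the annotation good, 0 = poor/malicious

clf = LogisticRegression().fit(X_train, y_train)

# Score a new incoming annotation: probability that it is good
print(clf.predict_proba([[3.0, 2, 7000.0]])[0][1])
```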
Only Trust your Data: Experiment, Measure and Ship! Presenter: Audrey Colle (Microsoft) Audreyc@microsoft.com
This was a really fun talk; no concepts here were new, but there were several take-aways from an experienced person involved in A/B testing. The first take-away is the One Factor At a Time (OFAT) approach, where one tweaks only one parameter of the model at a time. The second useful piece of advice was to run A/A tests alongside A/B tests: an A/A test should show no significant difference between the groups, which helps test the sanity of the A/B pipeline. The third piece of advice was to use more than one metric, and to design metrics keeping the end user in mind. Last but not least, the speaker advised using a t-test to compare metrics coming from the A/B groups.
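For the curious, here is a minimal sketch of the A/A sanity check plus the t-test comparison on an A/B metric, using simulated data (the talk did not share code or numbers).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A/A test: both groups get the identical experience.
# A significant p-value here signals a broken experiment pipeline.
a1 = rng.normal(loc=5.0, scale=1.0, size=10_000)
a2 = rng.normal(loc=5.0, scale=1.0, size=10_000)
print("A/A p-value:", stats.ttest_ind(a1, a2).pvalue)  # expected to be large

# A/B test: the treatment group sees the new variant.
control   = rng.normal(loc=5.0,  scale=1.0, size=10_000)
treatment = rng.normal(loc=5.05, scale=1.0, size=10_000)
print("A/B p-value:", stats.ttest_ind(control, treatment).pvalue)
```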
Making Sense of Nonsense: Graph Mining and Fraud Detection Presenter: Sarah Powers (Oak Ridge National Laboratory) 
In this talk I learned how a health-care network can be represented as a graph and how fraud detection can be done on that network. In a health-care network, nodes are insurance companies, hospitals and providers, and linkages indicate connections between them. The approach is to discover patterns and connected components, and figure out whether these indicate fraud or not. The speaker acknowledged challenges related to scaling of infrastructure, scaling of algorithms, assumptions and pattern discovery as the network grows into BigData territory.
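A minimal sketch of the graph representation and connected-component discovery, with invented node names (the talk did not show code):

```python
# Toy model of the health-care network as a graph, with connected-component
# discovery via networkx. Nodes and edges are invented for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("InsurerA", "Hospital1"), ("Hospital1", "ProviderX"),  # one component
    ("InsurerB", "ProviderY"), ("ProviderY", "ProviderZ"),  # another component
])

# Small, unusually dense components with odd billing patterns could then
# be inspected as potential fraud rings.
for component in nx.connected_components(G):
    print(component)
```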
Power of Polka Dots – Using Data to Tell Stories Presenter: Mridula (Mita) Mahadevan (Intuit Inc.) 
Another talk about the power of networks, this time more focused on implementation details. The speaker described the back-end pipeline: the web app publishes events to Kafka, an ETL stage consumes them to create aggregates, and a REST API exposes the model parameters for real-time access.
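Purely as illustration, a sketch of what the Kafka consumption and aggregation stage might look like; the topic name, broker address and event schema are my assumptions, not details from the talk.

```python
# Hypothetical consume-and-aggregate stage of the described pipeline.
import json
from collections import Counter
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "webapp-events",                     # hypothetical topic name
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Running aggregates that a REST API could serve for real-time access.
event_counts = Counter()
for message in consumer:
    event_counts[message.value["event_type"]] += 1
```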
Recipes of Your Favorite Restaurant Presenter: Yichi Zhang (University of Michigan - Ann Arbor) 
This was a fun end-to-end data science project developed by a student at Michigan. She used the Yelp review search API and the BigOven recipe API to collect data. She talked about her challenges in data collection and how she overcame them. Her insights were more entertaining than useful.

Big Data Anomaly Detection Using Machine Learning Algorithms Presenter: Yi Wang (Microsoft) 
A talk on anomaly detection in an industry setting. As is obvious, the approaches were developed and used by a services team to detect anomalies and generate alerts for their servers and services. They aggregated tons of traffic data in real time and fed it into an anomaly detector to create alerts.
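The talk did not go into implementation, but a simple trailing-window z-score detector conveys the basic alerting idea; the thresholds and data here are invented.

```python
import numpy as np

def detect_anomalies(series, window=60, threshold=4.0):
    """Flag points more than `threshold` standard deviations away from the
    trailing-window mean."""
    series = np.asarray(series, dtype=float)
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            alerts.append(i)  # index of the anomalous observation
    return alerts

rng = np.random.default_rng(1)
traffic = list(100 + rng.normal(0, 5, 100)) + [480]  # sudden spike at the end
print(detect_anomalies(traffic))  # should flag the spike at index 100
```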
Breaking Language Barriers: Recent Successes and Challenges in Machine Translation by Marine Carpuat (National Research Council Canada) 
This talk was more of a primer on Machine Translation. The speaker started with unigram approaches for creating the translation dictionary and then went on to explain how bigram approaches help narrow the search space by specifying context.
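A toy example of the unigram-versus-bigram point: the bigram context disambiguates between candidate translations. The vocabulary and tables below are invented.

```python
# Unigram dictionary: French word -> candidate English translations
unigram = {"avocat": ["lawyer", "avocado"], "mange": ["eats"]}

# Bigram table: (previous English word, French word) -> preferred translation
bigram = {("eats", "avocat"): "avocado", ("the", "avocat"): "lawyer"}

def translate(french_words):
    english = []
    for w in french_words:
        prev = english[-1] if english else "<s>"
        # Prefer the context-aware choice; fall back to the first unigram candidate.
        english.append(bigram.get((prev, w), unigram[w][0]))
    return english

print(translate(["mange", "avocat"]))  # ['eats', 'avocado'] rather than 'lawyer'
```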

Application of Advanced Machine Learning Techniques in Home Valuation Presenter: Sowmya Undapalli (Intel Corporation) 
This was yet another fun Kaggle-style project, where the speaker used machine learning to build a DIY home valuation system. She demonstrated her code using IPython and the scikit-learn toolbox and shared the various classification algorithms she had used. She had two data sets: one teeny-tiny set containing 300 data points from a single zip code in Tempe, AZ, and a bigger set of 24,300 data points with 115 features, reduced to 20 variables. All was good, except for her findings. The fact that square footage has the biggest impact on the selling price of a house is a little too much of an anticlimax in my opinion. Still, a fun talk and a fun project to learn about.
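In the same spirit (though not her actual code or data), a minimal scikit-learn sketch that fits a regressor on synthetic housing data and reads off feature importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300
sqft  = rng.uniform(800, 3500, n)
beds  = rng.integers(1, 6, n)
age   = rng.uniform(0, 50, n)
# Synthetic prices dominated by square footage, mimicking her finding.
price = 150 * sqft + 5_000 * beds - 500 * age + rng.normal(0, 20_000, n)

X = np.column_stack([sqft, beds, age])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, price)

for name, importance in zip(["sqft", "beds", "age"], model.feature_importances_):
    print(f"{name}: {importance:.2f}")  # sqft should dominate
```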

The Power of Context in Real-World Data Science Applications #GHC14

The Data Science in Practical Applications session at the GHC Data Science Track ranged from AI to fraud detection and much more. You can access the notes of the session here. I elaborate a little further on these talks and add my two cents of perspective.

"AI: Return to Meaning" by David  from Bridgewater Associates. 

AI had become a long-forgotten term that gained a significant bad reputation before the advent of statistical Machine Learning. So this talk was refreshing in that it highlighted the contribution AI can make given the advancements in the fields of machine learning, BigData and Data Science.

David started out by differentiating between AI and machine learning using a simple decision problem as an example. AI is traditionally a theory-driven process, allowing people to make deductions or create rules, and it has traditionally been used as an interactive approach to answering questions. Statistical Machine Learning, on the other hand, creates "latent" rules by analyzing large amounts of data using correlation analysis and optimization techniques.
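A toy contrast of the two paradigms on the same decision problem (approving a loan); both the hand-written rule and the data are invented for illustration.

```python
# "AI" style: a hand-authored, interpretable rule derived from theory.
def approve_rule_based(income_k, debt_k):
    return income_k > 50 and debt_k / income_k < 0.4

# Statistical ML style: a rule induced from data, with the logic latent
# in learned coefficients rather than stated explicitly.
from sklearn.linear_model import LogisticRegression

X = [[60, 10], [30, 20], [90, 5], [40, 30]]  # [income_k, debt_k]
y = [1, 0, 1, 0]                             # historical approve/deny decisions
learned = LogisticRegression().fit(X, y)

print(approve_rule_based(70, 15))       # True, and we can say exactly why
print(learned.predict([[70, 15]])[0])   # 1, but the "why" lives in the weights
```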

David then went on to talk about the limitations of Statistical Machine Learning. For one, when machine learning is used to make decisions, one loses the ability to interpret findings; in other words, it gets harder to figure out which specific variables drive the decision. Secondly, given that we trained our machine learning algorithm to solve a specific decision problem, the context is baked in at the feature extraction stage of the process. When the context changes, a new machine learning model needs to be trained in order to answer the new decision problem.

David comes from a background at IBM, where he worked on the Watson architecture that won the Jeopardy challenge. In order to play Jeopardy, you not only need a bunch of decision-making algorithms at hand, you need to be able to figure out the context in which the question was posed in order to select the right model for the problem. An interactive session with the computer based on AI approaches can help with narrowing down the context.

David then explained how winning Jeopardy was a four-year effort in developing NLP techniques and Machine Learning algorithms, using tons of data, lots of researchers and developers. This talk was a refresher on AI concepts and where AI continues to be relevant amidst all the advancement in Data Science.

SiftScience: Fraud Detection with Machine Learning, Katherine Loh

There are some data-science or machine-learning problems that are just evergreen in the industry. One such problem is fraud detection. According to Katherine Loh, fraud detection is not limited to detecting credit card fraud, but also includes detecting spamming users, fake sellers, and abuse of promo programs. She clearly had deep knowledge of the work done at Sift Science to build this end-to-end fraud detection system.

Since fraud detection is a well-defined and extensively researched data science problem, the various stages of the solution, i.e. data collection, feature extraction, data modeling and accuracy reporting, have been laid out already. For example, since fraud detection is a binary classification problem, Naive Bayes and Logistic Regression are the go-to ML approaches. That's why Katherine talked about feature extraction and feature engineering as their 'secret sauce', as opposed to complex machine learning algorithms.
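A minimal sketch of that standard setup, with placeholder features and labels (Sift Science's actual features were of course not disclosed):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Toy engineered features: [transactions_last_hour, distinct_ips_last_day]
X = [[1, 1], [2, 1], [30, 12], [1, 2], [25, 9], [40, 15]]
y = [0, 0, 1, 0, 1, 1]  # 1 = transaction labeled fraudulent by the customer

for model in (GaussianNB(), LogisticRegression()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[28, 10]]))  # likely flagged
```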

Since the labels only mark detected fraud, no assumptions were made about the transactions that were not labeled as fraud. Customers mainly provided feedback on transactions, labeling fraudulent transactions as "bad". They engineered over 1,000 features from various sources, and talked about how they provide a test and development environment to allow for continuous mining and engineering of new and useful features, allowing the model to learn and grow with time and allowing for custom features for different enterprises.

One of the things I loved about her talk was the detailed information she gave on the kinds of features they mined, and how they eyeballed results to gain a deeper understanding of the correlations between variables. She talked about how similar IP addresses used within the same time window, or the same credit card used from geographically distant locations, are indicators of fraud. The talk also impressed upon the audience that it is more beneficial to track a user's behavior as a temporal sequence rather than treating it as a bag of isolated events. It was refreshing to see how domain knowledge was incorporated so well into their algorithm, rather than relying on blind feature extraction.
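As an illustration of such a temporal feature, here is a sketch of an "impossible travel" check for a single card; the haversine helper and speed threshold are my own choices, not theirs.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(events, max_speed_kmh=900):
    """events: time-ordered (timestamp_hours, lat, lon) for one card.
    Returns True if any hop implies faster-than-airliner travel."""
    for (t1, la1, lo1), (t2, la2, lo2) in zip(events, events[1:]):
        hours = max(t2 - t1, 1e-6)
        if haversine_km(la1, lo1, la2, lo2) / hours > max_speed_kmh:
            return True
    return False

# Same card in New York, then Moscow one hour later -> strong fraud signal.
print(impossible_travel([(0.0, 40.71, -74.01), (1.0, 55.76, 37.62)]))  # True
```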

The overall architecture of their ML system highlighted HBase as the data store. Sift Science boasted a real-time response of under 200ms on new transactions. They have also developed a dashboard to help fraud analysts interpret results.

Overall a great talk. However, I had hoped some time would be spent on dealing with common challenges like data corruption due to noise, mislabeled data, duplicate labels, wrong data, and missing fields. I also wish more time was spent on the size of their collection and the languages used to prototype and implement new ideas. Katherine briefly mentioned on-line learning but provided no details of the approach.

New Techniques in Road Network Comparison

This is a topic I knew little about, and also the only presentation from academia. Road networks have been used by researchers to study the migration of birds, and by GPS devices to correct roadmaps based on the trajectories taken by cars. Comparing road networks helps in detecting changes in road trajectories, and in comparing competing road networks reconstructed from data against the ground truth. From there the presentation took an extremely sharp turn into various metrics for comparing road networks and their strengths and weaknesses. The talk got so mathematical so quickly that most people (including me) were lost. The research orientation is great, but one has to understand the audience at GHC: people are there to understand the overall idea of the work. Overall a great talk, but it could have been more general and educational.
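To convey the flavor of the problem (not any metric from the talk), here is a crude sketch of one naive comparison: snap road-segment endpoints to a grid and measure edge overlap between a reconstructed network and the ground truth.

```python
def snap(point, cell=0.001):
    """Snap a (lat, lon) point to a coarse grid so nearby points match."""
    return (round(point[0] / cell), round(point[1] / cell))

def edge_overlap(edges_a, edges_b, cell=0.001):
    """Fraction of edges in network A that also appear (approximately) in B."""
    snapped_b = {frozenset((snap(p, cell), snap(q, cell))) for p, q in edges_b}
    hits = sum(
        frozenset((snap(p, cell), snap(q, cell))) in snapped_b for p, q in edges_a
    )
    return hits / len(edges_a)

# Invented coordinates: one ground-truth segment and its noisy reconstruction.
ground_truth   = [((33.42, -111.93), (33.43, -111.93))]
reconstruction = [((33.4201, -111.9299), (33.4299, -111.9301))]
print(edge_overlap(reconstruction, ground_truth))  # 1.0 at this grid size
```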

Perspectives

Attending these three talks on varied topics made me realize how Research and Development is about "standing on the shoulders of giants". Not all approaches will work on every single sub-problem; one needs a combination of approaches to build an effective ML solution. This is also what differentiates academia and industry. In academia, we try to prove that an algorithm is better than the ones proposed by peers who researched the topic before us. In industry, the goal is to use these algorithms as tools to solve a real-world ML problem, by first understanding the context (domain) and then using relevant features/algorithms for that domain.