Friday, October 10, 2014

Data Science Lightning Talks: #GHC14

The lightning talks at GHC's Data Science track were super fun and covered a wide range of topics. Although data scientists and machine learning folks already know most of these concepts, it's great to get a refresher. As a plus, the passion of the speakers was contagious.

Trusting User Annotations on the Web for Cultural Heritage Domains Presenter: Archana Nottamkandath (Vrije Universiteit Amsterdam)
Whenever labels/annotations for unlabeled data are acquired through crowdsourcing, one needs techniques to compute the quality of those annotations. The talk highlighted a very relevant application of the problem of gauging the quality of user-defined annotations: cultural heritage institutions are digitizing their painting collections and collecting annotations via crowdsourcing. To detect annotations of poor quality or malicious intent, features like the timestamp, the age of the user account, and geolocation were used. Some form of ground truth was generated using domain experts.
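To make the idea concrete, here is a minimal sketch of scoring a crowdsourced annotation from such features. The specific features, thresholds, and weights below are my own invention for illustration, not the presenter's actual model:

```python
# Toy annotation-quality scorer (hypothetical features and weights,
# not the presenter's actual model). Each crowdsourced annotation is
# scored using behavioral signals plus expert ground truth.

def quality_score(annotation):
    score = 0.0
    # Very fast annotations are often low-effort or automated.
    if annotation["seconds_spent"] >= 5:
        score += 0.4
    # Older user accounts tend to be more trustworthy.
    if annotation["account_age_days"] >= 30:
        score += 0.3
    # Agreement with an expert-labeled gold item, when available.
    if annotation.get("matches_gold"):
        score += 0.3
    return score

good = {"seconds_spent": 42, "account_age_days": 400, "matches_gold": True}
spam = {"seconds_spent": 1, "account_age_days": 2, "matches_gold": False}
```

In practice such a score would be thresholded or fed into a trust model, with the expert-labeled items supplying the ground truth mentioned in the talk.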
Only Trust your Data: Experiment, Measure and Ship! Presenter: Audrey Colle (Microsoft)
This was a really fun talk; no concepts here were new, but it offered several takeaways from an experienced practitioner of A/B testing. The first takeaway is the One Factor At a Time (OFAT) approach, where only one factor of the model is tweaked per experiment. The second piece of advice was to run A/A tests alongside A/B tests: an A/A test should show no difference between its two arms, which helps sanity-check the A/B pipeline. Third, use more than one metric, and design metrics with the end user in mind. Last but not least, the speaker advised using a t-test to compare the metrics coming from the A/B arms.
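The t-test and A/A advice can be sketched with a stdlib-only Welch's t statistic; the metric values below are made up:

```python
from statistics import mean, variance

def welch_t(a, b):
    # Welch's t statistic for two independent samples with
    # possibly unequal variances.
    na, nb = len(a), len(b)
    se = (variance(a) / na + variance(b) / nb) ** 0.5
    return (mean(a) - mean(b)) / se

control   = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1]    # arm A metric values
treatment = [10.6, 10.4, 10.7, 10.5, 10.8, 10.3]  # arm B metric values

t_ab = welch_t(treatment, control)  # large |t| suggests a real effect
t_aa = welch_t(control, control)    # A/A comparison should be ~0
```

A large |t| on the A/B comparison suggests a genuine effect, while a nonzero t on an A/A comparison of identically treated traffic would flag a broken pipeline.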
Making Sense of Nonsense: Graph Mining and Fraud Detection Presenter: Sarah Powers (Oak Ridge National Laboratory) 
In this talk I learned how a health-care network can be represented as a graph and how fraud detection can be done on this network. In a health-care network, the nodes are insurance companies, hospitals and providers, and the edges indicate connections between them. The approach is to discover patterns and connected components, then figure out whether these indicate fraud. The speaker acknowledged challenges around scaling the infrastructure and the algorithms, and around the assumptions and pattern discovery, as the network data grows to Big Data scale.
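A minimal sketch of the connected-components step on a toy claims graph; the entities and edges here are invented, and a real system would use a scalable graph framework rather than this BFS:

```python
from collections import deque

def connected_components(graph):
    # graph: adjacency dict {node: set of neighbor nodes}
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                       # breadth-first search
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Toy health-care graph: providers P*, hospital H1, insurer I1.
graph = {
    "P1": {"H1"}, "H1": {"P1", "I1"}, "I1": {"H1"},
    "P2": {"P3"}, "P3": {"P2"},  # an isolated cluster worth a closer look
}
comps = connected_components(graph)
```

Small components disconnected from the main insurer/hospital network, like the P2-P3 cluster above, are the kind of structural pattern an analyst would then inspect for fraud.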
Power of Polka Dots – Using Data to Tell Stories Presenter: Mridula (Mita) Mahadevan (Intuit Inc.) 
Another talk about the power of networks, this time more focused on implementation details. The speaker walked through the back-end pipeline, Web app ==> Kafka ==> ETL ==> aggregation, used to create aggregates, and the REST API that provides real-time access to the model parameters.
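A rough in-memory sketch of that pipeline shape, with a plain deque standing in for Kafka and a Counter as the aggregate store; the event fields and stage boundaries are my own, not the speaker's:

```python
from collections import deque, Counter

# Stage 1: the web app emits events onto a queue (stand-in for Kafka).
queue = deque()
for event in [{"user": "a", "action": "click"},
              {"user": "b", "action": "view"},
              {"user": "a", "action": "click"}]:
    queue.append(event)

# Stage 2: ETL consumes events from the queue...
aggregates = Counter()
while queue:
    event = queue.popleft()
    # Stage 3: ...and rolls them up into aggregates.
    aggregates[event["action"]] += 1

# Stage 4: a REST API would serve `aggregates` for real-time access.
```

The point of the shape is decoupling: the web app only writes to the queue, the ETL job only reads from it, and the API layer only reads the aggregate store.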
Recipes of Your Favorite Restaurant Presenter: Yichi Zhang (University of Michigan - Ann Arbor) 
This was a fun end-to-end data science project developed by a student at Michigan. She used the Yelp review search API and the BigOven recipe API to collect data, and talked about her challenges in data collection and how she overcame them. Her insights were more entertaining than useful.

Big Data Anomaly Detection Using Machine Learning Algorithms Presenter: Yi Wang (Microsoft) 
A talk on anomaly detection in an industry setting. As is obvious, the approaches were developed and used by a services team to detect anomalies on their servers and services. They aggregate tons of traffic data in real time and feed it into an anomaly detector to create alerts.
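One common baseline for this kind of alerting, not necessarily what the team used, is flagging points that sit far from the mean of the aggregated series; the traffic numbers below are made up:

```python
from statistics import mean, stdev

def anomalies(series, threshold=2.0):
    # Flag points more than `threshold` standard deviations from
    # the mean of the series (a simple z-score baseline; a lone
    # large spike inflates the stdev, so the threshold is modest).
    m, s = mean(series), stdev(series)
    return [x for x in series if abs(x - m) > threshold * s]

# Requests-per-minute with one obvious spike.
traffic = [120, 118, 125, 122, 119, 121, 950, 123, 120]
alerts = anomalies(traffic)
```

A production system would compute the statistics over a sliding window and use a more robust baseline (e.g. median-based), but the alerting idea is the same.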
Breaking Language Barriers: Recent Successes and Challenges in Machine Translation by Marine Carpuat (National Research Council Canada) 
This talk was more of a primer on machine translation. The speaker started with unigram approaches to building a translation dictionary and then explained how bigram approaches help narrow the search space by specifying the context.
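The unigram-vs-bigram point can be illustrated with a toy French-to-English lookup; the dictionary entries are invented for illustration and are far simpler than a real phrase table:

```python
# A unigram dictionary maps a word to all its senses, which is
# ambiguous: French "avocat" means both "lawyer" and "avocado".
unigram = {"avocat": ["lawyer", "avocado"]}

# A bigram dictionary keys on (previous word, word), so the
# context narrows the search space to one sense.
bigram = {("un", "avocat"): "a lawyer",
          ("mange", "avocat"): "eats an avocado"}

def translate(prev_word, word):
    # Prefer the context-aware bigram entry; fall back to the
    # first unigram sense when no bigram is known.
    if (prev_word, word) in bigram:
        return bigram[(prev_word, word)]
    return unigram[word][0]

t1 = translate("mange", "avocat")  # "mange" (eats) implies the fruit
t2 = translate("un", "avocat")     # "un avocat" reads as a person
```

Real statistical MT systems score many candidate phrases with a language model rather than doing exact lookups, but the role of context is the same.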

Application of Advanced Machine Learning Techniques in Home Valuation Presenter: Sowmya Undapalli (Intel Corporation) 
This was yet another fun Kaggle-style project, in which the speaker used machine learning to build a DIY home-valuation system. She demonstrated her code using IPython and the scikit-learn toolbox and shared the various classification algorithms she had used. She had two data sets: one teeny-tiny one containing 300 data points from a single zip code in Tempe, AZ, and a bigger one with 24,300 data points and 115 features, reduced to 20 variables. All was good, except for her findings. The fact that square footage has the biggest impact on the selling price of a house is a little too much of an anticlimax, in my opinion. Still, a fun talk and a fun project to learn about.
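Her square-footage finding is essentially a feature-importance claim, and the simplest version of that check is comparing each feature's correlation with price. A stdlib-only sketch with made-up listings (her actual pipeline used scikit-learn on real data):

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length series.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up listings: square footage, bedrooms, price (in $1000s).
sqft     = [1200, 1500, 1800, 2100, 2600]
bedrooms = [3, 2, 4, 3, 4]
price    = [230, 280, 330, 390, 470]

r_sqft = pearson(sqft, price)
r_beds = pearson(bedrooms, price)
```

With numbers like these, square footage correlates with price far more strongly than bedroom count, which is exactly the (unsurprising) conclusion of the talk.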
