Saturday, June 27, 2015

Friday night date with SparkSQL and Airpal: A stab at the Frecency algorithm

At my current company we have a pretty advanced analytics infrastructure and an operations group that stays on top of things. We have Kafka, Elasticsearch, HBase, Redshift, Spark (latest version), and Unravel. And yet data analysis is really hard because of the extremely slow MapReduce job that gets launched every time we want to run even rudimentary queries.

A couple of weeks ago, I attended the Airbnb conference where they announced open-sourcing Airpal. To my surprise, I found out just yesterday that my company has now installed Airpal on top of our HDFS/Hive cluster. The speedup in query retrieval is amazing, and within 10 minutes I had exported all the data I needed for my analysis into CSV files.

Anyway, here goes the problem statement.

Problem: Users log in to your cloud storage service, and every time they upload, download, preview, or comment on a file, the event is logged. Each of these events is logged in a separate table, which means you have a different table per event type, with each row looking like this:

(fileid, userid, date)
Given this, the goal is to come up with the 'frequent and recent' (frecent) set of files accessed by each user.

Solution: The first step is to consolidate all of this data into one feature table (let's call it the Access Table) with the following fields (a sketch of this step follows the field list):

(fileid, userid, date, number of uploads, number of downloads, number of times commented, number of times previewed)
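To make that concrete, here is a minimal sketch of the consolidation step (not the exact code from my repo). It assumes you are in the spark-shell (so sqlContext is predefined), that the four event tables are registered in Hive under the hypothetical names uploads, downloads, comments, and previews, and that each has the (fileid, userid, date) columns from the problem statement.

```scala
import org.apache.spark.sql.functions._

// Tag each event table with its event type; the table names are assumptions.
val uploads   = sqlContext.table("uploads").select("fileid", "userid", "date").withColumn("event", lit("upload"))
val downloads = sqlContext.table("downloads").select("fileid", "userid", "date").withColumn("event", lit("download"))
val comments  = sqlContext.table("comments").select("fileid", "userid", "date").withColumn("event", lit("comment"))
val previews  = sqlContext.table("previews").select("fileid", "userid", "date").withColumn("event", lit("preview"))

// Stack all the events, then pivot the per-event counts into one row per (fileid, userid, date).
val accessTable = uploads.unionAll(downloads).unionAll(comments).unionAll(previews)
  .groupBy("fileid", "userid", "date")
  .agg(
    sum(when(col("event") === "upload", 1).otherwise(0)).as("num_uploads"),
    sum(when(col("event") === "download", 1).otherwise(0)).as("num_downloads"),
    sum(when(col("event") === "comment", 1).otherwise(0)).as("num_comments"),
    sum(when(col("event") === "preview", 1).otherwise(0)).as("num_previews"))
```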

From this Access Table, we can compute the frecent files using two things: (a) the number of times the file was accessed in the past x days, and (b) a weighting scheme that gives x-day-old accesses the lowest weight and today's accesses the highest. One can also specify the relative importance of each activity (upload, download, etc.).
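Here is a rough sketch of how that scoring might look on the accessTable built above; treat it as an illustration rather than the code in my repo. The look-back window x, the linear decay shape, and the per-activity weights (uploads 2.0, downloads 1.0, comments 1.5, previews 0.5) are all assumptions you would tune.

```scala
import org.apache.spark.sql.functions._

val x = 30  // look-back window in days (assumed)

val frecent = accessTable
  .withColumn("age_days", datediff(current_date(), col("date")))
  .filter(col("age_days") >= 0 && col("age_days") < x)
  // (b) recency weight: 1.0 for today's accesses, approaching 0 for x-day-old ones
  .withColumn("recency_weight", (lit(x) - col("age_days")) / lit(x.toDouble))
  // relative importance of each activity type (assumed weights)
  .withColumn("activity_score",
    col("num_uploads") * 2.0 + col("num_downloads") * 1.0 +
    col("num_comments") * 1.5 + col("num_previews") * 0.5)
  .withColumn("weighted_score", col("recency_weight") * col("activity_score"))
  // (a) accumulate over all accesses in the window, per user and file
  .groupBy("userid", "fileid")
  .agg(sum("weighted_score").as("frecency"))
  .orderBy(col("frecency").desc)

frecent.show(20)
```

The top rows per user are then that user's frecent files.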

While this algorithm is pretty easy to implement in SQL, I wanted to see how SparkSQL would do. Since I have a basic understanding of how Spark works and some Scala, I decided to take a shot at Spark SQL to play with the data and explore DataFrames. If you want the code and the data, you can find them on my GitHub page (disclaimer: this is my first attempt at keeping track of my analysis on GitHub, so readability will not be perfect). Also, this code uses only the uploads and downloads data in the calculation.

PS: I just realized that it has been almost 9 months since my last post. Shame on me. I am sure I was never out of inspiration for the majority of this year, and yet I did not put out any fun posts. Shame on me.