Notes From the Hackathon at Cloudera
I was positively blown away by the enthusiasm, creativity, and productivity exhibited by the participants in the CDH3b2 Hackathon. We had over twenty participants from established companies like Oracle and Akamai, stealth-mode startups, and one-man consulting shops. At one point we had nine simultaneous hacking projects going, with groups of one to five people. At the end of the day, participants voted on the most interesting project, which won a prize: an iPod Nano for each participant on that project.
The winning project used Apache Hadoop, MapReduce, Apache Pig, Apache Hive and other tools to analyze the White House Visitor Logs. Using their sharp sleuthing efforts and mad Hadoop skillz, this team of hackers was able to come to some interesting conclusions about who is up to no good. Cross-referencing their findings with the news reveals that, indeed, the most frequent White House visitors are also mentioned in some of the biggest political and business news stories of the year. Groundbreaking journalism? Maybe not. A fantastic demonstration of the power of Hadoop to pull value and meaning out of even the most unstructured data? Absolutely!
The second-place project involved using Hadoop, Flume, HBase and Avro to do in-depth analysis of 30 minutes of data from Twitter using bigrams. This project was notable just for the sheer muscle of the hackers involved. In one day, not only did they put four (or more) major Hadoop toolsets to work, but they did it in a way that was effective and made sense. From scratch, mind you. SCRATCH! If this isn’t a good illustration of the power of CDH when placed in the right hands, I don’t know what is.
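For readers who have not met bigrams before, here is a minimal sketch of the idea in plain Python, written in a map-then-reduce shape. It is not the team's code: the function names, the whitespace tokenizer, and the sample tweets are all assumptions made up for illustration, and a real job would run on Hadoop rather than in a single process.

```python
from collections import Counter
from itertools import chain


def map_bigrams(tweet):
    """Map step: emit ((word, next_word), 1) for each adjacent word pair."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    words = tweet.lower().split()
    return [((a, b), 1) for a, b in zip(words, words[1:])]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each bigram key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def count_bigrams(tweets):
    """Run the map step over every tweet, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_bigrams(t) for t in tweets))


# Hypothetical sample data, standing in for a slice of the Twitter stream.
sample_tweets = [
    "hadoop is fun",
    "hadoop is fast",
]
bigram_counts = count_bigrams(sample_tweets)
```

Calling `bigram_counts.most_common(10)` would then list the most frequent word pairs, which is the kind of summary a bigram analysis usually starts from.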
Another notable project analyzed Twitter data, cross-referenced with data from Yelp and Foursquare. These guys assumed that tweets with locations in them are an accurate indication of human behavior, and used CDH3 to come to some interesting conclusions.
Other projects included:
- Nonnegative matrix factorization in MapReduce using Pig and Hive
- Analyzing product reviews from Toys R Us and other retailers for products that are harmful to kids
- Determining “interesting links” (using retweets, follows, etc.) from Twitter and storing them in Hive for query
- Integrating HTTP and Flume
- Integrating Perl and Avro
The CDH3b2 Hackathon was a great time for all involved, and an impressive display! Thanks to Accel Partners for the pizza, thanks to Rackspace for the remote cluster, and thanks to the Cloudera engineers for supporting the hackers. Most of all, thanks to the participants!