Excitement is building as Hadoop World nears, and we are sitting down with some of our presenters to ask them a few questions about their presentations and how they are using Hadoop within their organizations. Here we speak with Philip Kromer, President of Infochimps, who answers questions regarding his presentation, how Hadoop is used in his business, and what he aims to get out of Hadoop World. Philip’s presentation at Hadoop World is about the development of a data marketplace and commoditization, and Infochimps’ chimpanzee-style approach to data processing. Attend Hadoop World on October 12th in New York to hear more from, and talk to, Philip.
What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
We’re now able to quantify aspects of human behavior never before accessible. Twitter, the news stream, and the Smart Grid are exquisite lab instruments for measuring ‘conversation’, ‘interest’, and ‘activity’. What’s more, with enough data, machine-learning algorithms and big data tools let us expose insight using only the *structure* of the data, not its content. The massive quantity and connectivity required demand industrial-strength tools such as Hadoop.
We do *all* our data processing in high-level tools (chiefly Pig and Wukong) — “black boxes with flexible glue”. We use ‘programmer fun’ + ‘programmer time’ as our primary development metrics. Writing simple, loosely coupled scripts lets us run the fast, experiment-driven design cycles that a lean startup demands. It has also let us grow our own talent and recruit outside CS (physicists, in particular, dream in MapReduce). I think this approach should have strong appeal to small- and medium-sized businesses, or anyone looking for a low barrier to adoption of Hadoop.
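As a rough illustration of the “black boxes with flexible glue” style, here is a minimal word-count stage written as plain Hadoop Streaming-style Ruby. This is a hypothetical sketch, not the actual Wukong API — the function names and record layout are illustrative:

```ruby
# "Black boxes with flexible glue": each stage is a tiny script that turns
# records into records, so stages can be rearranged and tested in isolation.
# Illustrative sketch in plain Hadoop Streaming style -- not the Wukong API.

# Mapper stage: emit one (word, 1) pair per token of each input line.
def map_line(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Reducer stage: sum counts per word. (On a cluster, Hadoop delivers map
# output grouped and sorted by key; group_by stands in for that here.)
def reduce_pairs(pairs)
  pairs.group_by { |word, _| word }
       .map { |word, group| [word, group.sum { |_, n| n }] }
end

# Local "glue": chain the stages over a small sample, no cluster needed.
sample = ["the quick brown fox", "the lazy dog"]
counts = reduce_pairs(sample.flat_map { |line| map_line(line) })
```

On a cluster, the same two small scripts would run under Hadoop Streaming, with the framework providing the shuffle between them — the scripts themselves never need to know.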
Do you have Hadoop in production use today?
We have Hadoop in heavy production use for ad-hoc analysis and for automated processes digesting terabytes of data.
Can you describe some use cases for Hadoop in your business?
We have scraped data from around the web, principally social networks. We use Hadoop both to process this data on its own and to mash it up with other open and commercial datasets.
- We have a collection of 3 billion tweets (Twitter messages) from 60+ million users, which we tokenize into 16B+ usages of 65M terms — more than a terabyte of data on its own. Using Pig and Wukong, we can identify whom to follow, understand how events and news stories resonate, and even find dates.
- MLB has released a dataset describing the trajectory and full game state for every pitch of every game over the past several seasons. Smashing this against hourly weather data produces a laboratory with the potential to describe the physics of a knuckleball, or a pitcher’s performance as a function of age versus game-time temperature.
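The tweet-tokenization step in the first example might be sketched as below. This is a hypothetical, simplified version — the record layout and helper name are illustrative, and the production pipeline runs as Pig and Wukong jobs over the full corpus:

```ruby
# Explode tweets into term usages, as described above. Hypothetical sketch:
# assumes simplified (user_id, tweet_text) records; the real layout differs.

TOKEN_RE = /[@#]?\w+/  # keep @mentions and #hashtags as distinct terms

# Turn one tweet into a list of (term, user_id) usage records. Run over
# 3 billion tweets, records like these sum to the 16B+ usages mentioned.
def tweet_to_usages(user_id, text)
  text.downcase.scan(TOKEN_RE).map { |term| [term, user_id] }
end

usages = tweet_to_usages("1234", "Excited about #hadoop with @mrflip")
```

Keeping the mention/hashtag prefix on each term is what lets downstream jobs use only the *structure* of the conversation — who talks to whom, about what — without modeling the content.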
How do you support Hadoop?
Operationally, we use the Amazon cloud and a collection of Chef recipes (which we’ve open-sourced). These let us spin up, use, and spin down clusters of one to hundreds of machines, using either a local (persistent) HDFS or simply pushing and pulling data from Amazon S3.
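A recipe of that sort looks roughly like the sketch below — the package, template, and attribute names here are illustrative placeholders, not the actual ClusterChef cookbooks:

```ruby
# Illustrative Chef recipe sketch -- not the actual ClusterChef cookbooks.
# Installs a Hadoop package, renders its config from per-cluster attributes,
# and keeps the datanode service running.

package 'hadoop'

template '/etc/hadoop/conf/core-site.xml' do
  source 'core-site.xml.erb'
  # Attributes would be set per-cluster: e.g. the namenode address, and
  # whether storage is a local persistent HDFS or Amazon S3.
  variables(fs_default_name: node['hadoop']['fs_default_name'])
  notifies :restart, 'service[hadoop-datanode]'
end

service 'hadoop-datanode' do
  action [:enable, :start]
end
```

Because the cluster is described declaratively, the same recipes work whether we spin up one machine or hundreds.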
We have also been supporting Hadoop by giving back to the Hadoop open-source community.
- Wukong (our Ruby-language toolkit for Hadoop), which we believe is the easiest and most fun way to write map-reduce programs.
- At Hadoop World we’ll be announcing Chimpmark, a target benchmark for implementers and users of big data tools. It’s a collection of large-scale datasets, accompanying challenges, and reference implementations that let you profile, tune, and more deeply understand your Hadoop system.
- ClusterChef, the cluster management toolkit I described above.
How has Hadoop improved your business?
Most of what we use Hadoop for would otherwise be impossible.
What are you hoping to get out of your time at Hadoop World?
- Learn new ideas.
- Popularize and receive feedback on the development of a data marketplace.
- Hear where the world of Big Data is going.