Kevin Weil, Analytics Lead at Twitter and a featured presenter at Hadoop World, was gracious enough to sit down with us and answer a few questions in this Q&A-format interview. From the information Kevin has provided, you will gain a deeper understanding of the Hadoop ecosystem at Twitter, Kevin's presentation at Hadoop World, and what he expects to gain from attending Hadoop World.
Q: What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
Twitter uses Hadoop for a very broad set of applications. We have Hadoop jobs doing everything from user and product analysis to social graph analysis, from generating indices for people search to machine learning and natural language processing. Nearly every team at Twitter uses Hadoop in some way. We even have folks from our product marketing and sales teams running Hadoop jobs! In addition, we are very active in the Hadoop community; we employ committers on Hadoop, Pig, Hive, Avro, and Mahout. And we make the vast majority of our work on top of Hadoop open source: see our LZO compression work or our Elephant Bird project, both of which have seen great contributions from the community. Hadoop plays a critical role in helping us understand the Twitter ecosystem, and I’m going to touch on many of these use cases.
Q: Do you have Hadoop in production use today?
Yes, we have a cluster of about 100 nodes, which is actually far too small for the 8 terabytes of data generated each day across our systems. As we move to our new data center, we expect the size of our cluster to grow significantly, reflecting its importance to our company.
Q: Describe use cases for Hadoop at Twitter.
Hadoop is our data warehouse; every piece of data we store is archived in HDFS. We use HBase for data that sees updates frequently, or data we occasionally need low-latency access to. Every node in our cluster runs HBase. We use Java MapReduce for simple jobs, or jobs which have tight performance requirements. We use Pig for most of our analysis jobs, because its flexibility helps us iterate rapidly to arrive at the right way of looking at the data.
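The map/reduce pattern behind these jobs is easy to sketch. As a rough, hypothetical illustration (the log format and sample data below are invented for this sketch, not Twitter's actual data), counting distinct users per client source looks like this in plain Python:

```python
from itertools import groupby
from operator import itemgetter

# Invented tab-separated log lines: <source>\t<user>
LOGS = [
    "web\talice", "web\tbob", "mobile\talice",
    "web\talice", "mobile\tcarol",
]

def map_phase(lines):
    """Emit (key, value) pairs, as a Hadoop mapper would."""
    for line in lines:
        source, user = line.split("\t")
        yield source, user

def reduce_phase(pairs):
    """Group pairs by key and reduce each group, as a Hadoop reducer would.

    groupby requires its input sorted by the grouping key, which mirrors
    the shuffle-and-sort step between the map and reduce phases.
    """
    for source, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield source, len({user for _, user in group})

result = dict(reduce_phase(map_phase(LOGS)))
print(result)  # {'mobile': 2, 'web': 2}
```

The same logic is a one-liner in Pig (a GROUP followed by a COUNT of DISTINCT users), which is exactly why Pig's conciseness wins for exploratory analysis while hand-written MapReduce is reserved for performance-critical jobs.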
Our Hadoop use is also evolving: initially it was primarily an analysis tool to help us better understand the Twitter ecosystem, and that’s not going to change. But it’s increasingly used to build parts of the products you use on the site every day, such as People Search, whose underlying data is built with Hadoop. There are many more products like this in development.
Q: How do you support Hadoop?
We are fortunate to have a team of folks with significant Hadoop experience, and many others across the company eager to dive in and learn more. We use Puppet at Twitter to distribute configuration to all nodes, we use Cloudera’s Hadoop RPMs to maintain this configuration, and we use Ganglia and Nagios to monitor and alert on issues.
Q: What benefits do you see from Hadoop?
The Twitter ecosystem is growing at fantastic rates, and there is a corresponding growth in the size of our dataset. Hadoop has been instrumental in helping us analyze and understand that data for a few reasons. Its horizontal scalability has meant that as our dataset has grown, we have been able to simply add new nodes to our cluster. The fact that Hadoop takes care of scaling this way has meant that we can concentrate on the hard parts of the analysis, rather than having to update our code every time our data set grows by an order of magnitude. Additionally, Hadoop’s flexibility in analyzing semi-structured data has been important to us again and again. As our team grows, more people and teams are writing data into Hadoop, and we see fast evolution in the type and amount of data logged. Many storage systems would have choked innovation by requiring teams to wait for a schema update. Hadoop works well with rapidly evolving data, and we think projects like Howl will only make this easier.
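To see why that schema flexibility matters, here is a minimal, hypothetical sketch (the record layout is invented for illustration): the later records carry fields the earlier ones lack, and the analysis simply defaults the missing fields rather than waiting on a schema migration.

```python
import json

# Invented log records written at different times: later records carry
# fields ("client", "geo") that earlier ones lack.
records = [
    '{"user": "alice", "action": "tweet"}',
    '{"user": "bob", "action": "tweet", "client": "mobile"}',
    '{"user": "carol", "action": "retweet", "client": "web", "geo": "US"}',
]

counts = {}
for raw in records:
    rec = json.loads(raw)
    client = rec.get("client", "unknown")  # tolerate the missing field
    counts[client] = counts.get(client, 0) + 1

print(counts)  # {'unknown': 1, 'mobile': 1, 'web': 1}
```

A rigid relational warehouse would reject the new fields until its schema caught up; semi-structured storage lets every team start logging immediately and sort out structure at read time.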
Q: What did you use before Hadoop?
Hadoop has quickly become an essential part of our data storage, analytics, product development, and research efforts. Prior to Hadoop, we had a MySQL-based data warehouse and ETL system, as many companies do when they start out. It worked for a while, but over time the daily job began taking 16, 18, even 20 hours. That’s never been an issue since we switched to Hadoop, because it allows us to scale our cluster horizontally as Twitter usage grows. If we had to go back to our old system, it would probably take two weeks to run a single day’s worth of numbers today.
Q: How has Hadoop improved your work at Twitter?
You can judge a lot about the way your knowledge of a system is evolving by the types of questions you’re asking. Before Hadoop, we were straining to answer simple questions like “How many unique users tweeted over the last six months?” or “What is the distribution of users across our website, our mobile site, mobile Twitter applications, and desktop-based Twitter applications?” With Hadoop, those kinds of questions are simple to answer. Answers to these questions become part of the daily lexicon, and the questions get replaced by better, harder, more informative questions involving A/B testing, cohort analysis, and machine learning. It’s no longer hard to find the answer to a given question; the hard part is finding the right question. And as questions evolve, we gain better insight into our ecosystem and our business.
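As a concrete example of those "harder, more informative" questions, a basic A/B comparison of an engagement rate reduces to a standard two-proportion z-test. The counts below are invented for illustration; only the statistical test itself is standard:

```python
import math

# Hypothetical A/B experiment: did the variant lift the engagement rate?
conv_a, n_a = 1200, 20000   # control: engaged users, total users
conv_b, n_b = 1320, 20000   # variant: engaged users, total users

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                               # z-statistic

print(f"lift: {p_b - p_a:+.4f}, z: {z:.2f}")
```

With Hadoop doing the heavy lifting of tallying `conv` and `n` per experiment bucket across terabytes of logs, the final significance test is trivial; the hard part, as Kevin says, is choosing which experiment to run.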
Q: What are you hoping to get out of your time at Hadoop World?
I’m looking forward to hearing about innovative uses of Hadoop from the other companies in attendance, catching up with old friends, and getting to know more of the great Hadoop community!