A Great Week for Apache Hadoop: Summit Roundup
On June 10th, more than 750 people from around the world descended on the Santa Clara Marriott to share their love for a little stuffed elephant named Hadoop. It was a good week to be part of this exploding community, and I want to extend Cloudera’s heartfelt thanks to everyone who made it possible, especially our friends at Yahoo! who organized this Summit. Most importantly, I want to thank all of you who were able to participate. I know many of you couldn’t make it to California this time, so I hope to see you at the Hadoop Summit East in October.
For those of you who couldn’t join us, I thought I would post my notes on a few of the highlights.
Apache Hadoop Goes Mainstream:
About 300 developers attended last year’s summit, primarily from web companies and research labs. They were joined by a few forward-thinking venture capitalists. This year’s audience was both larger and different. In addition to the vibrant developer community, there was a flood of users of Hadoop. Though the audience was still dominated by web companies, attendees included traditional enterprise users with applications ranging from finance to biotech. There were technology previews from IBM and Sun. Major companies like Amazon joined our commercial efforts around Hadoop. VCs had also stepped up to sponsor status. Take-away? You ain’t seen nothing yet.
Hadoop In Print:
Yahoo! Developer Network gave away 500 copies of Tom White’s book, “Hadoop: The Definitive Guide,” published by O’Reilly. If you missed your copy, I’ve heard that when they aren’t busy developing AWS, Amazon has been known to sell a few books here and there.
Cloudera Presentation Slides:
Several Cloudera employees spoke at the Summit, and we have posted slides from those talks on the Hadoop Wiki. If you spoke, please put your slides up as well. Here are direct links to the Cloudera talks:
- The Growing Hadoop Community, Christophe Bisciglia
- Hadoop Configuration and Deployment, Matt Massie
- Running Hadoop in the Cloud, Tom White
- Job Scheduling for Hadoop, Matei Zaharia
Cloudera Announces New Distribution Features:
We see an increasing number of users moving data between Hadoop and more traditional database products, and more and more usage moving to the cloud – especially Amazon. To that end, we’ve released two new features, and a collection of new packages, that make Hadoop easier to use.
- Sqoop: Database Import for Hadoop. The brainchild of Aaron Kimball, Sqoop is an extensible command-line tool that copies data from a relational database into Hadoop. Sqoop uses JDBC to inspect the database schema and automatically generates all of the code necessary to move the data. It can import from any database that provides a JDBC driver, and it includes an extension that improves performance on MySQL by using the mysqldump command.
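Sqoop itself is a Java tool, but the two steps it automates — inspect a table's schema through a generic database API, then stream its rows out as the delimited text files Hadoop jobs commonly consume — can be sketched in miniature. This is a conceptual stand-in only, using Python's built-in sqlite3 in place of JDBC; the helper name and file layout are hypothetical, not Sqoop's actual code:

```python
# Conceptual stand-in for Sqoop's import flow (not Sqoop's real code):
# step 1, read the schema via the database's metadata facilities;
# step 2, dump every row as tab-separated text, the way imported data
# often lands in HDFS for downstream MapReduce, Pig, or Hive jobs.
import sqlite3

def import_table(db_path, table, out_path):
    """Copy one table into a flat tab-separated file; returns column names."""
    conn = sqlite3.connect(db_path)
    try:
        # Step 1: inspect the schema (Sqoop does this via JDBC metadata).
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        # Step 2: stream every row out as delimited text.
        with open(out_path, "w") as out:
            for row in conn.execute(f"SELECT * FROM {table}"):
                out.write("\t".join(str(v) for v in row) + "\n")
    finally:
        conn.close()
    return cols

# Build a toy database, then "import" it.
conn = sqlite3.connect("demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS employees (id INTEGER, name TEXT)")
conn.execute("DELETE FROM employees")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "ada"), (2, "grace")])
conn.commit()
conn.close()

print(import_table("demo.db", "employees", "employees.tsv"))  # prints ['id', 'name']
```

The real tool goes further: it generates Java classes matching the schema and runs the copy as a parallel MapReduce job rather than a single loop.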
- EBS Integration for Hadoop on AWS: Tom White had a busy month. Besides finishing his book, he spent some time thinking about how Hadoop runs on Amazon Web Services and came up with new code to make it better. Hadoop clusters on EC2 have always needed to copy data from S3 when they started up and write results back to S3 before they powered down. While Amazon’s Elastic MapReduce makes this round trip much easier operationally, EMR doesn’t support tools like Pig and Hive. Using Tom’s work, Cloudera is able to store data blocks on EBS volumes and attach them to EC2 nodes running Hadoop as needed. This delivers better throughput and more disks per node at lower cost, since EBS is cheaper than S3. Since no copies are required at startup and shutdown, your EC2 instances run for less time, saving CPU costs. Best of all, these changes to Hadoop work with Hive, Pig, Sqoop, and the rest of the Hadoop family. You can now load data, run jobs in your favorite language, turn your cluster off, and pick up exactly where you left off later. All your data survives.
- Preview Release of 0.20 Packages: Matt Massie and Todd Lipcon doubled down to get our testing release packaged so that those of you who crave the bleeding edge can start experimenting with version 0.20 of Hadoop today. Over the next few weeks, we’ll be bringing in changes from other leading Hadoop developers, upgrading our customers, and releasing stable packages to the community.
Hadoop Developer Offsite:
With so many Hadoop developers in the Bay Area, we decided to invite the Hadoop committers and some active developers to Cloudera’s offices. We wanted to collaborate without the assistance of mailing lists, JIRA, Hudson, or any other technology designed to make our lives easier. We used sticky notes to identify issues in parallel, found consensus by clustering them, and broke off into smaller teams to explore solutions. Out of this, we identified five things we love and hate about Hadoop, the biggest upcoming challenges for the project, and a wish list for the future. We broke into sub-projects to make concrete plans to address these issues, and we posted the meeting notes online. We’ll continue to host such meetings and to work with other leaders in the development community. Bottom line: as Hadoop grows up, we need to grow with it, and meetings like this are a great way to coordinate development efforts with the needs of the community.
Yahoo! Distribution of Hadoop:
Long known for their leadership in the Hadoop development community, Yahoo! stepped it up again by releasing the source code that they run on their alpha clusters to the community at large. There are some things you can only learn about Hadoop by running at Y!’s scale, and while this is not a stable production distribution, their source-only release (available via GitHub) provides 17 patches slated for inclusion in later versions of Hadoop. Cloudera is working closely with the team at Yahoo! to fold these patches into the next release of our distribution, along with dozens of patches we have developed to support customer workloads, and a half dozen or so from our friends at Amazon to improve performance on AWS. As big players like Yahoo! and Amazon continue to open their development processes, Cloudera can deliver more stable, better tested, and ultimately more trusted code to our enterprise customers and the community at large in the packages you know and love (RPMs, Debian packages, AMIs, etc.). It’s not always easy for big companies to be open, so we’d like to thank and congratulate everyone involved.
HBase:
HBase has endured its share of criticism over the last year, but based on last week’s presentation, many of those problems have been addressed. HBase has made incredible strides in reliability, availability, and performance. Version 0.20 is the first-ever “performance” release, and it focuses on improving random access, scan, and insert times. Check out these slides for details. We’re looking at an order-of-magnitude performance improvement, with random reads on par with a traditional RDBMS. The other major improvement involves ZooKeeper integration, which eliminates the single point of failure in the master node. This strengthens the case for including HBase with the Cloudera Distribution for Hadoop. Please let us know if you want HBase support.
We had a great time at the summit – we learned a lot and got to talk to a lot of smart people. We’re looking forward to October’s Hadoop Summit East in New York City!