State of the Elephant 2008
It’s a new year, the time when we take a moment to look back at the previous one, and forward to what might be coming next. In the world of Hadoop a lot happened in 2008.
At the beginning of the year, Hadoop was a sub-project of Lucene. In January, Hadoop became a Top Level Project at Apache, in recognition of its success and diversity of community. This allowed sub-projects to be added, the first of which was HBase, previously a contrib project. ZooKeeper, a service for coordinating distributed systems that had been hosted at SourceForge, became a Hadoop sub-project in May. Then in October, Pig (a platform for analyzing large datasets) graduated from the Apache Incubator to become another Hadoop sub-project. Finally, Hive, which provides data warehousing for Hadoop, moved from being a Hadoop Core contrib project to its own sub-project in November.
The creation of new projects is a sign of healthy growth in the Hadoop ecosystem. The core mailing list traffic has not shown the huge growth that it did in the preceding year, but the developer list remains the most trafficked Apache list (both at the time of writing, and for all time). In the new year Hadoop Core will have MapReduce and HDFS extracted to become standalone sub-projects, which will help ease the traffic burden on developers and users.
Over the course of 2008 the base of Hadoop committers grew in diversity. Hadoop Core had 13 committers at the beginning of the year, from 4 distinct organizations; by the end of the year there were 21 committers from 9 distinct organizations.
- Hadoop Core had four major releases in 2008, following a quarterly release cycle: 0.16.x, 0.17.x, 0.18.x, and 0.19.x. There were also many minor releases for bug fixes.
- HBase, which has a release after every major Hadoop release, had three releases in 2008 (the 0.19.0 release will be available soon).
- ZooKeeper made one major release (3.0.x).
- Pig had one major release (0.1.x); the next one will be a fairly major rewrite of the core to introduce a new type system.
- Hive has yet to make a release.
The first ever Hadoop Summit was hosted by Yahoo! at their Sunnyvale offices in March, and brought together Hadoop users and developers for a series of excellent presentations. The slides and videos are available at Yahoo! Research, and are well worth a look.
ApacheCon US 2008, held in New Orleans in November, had a dedicated Hadoop track called Hadoop Camp. There was also a Hadoop training course, and Cloudera ran a Hadoop Hack contest to let conference-goers dabble in Hadoop.
The academic and research community wholeheartedly embraced Hadoop in 2008. There are now a number of institutions that use Hadoop in their courses (including Brandeis University, University of Maryland, University of Washington, Carnegie Mellon University, UC Berkeley; see Google’s Code University for more), as well as in research.
Owen O’Malley wrote a MapReduce program and ran it on a 910-node Hadoop cluster to win the 2008 Terabyte Sort benchmark. The program sorted 1TB of data in 3.48 minutes (209 seconds), beating the previous record of 297 seconds. Owen provided more details on the Yahoo! Developer Blog. In November, Google announced that their MapReduce implementation sorted 1TB in 68 seconds running on 1000 machines. James Hamilton provided more analysis on the two results.
The number of open source projects in the distributed computing space continues to grow relentlessly. Here are some that came to prominence in 2008, and have some connection to Hadoop (if only because they are used in conjunction with Hadoop, or perform similar functions). In no particular order:
- Mahout, an Apache Lucene sub-project to create scalable machine learning libraries that run on Hadoop
- Jaql, a query language for JSON data
- CloudBase, a data warehouse system built on Hadoop
- Cassandra, a distributed storage service
- Cascading, an API for building dataflows for Hadoop MapReduce
- Scribe, a service for aggregating log data
- Tashi, an Apache Incubator project for cloud computing on large datasets
- Disco, a MapReduce implementation in Erlang/Python
- Hypertable, a distributed data storage system, modeled on Google’s Bigtable
- CloudStore, a distributed filesystem with Hadoop integration (formerly Kosmos filesystem)
Some Predictions for 2009
- Hadoop Core 1.0 will be released. (Notice I didn’t say what it will include, or when it will be!)
- There will be new Hadoop sub-projects, possibly Core contrib modules that see wider use and are promoted.
- More projects outside Apache will build on Hadoop.
- There will be increased adoption of Hadoop outside the web domain (e.g. retail, finance, bioinformatics, etc.).
(With apologies to ApacheCon for the post title.)