Apache Hadoop in 2011
2011 was a breakthrough year for Apache Hadoop, as many more mainstream organizations, large and small, turned to Hadoop to manage and process Big Data, and enterprise software and hardware vendors made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and interest in Big Data was not restricted to big companies.
Apache Hadoop Releases
Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.
1.0.0 adds HDFS support for HBase, WebHDFS (an HTTP REST interface to HDFS), and HDFS performance improvements
0.23 includes performance improvements, NameNode federation, and support for job scheduling and execution models other than MapReduce (YARN); it is not yet ready for production use
0.22 is a branch release based on 0.21 with some of the features of 1.0.0
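The WebHDFS interface added in 1.0.0 exposes HDFS over plain HTTP, so any client that can build a URL can read file metadata and data. A minimal sketch of how such request URLs are shaped (the host name is a placeholder; 50070 is the NameNode's default HTTP port, and OPEN/LISTSTATUS are real WebHDFS operations):

```python
# Sketch of WebHDFS v1 request-URL construction. The "namenode" host
# below is a placeholder; the URL layout and operation names follow
# the WebHDFS REST API.

def webhdfs_url(host, port, path, op, **params):
    """Build a URL such as
    http://namenode:50070/webhdfs/v1/tmp/file.txt?op=OPEN"""
    query = "&".join(["op=" + op] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Reading a file and listing a directory (no cluster needed to build these):
print(webhdfs_url("namenode", 50070, "/tmp/file.txt", "OPEN"))
print(webhdfs_url("namenode", 50070, "/tmp", "LISTSTATUS"))
```

An HTTP GET on the resulting URL would then be answered by the NameNode (redirecting to a DataNode for data reads), which is what makes HDFS accessible without Java client libraries.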
Hadoop Family Releases
ZooKeeper (distributed lock manager) released an update to version 3.3 and released version 3.4, which adds Kerberos client authentication and multi-update support (atomic batches of operations).
HBase (distributed key-value store on HDFS) saw updates to 0.90; 0.92 did not quite make it out by the end of the year but followed immediately after, with coprocessors for running custom code around queries, security features, a new file format, and distributed log splitting.
Avro (data serialization format) did 1.5 and 1.6 releases, including a .NET implementation, support for Snappy compression, a builder API, asynchronous RPC, reading and writing Protobuf and Thrift data structures in Avro data files, and performance and schema resolution improvements.
Hive (SQL-like interface to Hadoop) had an update release, 0.7.1, and a major release, 0.8, including bitmap indexes, the TIMESTAMP data type, a plugin developer kit, and JDBC driver improvements.
Mahout (machine learning with Hadoop) did a major release, 0.5, with an improved Lanczos solver, LDA improvements, a Stochastic Singular Value Decomposition implementation, an incremental SVD implementation, and an Alternating Least Squares with Weighted Regularization collaborative filtering implementation.
Pig (data flow language for MapReduce) had an update to 0.8 and a major release, 0.9, which introduced control structures, replaced the query parser, and cleaned up the language semantics.
Cassandra (distributed key-value store) released 0.7, 0.8, and 1.0.6, including data compression and increased read and write performance.
Hama (bulk synchronous parallel computing for matrix, graph, and network algorithms, among others) is an incubator project that did two major releases: 0.2 and 0.3.
Whirr (a set of libraries for running cloud services, with an emphasis on Hadoop-related services) graduated to a TLP (top-level project) and made four releases, including support for ElasticSearch, Voldemort, BYON, and HBase.
Projects that joined the Apache incubator in 2011
Flume (streaming data into HDFS); the Flume NG (next generation) project was launched to provide increased robustness.
Accumulo (distributed key-value store on HDFS).
Bigtop (packaging, deployment, and test framework for Hadoop) made two releases and helped validate and test Hadoop releases 1.0, 0.22, and 0.23 in combination with other ecosystem components.
Giraph (graph processing with Hadoop following the bulk-synchronous parallel model, in which vertices can send messages to other vertices during each superstep).
HCatalog (extension of Hive metadata store) released version 0.2 providing read and write capability for Pig and Hadoop, and read capability for Hive.
Kafka (a distributed publish-subscribe messaging system) released versions 0.6 and 0.7.
Oozie (workflow management for MapReduce, Pig, and Hive jobs); it released versions 3.0 and 3.1 which added support for “bundles” – sets of workflows managed together.
S4 (streaming data processing) released version 0.3.
Sqoop (data transfer between HDFS and relational databases) released version 1.4 with among other things customized type mapping.
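The bulk-synchronous parallel model that Giraph (above) applies to graphs can be pictured without any cluster: computation proceeds in supersteps, vertices exchange messages between supersteps, and the job ends when no messages remain. A toy in-memory sketch, propagating the maximum vertex value through a graph (the graph and values are made up for illustration, and this is not Giraph's API):

```python
# Toy BSP computation in the Giraph style: per-superstep message
# passing between vertices of a graph, with made-up data.

def max_value_bsp(edges, values):
    """edges: {vertex: list of out-neighbors}; values: {vertex: number}.
    Each vertex ends up knowing the largest value that can reach it."""
    values = dict(values)
    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = {v: [] for v in values}
    for v, nbrs in edges.items():
        for n in nbrs:
            inbox[n].append(values[v])
    # Later supersteps: a vertex that learned a larger value re-sends it.
    # The computation halts when a superstep produces no messages.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, received in inbox.items():
            if received and max(received) > values[v]:
                values[v] = max(received)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox
    return values

# Undirected triangle a-b-c: everyone converges to the global max, 7.
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(max_value_bsp(edges, {"a": 3, "b": 1, "c": 7}))  # all values become 7
```

In Giraph the same superstep loop runs as Hadoop map tasks, with vertices partitioned across workers and messages shipped over the network between supersteps.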
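Kafka's core abstraction (already present in the 0.6/0.7 releases mentioned above) is worth spelling out: a topic is an append-only log, and each consumer keeps its own read offset, so independent consumers can replay the same messages. A toy in-memory sketch of that model; the class and method names here are illustrative, not Kafka's API:

```python
# Toy version of Kafka's log/offset model. Names are made up for
# illustration and do not match Kafka's real client API.

class Topic:
    def __init__(self):
        self.log = []              # append-only message log

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1   # offset of the newly appended message

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0            # read position is consumer-side state

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

t = Topic()
t.publish("clicks:1")
t.publish("clicks:2")
a, b = Consumer(t), Consumer(t)
print(a.poll())   # ['clicks:1', 'clicks:2']
print(b.poll())   # ['clicks:1', 'clicks:2'] -- each consumer has its own offset
```

Because the broker only stores the log and consumers own their offsets, adding a new subscriber is cheap, which is what makes the publish-subscribe pattern scale.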
Conferences and vendor support
Besides the very well-attended Hadoop-specific conferences Hadoop Summit and Hadoop World, many conferences had significant Hadoop sections or sessions in 2011, including ApacheCon, Strata, Cloud Computing Expo, Chicago Data Summit, Structure Big Data 2011, and OSCON.
Microsoft dropped Dryad and will be supporting Hadoop in 2012.
Many other large companies announced major Hadoop initiatives during the year, including Oracle, Dell, HP, IBM, Informatica, and NetApp.
Other Hadoop indicators
In December 2011 the percentage of all job postings analyzed by Indeed that mentioned Hadoop was twice what it was in December 2010.
The “Powered By” list at http://wiki.apache.org/hadoop/PoweredBy went from 108 to 157 entries during the year.
The Hadoop family mailing lists saw about 72k messages in 2010 and about 101k in 2011 (going by markmail.org).
Ten new Hadoop committers were recognized by the community.
Predictions for 2012
In 2012 we’ll see HDFS become highly available: the NameNode will be shadowed with a hot standby (most of this work was done in 2011).
MapReduce 2 / YARN will stabilize and support clusters larger than 4,000 nodes.
More sophisticated job scheduling will allow adherence to strict SLAs (service level agreements) on start and completion while better utilizing cluster resources.
Hadoop will be used more extensively where there are real-time access requirements through HBase and other specialized interfaces.
HBase 0.92 with support for coprocessors (triggers or custom code executed around a query) will be released and there will be a flurry of coprocessor contrib packages to support secondary indexes, efficient aggregations, and many other optimizations.
It will become easier to set up and automate the flow of data into and out of HDFS without significant administrative overhead.
The number of BI offerings for Hadoop and their adoption will increase markedly.
Close attention will be paid to Hadoop and HBase uptime as they are deployed more widely in mission-critical roles.
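The coprocessor model behind the HBase 0.92 prediction above can be pictured as observer hooks that run before and after store operations, which is how features like secondary indexes get layered on without changing the core. A toy sketch of that idea (the names are illustrative, not HBase's coprocessor API):

```python
# Toy illustration of the coprocessor/observer idea: hooks run around
# each put, here maintaining a secondary (value -> keys) index.
# Class and method names are made up, not HBase's API.

class Store:
    def __init__(self, observers=()):
        self.rows = {}
        self.observers = list(observers)

    def put(self, key, value):
        for ob in self.observers:
            ob.pre_put(key, value)    # runs before the write
        self.rows[key] = value
        for ob in self.observers:
            ob.post_put(key, value)   # runs after the write

class SecondaryIndex:
    """Observer that keeps a value -> set-of-keys index beside the rows."""
    def __init__(self):
        self.index = {}

    def pre_put(self, key, value):
        pass  # validation or rewriting could happen here

    def post_put(self, key, value):
        self.index.setdefault(value, set()).add(key)

idx = SecondaryIndex()
s = Store(observers=[idx])
s.put("row1", "blue")
s.put("row2", "blue")
print(sorted(idx.index["blue"]))  # both keys that share the value "blue"
```

The contrib packages predicted above would essentially be observers like this, shipped as reusable modules and loaded into the region server rather than into a toy class.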