Apache Hadoop in 2013: The State of the Platform
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS made major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data, offered a hint of the future of the Hadoop platform.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
The Hadoop 1 code line produced four bug fix releases and two minor releases (1.1.0 and 1.1.1). Among the new features was support for Snappy compression/decompression.
Hadoop 2 saw a renaming from 0.23 and four bug fix releases. Major elements of the releases:
- Support/integration for HBase, Pig, Oozie, and other Hadoop family members
- HDFS High Availability (HA)
- HDFS Federation
- Performance enhancements in both MapReduce and HDFS
The 0.23 code line (Yahoo only) saw four minor updates. That code line does not include HDFS HA.
Hadoop Family Releases
Apache HBase (distributed key-value store on HDFS) offered two major releases and five minor releases:
- 0.92 – Security, coprocessors, a new and improved storage format, distributed log splitting
- 0.94 – Performance, MultiGet functionality for increments and appends, online automated table and region repair
Apache Avro (data serialization) saw six bug fix releases.
Apache Hive (SQL-like interface to Hadoop) had one minor release and one major release:
- 0.9 – Access primitive binary types in HBase, BETWEEN, several useful UDFs
Apache Mahout (machine learning with Hadoop) produced two major releases:
- 0.6 – Implementations of AVF algorithm, Lucene filter for Collocations, Conjugate Gradient for solving large linear systems, Online Passive Aggressive learner, Random Walk with Restarts, and many more
- 0.7 – Outlier removal capability in K-Means, Fuzzy K, Canopy and Dirichlet Clustering, New Clustering implementation for K-Means, Implicit Alternating Least Squares SVD
Apache Pig (data flow language for Hadoop) had one minor and one major release:
- 0.10 – boolean datatype, nested cross/foreach, JRuby UDF, limit by expression, split default destination, tuple/bag/map syntax support and map-side aggregation
Apache Hama (bulk synchronous parallel computing for e.g. matrix, graph, and network algorithms) graduated from the Apache Incubator and provided four major releases. Major additions:
- Streaming, K-Means, gradient descent BSP, Sparse Matrix-Vector multiplication, partitioned files
Apache Whirr (libraries for running Cloud services) had one major and one minor release:
- 0.8 – Support EC2 Cluster Compute Groups for Hadoop, CloudStack; Solr as a service
Apache Flume (streaming data into Hadoop) graduated from the Incubator and produced four minor releases of the rewritten, high-performance Flume NG.
Apache Bigtop (packaging, deployment, and test framework for Hadoop) graduated from the Incubator and had three major releases:
- 0.3 – Apache Hadoop 1.0
- 0.4 – bootable ISO, BoxGrinder appliance, HDFS HA NameNodes, Apache Giraph, Hue
- 0.5 – Apache Solr, Apache Crunch (incubating), DataFu
Apache Giraph (graph processing with Hadoop) graduated to a Top Level Project after releasing 0.1 as an Incubator project.
Apache HCatalog (extension of Hive MetaStore) released 0.4.0 as an Incubator project.
Apache Oozie (workflow management for MapReduce, Hive, Pig, and other Hadoop jobs) graduated from the Incubator and provided one bug fix release and two minor releases:
- 3.2.0 – Hive, Sqoop, and Shell actions, Kerberos SPNEGO authentication
- 3.3.0 – Direct submission of MapReduce jobs, parameterization in workflow and coordinator jobs
Apache Sqoop (data transfer between Hadoop and relational databases) graduated from the Incubator and produced three bug fix releases along with a first cut of the next-generation Sqoop 2 (client-server).
Apache Crunch (Java library for MapReduce pipelines) provided two minor releases as an Incubator project:
- 0.3 – Map-side joins
- 0.4 – Bloom filters, read from database, launch pipeline from a REPL
Hue (Web interface to Hadoop, Hive, Impala, Oozie, Pig) produced a major release, 2.0, and a minor release, 2.1, on GitHub:
- Redesigned full-page paradigm, LDAP authentication, per-application authorization, Oozie workflow/coordinator dashboard and designer, localization for eight languages
Hadoop Conferences
- 2012 saw the first ever HBase conference – HBaseCon in San Francisco in May – with over 600 participants.
- Hadoop World merged with Strata East and drew 2,500 attendees, almost twice Strata East's attendance the year before, and it sold out in advance.
- Hadoop Summit attendance grew from over 1,600 in 2011 to over 2,100 in 2012.
- Hadoop was featured at many other conferences during the year, including GigaOM Structure, CeBIT Big Data, DataWeek, Berlin BuzzWords, ApacheCon, Cloud Computing Expo, OSCON, Strange Loop and CloudCon.
Other Hadoop News
- Hadoop won the “Duke’s Choice” award for “extreme innovation” in Java.
- 10gen announced support for Hadoop, as did VMware/SpringSource, Splunk, Revolution Analytics, SAS, TIBCO, QlikTech and others.
- Cloudera announced the public Beta of Impala, the first real-time SQL query interface to HDFS and HBase data.
Other Hadoop Indicators
- The percentage of all job postings analyzed by Indeed that mentioned Hadoop almost doubled in 2012 compared with 2011, just as it had the year before.
- The Hadoop family mailing lists saw about 101k messages in 2011 and about 129k in 2012 (going by markmail.org).
- The “Powered By” list at http://wiki.apache.org/hadoop/PoweredBy went from 157 to 165 entries during the year.
- Ten new committers were added to core Hadoop in 2012, and many more to the various Hadoop family projects.
Rob runs the Platform Engineering team at Cloudera.