CDH 4.3: Now Shipping with More Apache HBase Improvements

As you may know, Apache HBase has a vibrant community and gets a lot of contributions from developers worldwide. The collaborative development effort is so active, in fact, that a new point-release comes out about every six weeks (with the current stable branch being 0.94).

At Cloudera, we’re committed to ensuring that CDH, our open source distribution of Apache Hadoop and related projects (including HBase), ships with the results of this steady progress. Thus, CDH 4.2 was rebased on 0.94.2, as compared to its predecessor CDH 4.1, which was based on 0.92.1. CDH 4.3 has moved one step further and is rebased on 0.94.6.1.

Apart from the rebase, CDH 4.3 also includes some important bug fixes backported from later versions of 0.94 and from trunk. The following are some of the important features and improvements that now ship in CDH 4.2/CDH 4.3:

New features:

  • Snapshots/Metrics: As explained in “Introduction to Apache HBase Snapshots”, a user can take a snapshot of a table and later restore its data and schema. This sorely missed HBase feature was added in CDH 4.2. CDH 4.3 makes snapshots more user-friendly by adding snapshot metrics and progress information for snapshot commands that are currently running. (See for details: HBASE-7615, HBASE-7415.)
  • HLog Compression/Replication compatibility: HBase Replication is now compatible with HLog compression. This means a cluster with WAL compression can replicate its data to another cluster with no WAL compression, and vice versa. (See for details: HBASE-5778.)

Operability/performance improvements:

  • NN HA Support: CDH 4.3 has improved RegionServer/Master-side support for failing over to a standby NameNode. It adds retry logic around NameNode operations, so a failover no longer causes a RegionServer or Master to abort. (See for details: HBASE-8211.)
  • Lazy seek optimization: HBase now reads StoreFiles lazily while querying data, reducing the number of disk seeks. This improves overall read throughput and is especially useful for workloads that read the latest data (such as Increments). (See for details: HBASE-4465.)
  • Atomic Put/Delete per row: Previously, when several types of operations were applied to a row (a Put on some columns and a Delete on others), each operation took a separate lock of its own. These combined mutations now take a single lock, improving overall throughput. (See for details: HBASE-3584.)
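To make the idea of "retry logic around NameNode operations" more concrete, here is a minimal, generic sketch in plain Java. It is not the code from HBASE-8211: the class `RetrySketch`, the method `withRetries`, and the parameters `maxAttempts` and `pauseMillis` are all hypothetical names chosen for illustration, and the failing NameNode call is simulated.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; this is NOT an HBase class.
final class RetrySketch {

    // Runs op, retrying on IOException up to maxAttempts times in total and
    // pausing pauseMillis between attempts. If every attempt fails, the last
    // IOException is rethrown to the caller.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // assumed transient, e.g. a NameNode failover in progress
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated file system call: fails twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("NameNode unavailable (simulated)");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of the pattern is that a transient failure surfaces as a short delay instead of an abort; only when the attempts are exhausted does the error propagate to the caller.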

Conclusion

As described above, CDH 4.3 introduces some important HBase features/bug fixes for better usability (and is compatible with its CDH 4.x predecessors). For more information, please refer to the HBase section in the CDH4 Installation Guide.

Himanshu Vashishtha is a Software Engineer on the Platform team.
