Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. The main focus of the 0.94.0 release was on performance enhancements and new features, along with several major bug fixes.
Performance Related JIRAs
Below are a few of the important performance related JIRAs:
- Read Caching improvements: HDFS stores a block's data in one file and its corresponding metadata (checksum) in another. This means that every read into the HBase block cache may consume up to two disk ops, one for the data file and one for the checksum file. HBASE-5074: “Support checksums in HBase block cache” adds a block-level checksum to the HFile itself in order to avoid one disk op, boosting read performance. This feature is enabled by default.
- Seek optimizations: Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each of these files and merge the results, even if the row/column being looked up was in the most recent file. HBASE-4465: “Lazy Seek optimization of StoreFile Scanners” optimizes scanner reads to read the most recent StoreFile first by lazily seeking the StoreFiles. This is achieved by introducing a fake KeyValue whose timestamp equals the maximum timestamp present in the particular StoreFile. A disk seek is thus avoided until the KeyValueScanner for a StoreFile bubbles up to the top of the heap, which implies that a real read operation is needed. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is enabled by default.
- Write to WAL optimizations: HBase write throughput is bounded above by the write rate of the WAL, which is replicated to a number of datanodes depending on the replication factor. HBASE-4608: “HLog Compression” adds a custom dictionary-based compression of HLogs for faster replication on HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is off by default.
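The lazy seek idea described above can be illustrated with a small, self-contained sketch. It is not HBase code: `FileScanner`, `latestVersion`, and the other names are hypothetical, and each StoreFile is reduced to two numbers (its file-wide maximum timestamp and the newest timestamp it holds for the requested cell). Each scanner starts on the heap with a fake key, and a simulated disk seek happens only when that scanner reaches the top.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class LazySeekSketch {
    public static final long NOT_FOUND = -1L;

    /** Hypothetical stand-in for a scanner over one StoreFile. */
    public static final class FileScanner {
        final long latestForCell;      // newest timestamp of the requested cell in this file, or NOT_FOUND
        long currentTimestamp;         // starts as the fake key: the file-wide maximum timestamp
        boolean realSeekDone = false;

        public FileScanner(long fileMaxTimestamp, long latestForCell) {
            this.currentTimestamp = fileMaxTimestamp;
            this.latestForCell = latestForCell;
        }

        /** Simulates the disk seek that replaces the fake key with the real one. */
        void realSeek() {
            currentTimestamp = latestForCell;
            realSeekDone = true;
        }
    }

    /** Newest timestamp found for the cell, plus the number of simulated disk seeks. */
    public static final class Result {
        public final long timestamp;
        public final int seeks;

        Result(long timestamp, int seeks) {
            this.timestamp = timestamp;
            this.seeks = seeks;
        }
    }

    public static Result latestVersion(List<FileScanner> scanners) {
        // Newest timestamp first; on a tie, a real key wins over a fake one.
        Comparator<FileScanner> newestFirst = (a, b) -> {
            int byTimestamp = Long.compare(b.currentTimestamp, a.currentTimestamp);
            if (byTimestamp != 0) {
                return byTimestamp;
            }
            return Boolean.compare(b.realSeekDone, a.realSeekDone);
        };
        PriorityQueue<FileScanner> heap = new PriorityQueue<>(newestFirst);
        heap.addAll(scanners);

        int seeks = 0;
        while (!heap.isEmpty()) {
            FileScanner top = heap.poll();
            if (top.realSeekDone) {
                // A real key is on top: every other file can only hold older data.
                return new Result(top.currentTimestamp, seeks);
            }
            top.realSeek();            // the fake key reached the top, so pay for a seek
            seeks++;
            if (top.currentTimestamp != NOT_FOUND) {
                heap.add(top);         // re-insert under its real key
            }
        }
        return new Result(NOT_FOUND, seeks);
    }

    public static void main(String[] args) {
        Result r = latestVersion(Arrays.asList(
                new FileScanner(100, 100),   // most recent file holds the latest value
                new FileScanner(80, 70),
                new FileScanner(50, NOT_FOUND)));
        System.out.println("timestamp=" + r.timestamp + " seeks=" + r.seeks);
    }
}
```

In this example only the most recent file is actually read; an eager scanner would have performed one seek per file.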
New Feature Related JIRAs
Here is a list of some of the important JIRAs related to adding new features:
- More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features, such as fixing orphaned regions, region holes, and overlapping regions. HBASE-5128: “Uber hbck” adds these missing features to the first aid box.
- Simplified Region Sizing: Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, and workload. HBASE-4365: “Heuristic for Region size” adds a heuristic that raises the split size threshold of a table's regions as the data grows, thus limiting the number of region splits.
- Smarter transaction semantics: Although HBase supports single-row transactions, a series of updates (Puts/Deletes) to an individual row locks the row separately for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
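To make the region sizing heuristic more concrete, here is a minimal sketch of a split threshold that grows with the data. It assumes the threshold scales with the square of the number of regions a table already has on a region server, capped at a configured maximum file size; the squared growth, the method name `splitThreshold`, and the sizes used below are illustrative assumptions rather than details stated in this post.

```java
public class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    /**
     * Hypothetical split threshold: small while a table has few regions,
     * growing quickly as regions are added, and never above maxFileSizeBytes.
     */
    public static long splitThreshold(int regionCount, long flushSizeBytes, long maxFileSizeBytes) {
        if (regionCount <= 0) {
            return maxFileSizeBytes;   // no regions known yet: fall back to the cap
        }
        long grown = flushSizeBytes * (long) regionCount * (long) regionCount;
        return Math.min(maxFileSizeBytes, grown);
    }

    public static void main(String[] args) {
        long flushSize = 128 * MB;     // assumed flush size
        long maxFileSize = 10240 * MB; // assumed upper bound (10 GB)
        for (int regions : new int[] {1, 2, 3, 8, 9}) {
            long threshold = splitThreshold(regions, flushSize, maxFileSize);
            System.out.println(regions + " region(s): split at " + (threshold / MB) + " MB");
        }
    }
}
```

Under these assumptions a new table splits early, which spreads load across the cluster, while a large table converges on the configured maximum and stops splitting so often.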
This major release has a number of new features and bug fixes: a total of 397 resolved JIRAs, including 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up a window of opportunity to backport some of the cool features to CDH4, which is based on the 0.92 branch.
Thanks to everyone who contributed to this release and a hat tip to Lars Hofhansl of Salesforce for being the release manager.