Cloudera recently launched CDH 6.2, which includes two key new features in Apache HBase:
- Serial replication
- Bucket cache now supports Intel’s Optane memory
HBase has a sophisticated asynchronous replication mechanism that supports complex topologies, including global round-robin, two-way, span-in and span-out.
To date, this replication capability has provided eventual consistency, meaning that the order in which updates are replicated is not necessarily the same as the order in which they were applied to the source database. While this worked for many customers, other use cases depend on the order of updates being preserved on the replication endpoint.
The serial replication feature provides timeline consistency for replication. In other words, the order of updates is preserved through replication to the destination cluster. This consistency comes at a slight cost: in some cases, users may find that replication is slightly slower than with the default replication approach.
Configuring this option is fairly simple: set the SERIAL flag to true. It can be applied when replication is first set up or at any time afterward, at the table level, at the namespace level, or for a peer that replicates all tables in HBase.
HBase bucket cache
HBase’s bucket cache is a two-layer cache designed to improve read performance across a variety of use cases. The first layer lives in the Java heap, and the second layer can reside in a number of different locations, including off-heap memory, Intel Optane memory, SSDs or HDDs.
The recommended configuration for the bucket cache’s second layer for most customers has been off-heap. Deployments in this configuration are able to scale up to much larger memory sizes than is possible with the built-in on-heap cache, since the off-heap engine avoids JVM garbage collection pressure. The larger cache size provides significantly improved HBase read performance.
Starting with CDH 6.2, Cloudera includes the ability to use Intel’s newly released Optane memory as an alternate destination for the second tier of the bucket cache. This deployment configuration gives you ~3x the cache size at constant cost (compared to an off-heap cache on DRAM). It does incur some additional latency compared to the traditional off-heap configuration, but our testing indicates that, by allowing more (if not all) of the data’s working set to fit in the cache, the setup results in a net performance improvement when the data is ultimately stored on HDFS (using HDDs).
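The reasoning above can be sketched with a simple average-latency model. This is a minimal illustration, not a benchmark: the latency figures are taken from the table later in this post, while the hit ratios (0.80 for the smaller DRAM cache, 0.95 for the larger Optane cache) and the function name `effective_latency_ns` are hypothetical assumptions chosen only to show the shape of the trade-off. The model also ignores the on-heap first layer and any HDFS overhead.

```python
def effective_latency_ns(hit_ratio, cache_latency_ns, miss_latency_ns):
    """Average read latency when a fraction `hit_ratio` of reads hit the cache."""
    return hit_ratio * cache_latency_ns + (1.0 - hit_ratio) * miss_latency_ns


# Latencies from the table in this post, converted to nanoseconds.
DRAM_NS = 70.0          # off-heap DRAM (~70 ns)
OPTANE_NS = 340.0       # Intel Optane, upper end of the 180-340 ns range
HDD_NS = 4_000_000.0    # HDD, lower end of the 4-10 ms range

# Hypothetical hit ratios: the larger Optane cache holds more of the working set.
DRAM_HIT_RATIO = 0.80
OPTANE_HIT_RATIO = 0.95

dram_avg = effective_latency_ns(DRAM_HIT_RATIO, DRAM_NS, HDD_NS)
optane_avg = effective_latency_ns(OPTANE_HIT_RATIO, OPTANE_NS, HDD_NS)

print(f"DRAM cache, 80% hits:   {dram_avg / 1e6:.2f} ms average")
print(f"Optane cache, 95% hits: {optane_avg / 1e6:.2f} ms average")
```

Under these assumed hit ratios, the miss penalty dominates the average, so the slower but larger cache comes out ahead. Real results depend on the actual working set and access pattern.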
When deploying to the cloud or using on-prem object storage, the performance improvement will be even greater, as object storage tends to have very high latency for random reads of small amounts of data. The table below gives a sense of the cost, size and latency trade-offs to weigh when planning how to configure the second tier of the bucket cache.
|Storage|Cost ($/GB)|Size at constant cost ($35)|Latency|
|---|---|---|---|
|Off-heap DRAM|35|1.0 GB|~70 ns|
|Intel Optane¹|13|2.7 GB|180-340 ns|
|SSD|0.15|233.3 GB|10-100 µs|
|HDD²|0.027|1.3 TB|4-10 ms|
|Object storage³|0.006|5.8 TB|10-100 ms|
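The size column follows directly from the cost column: it is the capacity that the price of 1 GB of off-heap DRAM ($35) buys on each storage type. A short sketch of that arithmetic, using the costs from the table (the names `COST_PER_GB` and `size_at_constant_cost` are illustrative, and 1 TB is taken as 1000 GB):

```python
# Cost per GB in USD, copied from the table above.
COST_PER_GB = {
    "Off-heap DRAM": 35.0,
    "Intel Optane": 13.0,
    "SSD": 0.15,
    "HDD": 0.027,
    "Object storage": 0.006,
}


def size_at_constant_cost(budget_usd, cost_per_gb):
    """Return the capacity in GB that `budget_usd` buys at `cost_per_gb`."""
    return budget_usd / cost_per_gb


# The constant budget is the price of 1 GB of off-heap DRAM.
BUDGET_USD = COST_PER_GB["Off-heap DRAM"]

sizes_gb = {
    name: size_at_constant_cost(BUDGET_USD, cost)
    for name, cost in COST_PER_GB.items()
}

for name, gb in sizes_gb.items():
    print(f"{name:15s} {gb:10.1f} GB")
```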
Read this blog to learn more about the Intel and Cloudera collaboration on using Optane DC Persistent Memory to improve performance.
- Optane DC Persistent Memory Performance Overview (https://www.youtube.com/watch?v=UTVt_AZmWjM), minute 6:53
- https://www.qualeed.com/en/qbackup/cloud-storage-comparison/, https://www.dellemc.com/en-us/collaterals/unauth/analyst-reports/products/storage/esg-ecnomic-value-audi-dell-emc-elastic-cloud-storage.pdf