Caching in Apache HBase: SlabCache

This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.

Background

The amount of memory available on a commodity server has increased drastically in line with Moore’s Law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is good news for databases such as HBase, which rely on in-memory caching to boost read performance.

However, despite the availability of high-memory servers, the garbage collection algorithms available in production-quality JDKs have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, noticeably disrupting latency-sensitive user applications.

Garbage Collection

The below is meant to be a quick summary of an immensely complex topic; if you would like a more detailed explanation of garbage collection, check out this post.

HBase, along with the rest of the Apache Hadoop ecosystem, is built in Java. This gives us access to an incredibly well-optimized virtual machine and an excellent mostly-concurrent garbage collector in the form of Concurrent Mark-Sweep (CMS). However, large heaps remain a weakness: CMS collects garbage without moving it around, so free space can end up scattered throughout the heap instead of sitting in large contiguous chunks. Given enough time, fragmentation will force a full, stop-the-world garbage collection with a copying collector capable of relocating objects. This results in a potentially long stop-the-world pause, and acts as a practical limit on the size of our heap.

Garbage collectors that do not require massive stop-the-world compactions do exist, but are not presently suitable for use with HBase. The Garbage-First (G1) collector, included in recent versions of the JVM, is one promising example, but early testing indicates that it still exhibits some flaws. JVMs from other (non-Oracle) vendors that offer low-pause concurrent garbage collectors also exist, but they are not in widespread use in the HBase and Hadoop communities.

The Status Quo

Currently, in order to utilize all available memory, we allocate a smaller heap and let the OS use the rest of the memory. In this case, the memory isn’t wasted – it’s used by the filesystem cache. While this does give a noticeable performance improvement, it has its drawbacks. Data in the FS cache is still treated as file data, requiring us to checksum and verify it. We also have no guarantee of what the filesystem cache will do, and only limited control over its eviction policy. While the Linux FS cache is nominally an LRU cache, other processes or jobs running on the system may flush our cache, adversely impacting performance. The FS cache is better than letting the memory go to waste, but it is neither the most efficient nor the most consistent solution.

Enter SlabCache

Another option would be to manually manage the cache within Java via slab allocation – opting to avoid garbage collection altogether. This is the approach I implemented in HBASE-4027.

SlabCache operates by allocating a large quantity of contiguous memory and then performing slab allocation within that block of memory. Buffers matching the likely sizes of cached objects are allocated in advance; upon caching, each object is fit into the smallest available buffer that can contain it. Effectively, any fragmentation is internalized by the cache, trading off some space in order to avoid external fragmentation. As blocks generally converge around a single size, this method can still be quite space-efficient.
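As a rough illustration of the idea – this is a simplified sketch, not the actual SlabCache code, and the class name, slab sizes, and pool counts are invented for the example – a slab allocator pre-allocates pools of fixed-size buffers and serves each cached block from the smallest slab size that fits:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of slab allocation; not the actual SlabCache code.
public class SlabAllocatorSketch {
  // Slab size (bytes) -> pool of pre-allocated buffers of that size.
  private final TreeMap<Integer, ByteBuffer[]> slabs =
      new TreeMap<Integer, ByteBuffer[]>();

  public SlabAllocatorSketch() {
    // HBase blocks cluster around the configured block size (64 KB by
    // default), so most of the memory goes to slabs near that size.
    slabs.put(64 * 1024, allocatePool(64 * 1024, 256));
    slabs.put(128 * 1024, allocatePool(128 * 1024, 32));
  }

  private static ByteBuffer[] allocatePool(int size, int count) {
    ByteBuffer[] pool = new ByteBuffer[count];
    for (int i = 0; i < count; i++) {
      // The real cache uses direct (off-heap) buffers; see below.
      pool[i] = ByteBuffer.allocate(size);
    }
    return pool;
  }

  // Pick the smallest slab size that can contain the block. Space wasted
  // inside a slab is internal fragmentation, accepted in exchange for never
  // fragmenting the memory around the slabs.
  public Integer slabSizeFor(int blockLength) {
    Map.Entry<Integer, ByteBuffer[]> entry = slabs.ceilingEntry(blockLength);
    return entry == null ? null : entry.getKey(); // null: too large to cache
  }
}
```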

Implementation

While slab allocation does not create fragmentation, other parts of HBase still can. With slab allocation, the frequency of stop-the-world (STW) pauses may be reduced, but the worst-case maximum pause time isn’t – the JVM could still decide to move our entire slab around if we happen to be really unlucky, again contributing to significant pauses. In order to prevent this, SlabCache allocates its memory using direct ByteBuffers.

Direct ByteBuffers, available in the java.nio package, are allocated outside of the normal Java heap – just like using malloc() in a C program. The garbage collector will not move memory allocated in this fashion, guaranteeing that a direct ByteBuffer will never contribute to the maximum garbage collection time. The ByteBuffer “wrapper” is registered as an ordinary heap object, and when that wrapper is collected, the underlying memory is released back to the system using free.
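A minimal, self-contained example of what this looks like (the class name and buffer size here are just for illustration):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
  public static void main(String[] args) {
    // Allocated outside the Java heap, much like malloc() in C. The GC will
    // never relocate this memory, so it cannot add to compaction pause times.
    ByteBuffer slab = ByteBuffer.allocateDirect(64 * 1024 * 1024);

    // Only the small ByteBuffer wrapper object lives on the heap. When the
    // wrapper becomes unreachable and is collected, the JVM releases the
    // underlying native memory back to the system.
    slab.putLong(0, 42L);
    System.out.println("capacity=" + slab.capacity()
        + " direct=" + slab.isDirect());
  }
}
```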

Reads are performed using a copy-on-read approach. Every time HBase does a read from SlabCache, the data is copied out of the SlabCache and onto the heap. While passing by reference would have been the more performant solution, it would have required some way of carefully tracking references to these objects. I decided against reference counting, as it opens up the potential for an entirely new class of bugs, making continued work on HBase more difficult. Solutions involving finalizers or reference queues were also discarded, as neither guarantees timely operation. In the future, I may revisit reference counting if necessary to increase read speed.
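A sketch of the copy-on-read path (with hypothetical names, not the actual HBase API): the cached bytes never escape the slab, so no reference tracking is needed.

```java
import java.nio.ByteBuffer;

// Hypothetical copy-on-read helper; not the actual HBase code.
public class CopyOnRead {
  // Copy a cached block out of an off-heap slab into a fresh on-heap array.
  public static byte[] readBlock(ByteBuffer slab, int offset, int length) {
    byte[] copy = new byte[length];
    // duplicate() gives an independent position/limit over the same memory,
    // so concurrent readers don't interfere with one another.
    ByteBuffer view = slab.duplicate();
    view.position(offset);
    view.limit(offset + length);
    view.get(copy);
    return copy; // the caller owns this copy; the slab is never referenced again
  }
}
```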

SlabCache operates as an L2 cache, replacing the FS cache in this role. The on-heap cache is maintained as the L1 cache. This arrangement allows us to use large amounts of memory with substantial gains in speed and consistency over the status quo, while at the same time ameliorating the downsides of the copy-on-read approach: because the vast majority of our hits come from the on-heap L1 cache, we do a minimum of copying data and creating new objects.
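Conceptually, a read looks something like the following (a simplified sketch with invented class and method names – the real block cache keys and interfaces differ):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified two-level lookup illustrating the L1 (on-heap) / L2 (SlabCache)
// arrangement; names here are invented for this example.
public class TwoLevelCacheSketch {
  public interface OffHeapCache {
    byte[] getBlock(String key); // returns an on-heap copy, or null on miss
  }

  private final Map<String, byte[]> onHeapL1 =
      new ConcurrentHashMap<String, byte[]>();
  private final OffHeapCache slabL2;

  public TwoLevelCacheSketch(OffHeapCache slabL2) {
    this.slabL2 = slabL2;
  }

  public byte[] getBlock(String key) {
    // The vast majority of hits are served here, with no copying at all.
    byte[] block = onHeapL1.get(key);
    if (block != null) {
      return block;
    }
    // L1 miss: copy the block out of the slab (copy-on-read), then promote
    // it to L1 so the next read avoids the copy.
    block = slabL2.getBlock(key);
    if (block != null) {
      onHeapL1.put(key, block);
    }
    return block;
  }
}
```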

Performance

SlabCache operates at around 3-4x the performance of the file system cache, and also provides more consistent performance.

Performance comparisons of the three caches follow. In each test, the cache under test was configured as the primary (L1) and only cache in HBase. YCSB was then run against HBase trunk.

In all cases, HBase was running in standalone mode, compiled against the 0.20-append branch. As HDFS has gotten faster since the last release, I’ve also provided tests with the RawLocalFS, in order to isolate the difference between accessing the slab cache and accessing the FS cache by removing HDFS from the equation. In this mode, CRC checking is turned off and the local filesystem (ext3) is used directly. Even under these optimal conditions, SlabCache still nets a considerable performance gain.

If you’d like to try out this code in trunk, simply set MaxDirectMemorySize in hbase-env.sh. This will automatically configure the cache to use 95% of MaxDirectMemorySize and set reasonable defaults for the slab allocator. If you want finer control over off-heap memory usage and slab allocation sizing, you can adjust the SlabCache settings in hbase-site.xml.
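For example, something like the following in hbase-env.sh would give the cache roughly 8 GB of direct memory to work with. The size is only an example – pick a value that fits your hardware – and consult the release documentation for the exact hbase-site.xml property names if you want finer-grained tuning.

```
# hbase-env.sh (example value only; adjust to your hardware)
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=8g"
```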

Conclusion

If you’re running into read-performance walls with HBase, and have extra memory to spare, please give this feature a try! It is due to be released in HBase 0.92 as an experimental feature, and will hopefully enable more efficient usage of memory.

I had an amazing summer working on this project, and as an intern, I’m awed to see this feature work and be released publicly. If you found this post interesting, and would like to work on problems like this, check out the careers page.

6 Responses
  • Kevin Burton / January 19, 2012 / 1:32 AM

    You could have bypassed the copy and just kept the data in the FS cache and used mlock() to prevent it from being evicted.

  • Li Pi / January 19, 2012 / 1:39 AM

    Doesn’t mlock() deal with paging rather than the FileSystem cache?

  • Andrew / January 28, 2012 / 2:01 AM

    I think Kevin was thinking of the mmap()+mlock() way of keeping a file in the buffer cache. As noted, this still isn’t as efficient as going direct to a byte buffer in the same JVM, since it has to call into the kernel.

  • Liang Yi / March 06, 2012 / 11:00 PM

    Since it’s implemented as an L2 cache, I guess we still need to make the L1 cache as large as possible until it causes long GC pauses. So far, in our HBase cluster with a 25G heap, we don’t see obvious STW GC pauses – how much memory have you used when you need this feature?
