Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 3

This is the third and final post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 and part 2 before continuing on to this post.

Recap

It’s been a few days since the first two posts, so let’s start with a quick refresher. In the first post we discussed Java garbage collection algorithms in general and explained that the problem of lengthy pauses in HBase has only gotten worse over time as heap sizes have grown. In the second post we ran an experiment showing that write workloads in HBase cause memory fragmentation as all newly inserted data is spread out into several MemStores which are freed at different points in time.

Arena Allocators and TLABs

As we saw in the previous post, the central issue is that data from different MemStores is all mixed together in the old generation. When we flush one MemStore, we free up only bits and pieces of the heap rather than any large contiguous chunks. In other words, we’re violating one of the assumptions of the Java GC model — namely, that objects allocated together in time tend to die together. The allocation pattern of a random write workload guarantees nearly the opposite.

In order to attack this issue, we simply need to ensure that data for each region is allocated from the same area in the heap. In a language with manual memory management, this is typically done using a well-known pattern called arena allocation. In this pattern, every allocation is associated with a larger area of memory called an arena — the arena is simply divided up into smaller pieces as memory is allocated.

The most commonly seen application of this concept is the thread-local allocation buffer, or TLAB. In this model, each execution thread has its own memory arena, and all allocations done by that thread come from its own arena. There are several benefits to the use of TLABs:

  1. There is often very good locality of access between a thread and the memory it allocates. For example, a thread that is processing some database request will need to allocate some local buffers which will be referred to over and over again during that request. Keeping all such buffers near each other in memory improves CPU cache locality and hence performance.
  2. Since only a single thread allocates from a given arena, allocations can be satisfied with no locks or atomic operations required. This is known as bump-the-pointer allocation (sketched in the example below). The TLAB needs to maintain only a single start pointer, and allocations are satisfied by incrementing it forward by some number of bytes. This makes TLAB allocation extremely efficient.

In fact, the Sun JVM uses TLABs by default for small allocations. You can learn more about TLABs in the JVM in this excellent blog post.
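
To make the arena and bump-the-pointer ideas concrete, here is a minimal sketch of a byte-array-backed arena in Java. It is an illustration only, not code from HBase or the JVM, and the class and method names are invented for this example.

    /**
     * A minimal single-threaded arena: one large byte[] plus a "bump" offset.
     * Allocation is just an offset increment; the whole arena is freed at once
     * when the last reference to it is dropped.
     */
    public class SimpleArena {
      private final byte[] buffer;
      private int nextFreeOffset = 0;

      public SimpleArena(int sizeBytes) {
        this.buffer = new byte[sizeBytes];
      }

      /**
       * Reserve length bytes and return the offset of the reservation, or -1
       * if the arena is full. No locks or atomics are needed as long as only
       * one thread allocates from this arena -- exactly the TLAB situation.
       */
      public int allocate(int length) {
        if (nextFreeOffset + length > buffer.length) {
          return -1; // caller must fall back to a fresh arena
        }
        int allocatedAt = nextFreeOffset;
        nextFreeOffset += length; // bump the pointer
        return allocatedAt;
      }

      public byte[] backingArray() {
        return buffer;
      }
    }

A caller copies its data into backingArray() at the returned offset; there is no per-allocation free, which is what makes both allocation and reclamation so cheap.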

MemStore-Local Allocation Buffers

Unfortunately, the TLABs used in the JVM do not help solve the fragmentation issue experienced by HBase. This is because an individual handler thread in HBase actually handles requests for many different regions throughout its lifetime – so even though the allocations come from a single thread-local arena, data for different MemStores is intermixed within the TLAB. When the memory is promoted to the old generation, it remains intermingled.

However, it’s not too difficult to borrow the concept and apply the same idea to MemStores — coining the term MemStore-Local Allocation Buffer (MSLAB). Whenever a request thread needs to insert data into a MemStore, it shouldn’t allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region. This should have the following benefits:

  1. First and foremost, this means that data for different MemStores will no longer be intermingled on the heap. When we flush a MemStore, we will be able to free its entire arena, and thus create a large contiguous free chunk in the old generation. This will hopefully reduce fragmentation and solve the garbage collection pause issue.
  2. Additionally, we hope to see some benefit from CPU cache locality within a region. HBase read operations target individual regions, and often need to sort or search through the data in a single MemStore. By keeping all of a MemStore’s data close together in memory, we should expect improved CPU cache locality and better performance.

Implementation

Unfortunately, standard Java does not give programmers the ability to allocate objects from memory arenas. But in the case of HBase, we are not dealing with particularly large objects, or even very many of them: each piece of data consists of a single KeyValue object, which is small. Additionally, every KeyValue object is the same size, so these objects do not cause significant fragmentation. Rather, it is the byte[] arrays referred to by the KeyValue objects that cause the fragmentation. So we simply need to ensure that those byte[] arrays are allocated from MSLABs instead of the heap at large.

It turns out this is not very difficult! The KeyValue class doesn’t just contain a byte[], but also an offset field pointing into the byte array. So in order to place the data for different KeyValue objects near each other, we just need to take slices of a larger byte[] representing the MSLAB arena. The implementation looks something like this:

  • Each MemStore instance has an associated instance of a new class MemStoreLAB.
  • MemStoreLAB retains a structure called curChunk which consists of a 2MB byte[] and a nextFreeOffset pointer starting at 0.
  • When a KeyValue is about to be inserted into the MemStore, it is first copied into curChunk and the nextFreeOffset pointer is bumped by the length of the new data.
  • Should the 2MB chunk fill up, a new one is allocated from the JVM using the usual method: new byte[2*1024*1024].

In order to keep this efficient, the entire algorithm is implemented lock-free, using atomic compare-and-swap operations on the nextFreeOffset pointer and the curChunk structure.
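
The sketch below illustrates that lock-free allocation path. It is a simplified approximation of the idea, not the actual MemStoreLAB code in HBase; in particular, the Allocation holder class and the exact retry structure shown here are assumptions made for this example.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReference;

    /** Simplified sketch of a MemStore-local allocation buffer (not the real HBase class). */
    public class SketchMemStoreLAB {
      private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB, matching the default

      /** A 2MB slab plus a bump pointer that is advanced via CAS. */
      private static class Chunk {
        final byte[] data = new byte[CHUNK_SIZE];
        final AtomicInteger nextFreeOffset = new AtomicInteger(0);
      }

      /** Result of an allocation: which byte[] to copy into, and at what offset. */
      public static class Allocation {
        public final byte[] data;
        public final int offset;
        Allocation(byte[] data, int offset) { this.data = data; this.offset = offset; }
      }

      private final AtomicReference<Chunk> curChunk = new AtomicReference<Chunk>(new Chunk());

      /** Reserve size bytes, retrying if the chunk fills up or another thread wins a race. */
      public Allocation allocateBytes(int size) {
        if (size > CHUNK_SIZE) {
          // Oversized values bypass the LAB entirely (cf. mslab.max.allocation below).
          return new Allocation(new byte[size], 0);
        }
        while (true) {
          Chunk c = curChunk.get();
          int oldOffset = c.nextFreeOffset.get();
          if (oldOffset + size > CHUNK_SIZE) {
            // Current chunk is full: try to install a fresh one and retry.
            // If the CAS fails, another thread already swapped in a new chunk.
            curChunk.compareAndSet(c, new Chunk());
            continue;
          }
          // Claim the slice [oldOffset, oldOffset + size) by bumping the pointer with CAS.
          if (c.nextFreeOffset.compareAndSet(oldOffset, oldOffset + size)) {
            return new Allocation(c.data, oldOffset);
          }
          // Another thread claimed that slice first; loop and try again.
        }
      }
    }

The caller then copies the incoming KeyValue’s backing bytes into data at offset (for example with System.arraycopy) and builds a KeyValue pointing at that slice, which is what keeps all of a MemStore’s data packed into the same 2MB chunks.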

Results

After implementing MSLABs, we expected to see significantly less fragmentation. So, to confirm this, I ran the same write-load generator as described in the prior post and graphed the results with the same methodology:

This graph shows the experiment beginning with an entirely empty heap when the Region Server is started, and continuing through about an hour and a half of write load. As before, we see the free_space graph fluctuate back and forth as the concurrent collector runs. The max_chunk graph drops quickly at first as memory is allocated, but then eventually stabilizes. I’ve also included num_blocks — the total number of separate free chunks in the old generation — in this graph. You can see that this metric also stabilizes after an hour or so of runtime.

The Best News of All

After producing the above graph, I let the insert workload run overnight and then kept it running for several days. In all that time, there was not a single GC pause that lasted longer than a second. The fragmentation problem was completely solved for this workload!

How to try it

The MSLAB allocation scheme is available in Apache HBase 0.90.1 and is part of CDH3 Beta 4, released last week. Since it is relatively new, it is not yet enabled by default, but it can be configured using the following flags:

  • hbase.hregion.memstore.mslab.enabled: set to true to enable the feature
  • hbase.hregion.memstore.mslab.chunksize: the size of each chunk allocated by the MSLAB, in bytes (default 2MB)
  • hbase.hregion.memstore.mslab.max.allocation: the maximum size of byte[] that will be allocated from the MSLAB, in bytes (default 256KB); anything larger is allocated directly from the heap
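
For example, the feature could be switched on in hbase-site.xml on the region servers with something like the following (the chunk size and max allocation values shown are just the documented defaults, included for illustration; a region server restart is needed for the settings to take effect):

    <property>
      <name>hbase.hregion.memstore.mslab.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.hregion.memstore.mslab.chunksize</name>
      <value>2097152</value>  <!-- 2MB chunks (the default) -->
    </property>
    <property>
      <name>hbase.hregion.memstore.mslab.max.allocation</name>
      <value>262144</value>   <!-- 256KB (the default); larger allocations bypass the MSLAB -->
    </property>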

Future Work

As this is a very new feature, there are still a few rough edges to be worked out. In particular, each region now has a minimum of 2MB of memory usage on the region server, which means that a server hosting thousands of regions could have several GB of wasted memory sitting in unused allocation buffers (for example, 2,000 mostly idle regions would tie up roughly 4GB). We need to figure out a good heuristic to automatically tune the chunk size and avoid this kind of situation.

There are also a few more efficiency gains to be made. For example, we currently do an extra memory copy of data when moving it into the MSLAB chunk. This can be avoided for a modest CPU improvement.

Lastly, some work needs to be done to re-tune our suggested garbage collection tuning parameters. Given this improvement, we may be able to set -XX:CMSInitiatingOccupancyFraction to a higher value than we recommended for prior versions.

Conclusion

If you’ve been having problems with garbage collection pauses in Apache HBase, please give this experimental option a try and report back your results. According to our synthetic test workloads, it may significantly reduce or even eliminate the problem!

I had a great time working with HBase and the JVM to understand these memory fragmentation behaviors and then designing and implementing the MSLAB solution. If you found this series of posts interesting, you might be just the kind of engineer we’re looking for to join Cloudera. Please be sure to check out our careers page and get in touch!

9 Responses
  • Noel Grandin / March 08, 2011 / 2:10 AM

    Suggestion: allocate the byte[] using the normal java mechanisms, and keep track of how much you allocate on a per-region basis.
    As soon as a given region exceeds 2M, switch to using an arena for that region.
    That should give you the best of both worlds.

  • Vladimir Rodionov / March 09, 2011 / 12:29 PM

    Todd, I think this approach (although very clever by itself) lacks scalability. The number of regions which can be managed by one Region Server now has a hard limit (in the low thousands) that depends on available RAM. I would keep all this data in off-heap memory to avoid heap fragmentation altogether (even if it requires some code refactoring). Btw, other types of applications which potentially have the same heap fragmentation / stop-the-world GC issue: all Java non-static caches, and applications which rely heavily on on-heap caching solutions.

  • Todd Lipcon / March 09, 2011 / 12:56 PM

    Noel: good idea. We’re definitely looking at heuristics like this to avoid the overhead. Basically this technique is about trading internal fragmentation for external fragmentation. Both have waste, just in different ways.

    Vladimir: it makes no difference whether the memory is on the Java heap or the C heap (what you’re referring to as “off heap”). Fragmentation is a problem in C/C++ just the same as in Java. The only difference is that Java can deal with it with a stop-the-world GC, whereas C will start swapping and crash. I encourage you to watch the following video to learn more about how this problem isn’t unique to the Java heap: http://www.facebook.com/video/video.php?v=696488619305

  • Vladimir Rodionov / March 10, 2011 / 11:45 AM

    Todd: The difference between Java and C/C++ is that C/C++ allows (and requires) you to free memory explicitly, which makes heap de-fragmentation more predictable in terms of maximum pause duration. Projects like Memcached prove that you can build a large-scale caching solution in C with predictable query latency. But my point was mostly about the lack of scalability, not the advantages of off-heap memory allocation, although I think that it is a good approach in some cases, especially when a large-scale Java application faces an SLA with quite strict query performance requirements. Anyway, what you have done is a good step forward in the right direction. We used to suffer a lot with previous versions of HBase because of these stop-the-world pauses.

  • Todd Lipcon / March 10, 2011 / 12:48 PM

    Yes, I agree that C’s advantage is in predictability — it’s impossible to write a “hard real time” application in standard Java (i.e., not RTSJ).

    I agree that projects like Memcache show you can build scalable caching with predictable latency. However, if you’re familiar with the memcache source, you’ll know that they’ve had to pull similar tricks with a slab allocator in order to avoid heap fragmentation.

    I don’t think there is anything inherently unscalable about the Java heap. There are unscalable elements to current GC designs, but designs like G1 and Azul’s Pauseless GC have better scalability properties and show that there’s nothing inherent about Java memory management that precludes large heaps.

  • Peeyush Aggarwal / March 15, 2011 / 1:57 AM

    Todd, this will just postpone the issue in some cases, not exactly fix it. I have been running similar tests on my own, and as far as I have seen, stop-the-world GCs are still happening, but their frequency is reduced. Would the G1 GC be of some help here, since it also does compaction after collection, with pause-time targets? And do you have any suggestions on how to tune the settings of your solution to eliminate stop-the-world GCs?

  • Todd Lipcon / March 15, 2011 / 5:29 PM

    Hi Peeyush,

    Yes, for some workloads it’s not a complete fix. I’d be interested to know more about your workload. What sizes are the cells, etc? If you can provide a YCSB workload that approximates yours, it will be a useful test case for us to stress-test against.

    G1 does seem promising, but there are a few bits of it that are currently broken. HBase’s use of ConcurrentSkipListSets ends up creating a lot of inter-region references, which then blow out the “remembered set” tracking data structures in G1. The rsets will track a certain number of “fine” references, after which they start tracking coarse region->region references, which are very expensive to process after a region has been compacted. I found this was blowing out the pause time estimates beyond reasonable pause time goals, and made G1 not much better than CMS in my test workload.

    I did a couple of patches against the OpenJDK 7 source, but it’s unlikely they’ll get included – they’ve mostly been met with silence on the hotspot GC list.

    Thanks
    -Todd
