Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 3
This is the third and final post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 and part 2 before continuing on to this post.
It’s been a few days since the first two posts, so let’s start with a quick refresher. In the first post we discussed Java garbage collection algorithms in general and explained that the problem of lengthy pauses in HBase has only gotten worse over time as heap sizes have grown. In the second post we ran an experiment showing that write workloads in HBase cause memory fragmentation as all newly inserted data is spread out into several MemStores which are freed at different points in time.
Arena Allocators and TLABs
As identified in the previous post, we saw that the central issue is that data from different MemStores is all mixed up in the old generation. When we flush one MemStore, we only free up bits and pieces of heap instead of any large chunks. In other words, we’re violating one of the assumptions of the Java GC model — namely, that objects allocated together in time tend to die together. The allocation pattern of a random write workload guarantees nearly the opposite.
In order to attack this issue, we simply need to ensure that data for each region is allocated from the same area in the heap. In a language with manual memory management, this is typically done using a well-known pattern called arena allocation. In this pattern, every allocation is associated with a larger area of memory called an arena — the arena is simply divided up into smaller pieces as memory is allocated.
The most commonly seen application of this concept is the thread-local allocation buffer, or TLAB. In this model, each execution thread has its own memory arena, and all allocations done by that thread come from its own arena. There are several benefits to the use of TLABs:
- There is often very good locality of access between a thread and the memory it allocates. For example, a thread that is processing some database request will need to allocate some local buffers which will be referred to over and over again during that request. Keeping all such buffers near each other in memory improves CPU cache locality and hence performance.
- Since allocation from the arena is only done by a single thread, allocations can be satisfied with no locks or atomic operations required. This is known as bump-the-pointer allocation. The TLAB needs to maintain only a single start pointer, and allocations are satisfied by incrementing it forward by some number of bytes. This makes TLAB allocation extremely efficient.
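To make the idea concrete, here is a minimal sketch of bump-the-pointer allocation, assuming a single owning thread so that no synchronization is needed. The class and method names are illustrative only, not the JVM's internal implementation:

```java
// Hypothetical single-threaded arena using bump-the-pointer allocation.
public class BumpThePointerArena {
    private final byte[] arena;
    private int nextFreeOffset = 0;

    public BumpThePointerArena(int sizeBytes) {
        arena = new byte[sizeBytes];
    }

    /**
     * Reserves numBytes in the arena by bumping the pointer forward.
     * Returns the offset of the allocated slice, or -1 if the arena is full
     * (in which case the caller must fall back to a slower allocation path).
     */
    public int allocate(int numBytes) {
        if (nextFreeOffset + numBytes > arena.length) {
            return -1;
        }
        int allocatedAt = nextFreeOffset;
        nextFreeOffset += numBytes; // "bump" the pointer forward
        return allocatedAt;
    }
}
```

Because the fast path is a bounds check plus one addition, this is about as cheap as allocation can get.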
In fact, the Sun JVM uses TLABs by default for small allocations. You can learn more about TLABs in the JVM in this excellent blog post.
MemStore-Local Allocation Buffers
Unfortunately, the TLABs used in the JVM do not help solve the fragmentation issue experienced by HBase. This is because an individual handler thread in HBase actually handles requests for different regions throughout its lifetime – so even though the allocations come from a single thread-local arena, data for different MemStores are intermixed within the TLAB. When the memory is promoted to the old generation, the data remains intermingled.
However, it’s not too difficult to borrow the concept and apply the same idea to MemStores — coining the term MemStore-Local Allocation Buffer (MSLAB). Whenever a request thread needs to insert data into a MemStore, it shouldn’t allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region. This should have the following benefits:
- First and foremost, this means that data for different MemStores will not be intermingled near each other. When we flush a MemStore, we will be able to free the entire arena, and thus create a large free chunk in the old generation. This will hopefully reduce fragmentation and solve the garbage collection pause issue.
- Additionally, we should hope to see some benefits from CPU cache locality within a region. HBase read operations target individual regions at a time, and often need to sort or search through data in a single MemStore. By moving all the bits of data for a MemStore to be near each other, we should expect to see improved CPU cache locality and better performance.
Unfortunately, standard Java does not give programmers the ability to allocate objects from memory arenas. But in the case of HBase, we're not dealing with particularly large objects, nor with very many of them: each piece of data consists of a single KeyValue object, which is small. Additionally, every KeyValue object is exactly the same size, so the objects themselves don't cause significant fragmentation. Rather, it's the byte arrays referred to by the KeyValue objects that cause the fragmentation. So we simply need to ensure that the byte arrays are allocated from MSLABs instead of the heap at large.
It turns out this is not very difficult! The KeyValue class doesn't just contain a byte array, but also an offset field pointing into that array. So in order to place the data for different KeyValue objects near each other, we just need to take slices of a larger byte array representing the MSLAB arena. The implementation looks something like this:
- Each MemStore instance has an associated instance of a new class MemStoreLAB.
- MemStoreLAB retains a structure called curChunk which consists of a 2MB byte array and a nextFreeOffset pointer starting at 0.
- When a KeyValue is about to be inserted into the MemStore, it is first copied into curChunk and the nextFreeOffset pointer is bumped by the length of the new data.
- Should the 2MB chunk fill up, a new one is allocated from the JVM using the usual method: new byte[2*1024*1024].
In order to keep this efficient, the entire algorithm is implemented lock-free, using atomic compare-and-swap operations on the nextFreeOffset pointer and the curChunk structure.
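The core allocation step can be sketched as follows. This is a simplified, hypothetical version for illustration: the real HBase MemStoreLAB also swaps in a fresh chunk via an atomic reference when the current one fills up, while this sketch shows only the compare-and-swap on the offset:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of lock-free MSLAB allocation within one chunk.
public class MemStoreLABSketch {
    private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB default
    private final byte[] curChunk = new byte[CHUNK_SIZE];
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    /**
     * Tries to reserve {@code size} bytes in the current chunk using
     * compare-and-swap, so concurrent writers need no locks.
     * Returns the offset of the reserved slice, or -1 if the chunk is
     * full (the caller would then allocate a fresh 2MB chunk).
     */
    public int allocate(int size) {
        while (true) {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > CHUNK_SIZE) {
                return -1;
            }
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size)) {
                return oldOffset; // won the race; this slice is ours
            }
            // another thread raced us to this offset; retry
        }
    }

    /** Copies a KeyValue's backing bytes into the reserved slice. */
    public void copyInto(int offset, byte[] data) {
        System.arraycopy(data, 0, curChunk, offset, data.length);
    }
}
```

If the CAS fails, it simply means another thread reserved that offset first, so the loop re-reads the offset and tries again; no writer ever blocks.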
After implementing MSLABs, we expected to see significantly less fragmentation. So, to confirm this, I ran the same write-load generator as described in the prior post and graphed the results with the same methodology:
This graph shows the experiment beginning with an entirely empty heap when the Region Server is started, and continuing through about an hour and a half of write load. As before, we see the free_space graph fluctuate back and forth as the concurrent collector runs. The max_chunk graph drops quickly at first as memory is allocated, but then eventually stabilizes. I’ve also included num_blocks — the total number of separate free chunks in the old generation — in this graph. You can see that this metric also stabilizes after an hour or so of runtime.
The Best News of All
After producing the above graph, I let the insert workload run overnight, and then let it continue for several days. In all of this time, there was not a single GC pause that lasted longer than a second. The fragmentation problem was completely solved for this workload!
How to try it
The MSLAB allocation scheme is available in Apache HBase 0.90.1, and is part of CDH3 Beta 4, released last week. Since it is relatively new, it is not yet enabled by default, but it can be configured using the following flags:
| Configuration flag | Description |
|---|---|
| `hbase.hregion.memstore.mslab.enabled` | Set to `true` to enable this feature |
| `hbase.hregion.memstore.mslab.chunksize` | The size of the chunks allocated by MSLAB, in bytes (default 2MB) |
| `hbase.hregion.memstore.mslab.max.allocation` | The maximum size byte array that should come from the MSLAB, in bytes (default 256KB) |
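As an illustration, enabling the feature in hbase-site.xml might look like the following; the property names are from the table above, and the values shown are simply the defaults expressed in bytes:

```xml
<!-- hbase-site.xml: enable MSLAB with the default sizes -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value> <!-- 2MB chunk size -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.max.allocation</name>
  <value>262144</value> <!-- 256KB: larger allocations bypass the MSLAB -->
</property>
```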
As this is a very new feature, there are still a few rough edges to be worked out. In particular, each region now has a minimum of 2MB of memory usage on the region server – this means that a server hosting thousands of regions could have several GB of wasted memory sitting in unused allocation buffers. We need to figure out a good heuristic to automatically tune chunk size and avoid this kind of situation.
There are also a few more efficiency gains to be made. For example, we currently do an extra memory copy of data when moving it into the MSLAB chunk. This can be avoided for a modest CPU improvement.
Lastly, some work needs to be done to re-tune the suggested garbage collection tuning parameters. Given this improvement, we may be able to tune the value of -XX:CMSInitiatingOccupancyFraction to a higher value than we did in prior versions.
If you’ve been having problems with garbage collection pauses in Apache HBase, please give this experimental option a try and report back your results. According to our synthetic test workloads, it may significantly reduce or even eliminate the problem!
I had a great time working with HBase and the JVM to understand these memory fragmentation behaviors and then designing and implementing the MSLAB solution. If you found this series of posts interesting, you might be just the kind of engineer we’re looking for to join Cloudera. Please be sure to check out our careers page and get in touch!