Avoiding Full GCs in HBase with MemStore-Local Allocation Buffers: Part 2

This is the second post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 before continuing on to this post.

Recap from Part 1

In last week’s post, we noted that HBase has had problems coping with long garbage collection pauses, and we summarized the different garbage collection algorithms commonly used for HBase on the Sun/Oracle Java 6 JVM. Then, we hypothesized that the long garbage collection pauses are due to memory fragmentation, and devised an experiment to both confirm this hypothesis and investigate which workloads are most prone to this problem.

Experimental Results

Overview

As described in the previous post, I ran three different workload types against an HBase region server while collecting verbose GC logs with -XX:PrintFLSStatistics=1. I then wrote a short Python script to parse the results and reformat them into a TSV file, and graphed the resulting metrics using my favorite R graphing library, ggplot2.
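
The parsing step is simple enough to sketch. Below is a minimal, illustrative version (not the original script): it assumes -XX:+PrintGCTimeStamps-style leading timestamps on GC lines, plus the "Total Free Space:" and "Max Chunk Size:" lines that -XX:PrintFLSStatistics=1 emits, with values in heap words.

    import re
    import sys

    # Illustrative parser for CMS logs with -XX:PrintFLSStatistics=1.
    # Pairs each FLS statistics block with the most recent GC timestamp
    # and emits TSV: time_sec, free_space, max_chunk (all in heap words).
    TIME_RE = re.compile(r'^(\d+\.\d+): \[')        # e.g. "102800.123: [GC ..."
    FREE_RE = re.compile(r'Total Free Space:\s*(\d+)')
    CHUNK_RE = re.compile(r'Max\s+Chunk Size:\s*(\d+)')

    def parse(log):
        time_sec, free_space = None, None
        print('time_sec\tfree_space\tmax_chunk')
        for line in log:
            m = TIME_RE.match(line)
            if m:
                time_sec = float(m.group(1))
            m = FREE_RE.search(line)
            if m:
                free_space = int(m.group(1))
            m = CHUNK_RE.search(line)
            if m and time_sec is not None and free_space is not None:
                print('%.3f\t%d\t%d' % (time_sec, free_space, int(m.group(1))))

    if __name__ == '__main__':
        parse(sys.stdin)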

The top part of the graph shows free_space, the total amount of free space in the heap. The bottom part shows max_chunk, the size of the largest chunk of contiguous free space. The X axis is time in seconds, and the Y axis has units of heap words. In this case a word is 8 bytes, since I am running a 64-bit JVM.

It was immediately obvious from this overview graph that the three different workloads have very different memory characteristics. We’ll zoom in on each in turn.

Write-only Workload

Zoomed in on the write-only workload, we can see two interesting patterns:

  1. The free_space graph shows a fluctuation between about 350 megawords (2.8GB) and 475 megawords (3.8GB). Each time the free space hits 2.8GB, the CMS collector kicks in and frees up about 1GB of space. This shows that the CMS initiating occupancy fraction has been tuned to a low enough value – there is always a significant amount of free space in the heap. We can also see that there are no memory leaks – the heap usage keeps a fairly consistent profile over time and doesn’t trend in any direction.
  2. Although the CMS collector is kicking in to free up heap, the max_chunk graph drops precipitously to nearly 0. Each time it reaches 0 (e.g., at around t=102800), we see a sharp spike back up to a large value.

By correlating this graph with the GC logs, I noted that the long full GCs corresponded exactly to the vertical spikes in the max_chunk graph — after each of these full GCs, the heap had been defragmented, so all of the free space was in one large chunk.

So, we’ve learned that the write load does indeed cause heap fragmentation and that the long pauses occur when there are no large free chunks left in the heap.

Read-only Workload with Cache Churn

In the second workload, the clients perform only reads, and the set of records to be read is much larger than the size of the LRU block cache. So, we see a large amount of memory churn as items are pulled into and evicted from the cache.
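
As an aside, the churn mechanism is easy to see in a toy LRU cache (a hypothetical sketch, not HBase's actual block cache): when the working set is much larger than the cache, nearly every read evicts one block and allocates a replacement.

    import random
    from collections import OrderedDict

    # Toy LRU cache to illustrate churn; not HBase's actual block cache.
    class LruCache(object):
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = OrderedDict()
            self.evictions = 0

        def get(self, key, load_block):
            if key in self.blocks:
                self.blocks.move_to_end(key)     # mark most recently used
                return self.blocks[key]
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
                self.evictions += 1
            block = load_block(key)              # stand-in for a disk read
            self.blocks[key] = block
            return block

    # Working set 10x the cache size: nearly every get() causes an eviction.
    cache = LruCache(capacity=1000)
    for _ in range(100000):
        cache.get(random.randrange(10000), lambda k: bytearray(64 * 1024))
    print('evictions: %d' % cache.evictions)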

The free_space graph reflects this – it shows much more frequent collections than the write-only workload. However, we note that the max_chunk graph stays pretty constant around its starting value. This suggests that the read-only workload doesn’t cause heap fragmentation nearly as badly as the write workload, even though the memory churn is much higher.

Read-only Workload without Cache Churn

The third workload, colored green in the overview graph, turned out to be quite boring. Because there’s no cache churn, the only allocations were short-lived objects used to service each RPC request. Hence, they were never promoted to the old generation, and both the free_space and max_chunk time series stayed entirely constant.

Experiment Summary

To summarize the results of this experiment:

  • The full GCs we’d like to eliminate are due to fragmentation, not concurrent-mode failure.
  • The write-only workload causes fragmentation much more than either read workload.

Why HBase Writes Fragment the Heap

Now that we know that write workloads cause rapid heap fragmentation, let’s take a step back and think about why. In order to do so, we’ll take a brief digression to give an overview of how HBase’s write path works.

The HBase Write Path

In order to store a very large dataset distributed across many machines, Apache HBase partitions each table into segments called Regions. Each region has a designated “start key” and “stop key”, and contains every row where the key falls between the two. This scheme can be compared to primary key-based range partitions in an RDBMS, though HBase manages the partitions automatically and transparently. Each region is typically less than a gigabyte in size, so every server in an HBase cluster is responsible for several hundred regions. Read and write requests are routed to the server currently hosting the target region.
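
As a toy illustration of the routing step (the real client resolves regions through HBase's META table; the keys and server names below are made up), finding a row's region is just a range search over the sorted region start keys.

    import bisect

    # Hypothetical region layout: each region covers [start key, next start key).
    region_start_keys = ['', 'f', 'm', 't']        # sorted start keys
    region_servers = ['rs1', 'rs2', 'rs1', 'rs3']  # server hosting each region

    def find_region(row_key):
        # The row belongs to the last region whose start key is <= row_key.
        return bisect.bisect_right(region_start_keys, row_key) - 1

    for row in ['apple', 'grape', 'zucchini']:
        idx = find_region(row)
        print('%s -> region %d on %s' % (row, idx, region_servers[idx]))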

Once a write request has reached the correct server, the new data is added to an in-memory structure called a MemStore. This is essentially a sorted map, per region, containing all recently written data. Of course, memory is a finite resource, and thus the region server carefully accounts for memory usage and triggers a flush on a MemStore when the usage has crossed some threshold. The flush writes the data to disk and frees up the memory.
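
The sketch below is a toy model of that accounting-and-flush cycle, not HBase's actual MemStore (which is a concurrent sorted map of KeyValues). The 64MB threshold echoes the flavor of hbase.hregion.memstore.flush.size, and write_to_hfile is a hypothetical stand-in for the real HFile writer.

    # Toy model of a per-region MemStore with a size-triggered flush.
    FLUSH_THRESHOLD = 64 * 1024 * 1024  # illustrative value; see above

    def write_to_hfile(region_name, row, value):
        # Hypothetical stub standing in for HBase's on-disk HFile writer.
        pass

    class MemStore(object):
        def __init__(self, region_name):
            self.region_name = region_name
            self.kvs = {}        # row key -> value bytes
            self.heap_size = 0   # bytes of heap attributed to this MemStore

        def add(self, row, value):
            self.kvs[row] = value
            self.heap_size += len(row) + len(value)
            if self.heap_size >= FLUSH_THRESHOLD:
                self.flush()

        def flush(self):
            # Persist all KVs in sorted order, then drop the in-memory
            # copies, freeing the heap they occupied.
            for row in sorted(self.kvs):
                write_to_hfile(self.region_name, row, self.kvs[row])
            self.kvs.clear()
            self.heap_size = 0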

MemStore Fragmentation

Let’s imagine that a region server is hosting 5 regions — colored pink, blue, green, red, and yellow in the diagram below. It is being subjected to a random write workload where the writes are spread evenly across the regions and arrive in no particular order.

As the writes come in, new buffers are allocated for each row, and these buffers are promoted into the old generation, because they stay in the MemStore for several minutes waiting to be flushed. Since the writes arrive in no particular order, data from different regions is intermingled in the old generation. When one of the regions is flushed, however, we can’t free up any large contiguous chunk, and we’re guaranteed to get fragmentation:

This behavior results in exactly what our experiment showed: over time, writes will always cause severe fragmentation in the old generation, leading to a full garbage collection pause.
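
A toy simulation makes the effect visible (made-up sizes, not real HBase numbers): fill an “old generation” with equal-sized buffers allocated round-robin across five regions, flush one region, and measure the largest contiguous free chunk.

    # Simulate interleaved MemStore allocations in the old generation.
    NUM_REGIONS = 5
    NUM_SLOTS = 1000  # equal-sized buffers, allocated round-robin

    # heap[i] is the region owning slot i, or None once the slot is freed.
    heap = [i % NUM_REGIONS for i in range(NUM_SLOTS)]

    # Flush region 0: free every slot it owned.
    heap = [owner if owner != 0 else None for owner in heap]

    # Measure total free space and the longest contiguous free run.
    free_slots = heap.count(None)
    max_chunk, run = 0, 0
    for owner in heap:
        run = run + 1 if owner is None else 0
        max_chunk = max(max_chunk, run)

    print('free slots: %d of %d' % (free_slots, NUM_SLOTS))     # 200 of 1000
    print('max contiguous free chunk: %d slot(s)' % max_chunk)  # 1

Flushing one region frees 20% of the heap, yet the largest contiguous chunk is a single slot: plenty of free space, but nothing a large allocation can use. That is exactly the max_chunk collapse we observed in the write-only workload.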

To be continued…

In this post we reviewed the results of our experiment, and came to understand why writes in HBase cause memory fragmentation. In the next and last post in this series, we’ll look at the design of MemStore-Local Allocation Buffers, which avoid fragmentation and thus avoid full GCs.

21 Responses
  • Jason / March 01, 2011 / 7:10 AM

    I think the easiest way to solve these problems is to remove the usage of the LRU cache and rely on the file system’s IO cache, which is going to be much more efficient than a similar system built inside of the Java heap space.

  • Todd Lipcon / March 01, 2011 / 12:19 PM

    Hi Jason,

    As you can see in this experiment, the LRU cache is not the issue. The issue is the MemStore.

    Moving MemStore out of the Java heap would be one possibility but is very difficult since it would require a wholesale switchover from byte[] based APIs to ByteBuffer based APIs. This has been considered but avoided because byte[] is spread far and wide throughout the HBase code.

    -Todd

  • Jason / March 02, 2011 / 9:15 AM

    Todd,

    How would the ByteBuffer be used (e.g., for memory mapping)?

    Jason

  • Ashwin Jayaprakash / March 02, 2011 / 12:02 PM

    For future ref – the JIRA issue: https://issues.apache.org/jira/browse/HBASE-3455

    Does slicing ByteBuffers and using chunks from a large array affect concurrency?

    Are you guys aware that Terracotta also tried doing something similar – BigMemory? You have a rudimentary memory manager inside a managed platform :)

  • Todd Lipcon / March 02, 2011 / 12:19 PM

    Jason: The idea behind a ByteBuffer based interface is that we could allocate off-heap using ByteBuffer.allocateDirect. More interesting, though, is that we could allocate off-heap using JNI code to mmap MAP_ANON regions to build our allocator efficiently. I’ll cover this in the next post in this series.

    Ashwin: you’ve jumped the gun a bit by asking a question about Part 3 :) I hope to finish writing Part 3 today, and will try to address your questions within that post.

    Regarding Terracotta “BigMemory”, yes, we are aware of it, but it’s substantially different – they actually use Java serialization to move object trees back and forth from the Java heap to the native heap. This doesn’t solve our problem.

    -Todd

  • Ashwin Jayaprakash / March 02, 2011 / 12:37 PM

    Oh ok :)

    I was looking at MemStore.java and I see only adds use the lock but deletes don’t? (Perhaps I’m missing something or should wait for part 3)

    And multiple threads reading without a lock while a thread is writing to a byte[] without atomic ops/fences works? (Again…I’m sure I’ve missed something because I don’t know all the internals of KV and other Hbase components)

  • Todd Lipcon / March 02, 2011 / 12:41 PM

    Ashwin: I think these questions are more generally about the locking design in HBase rather than about the particular memory-related things discussed in this blog series. If you use IRC, you can drop by #hbase on irc.freenode.net — most of the HBase committers are available there, and it might be a better place to ask questions about the code.

    -Todd

  • Ashwin Jayaprakash / March 02, 2011 / 7:23 PM

    Out of curiosity I asked the concurrency gurus about byte[] and atomicity. Follow this thread if you are interested: http://cs.oswego.edu/pipermail/concurrency-interest/2011-March/thread.html#7793

  • Andy / March 07, 2011 / 12:27 PM

    Has there been any attempt to see how HBase would perform running on the legacy BEA (now Oracle) JRockit JVM rather than the legacy Sun (now of course also Oracle) JVM?

    Andy

  • Andy / March 07, 2011 / 12:42 PM

    Henrik Stahl posted a short article regarding this topic (Repost: Tips and tricks for dealing with a fragmented Java heap) at http://blogs.oracle.com/jrockit/jrockit. It might be worthwhile to look at the use of the JRockit JVM for HBase vs. the legacy Sun JVM.

    Andy

  • Todd Lipcon / March 07, 2011 / 12:46 PM

    Yes, I tried JRockit’s “deterministic GC” and hit the same issues as I saw in Sun’s JVM.

    I also attempted to try the Azul “Zing” platform but the version I tried did not support JAAS. They have since fixed this, but I haven’t had a chance to try the new release yet.

  • Andy / March 07, 2011 / 2:23 PM

    Out of curiosity, was the JRockit instance OOTB or was there some specific tuning that was applied?

  • Todd Lipcon / March 07, 2011 / 3:05 PM

    When testing with JRRT I used -XgcPrio:deterministic -XpauseTarget=50

    -Todd

  • Tom / March 08, 2011 / 9:49 AM

    Since the issues were focused on the old generation and fragmentation, was there any consideration given to tuning by modifying NewRatio, NewSize, SurvivorRatio, and/or TargetSurvivorRatio (different names exist for these options if using JRockit)? Just wondering whether fragmentation could be minimized further by keeping more objects, for longer, in the young generation.

  • Adrian Muraru / March 09, 2011 / 5:46 AM

    Nice articles, Todd. Just out of curiosity, did you also evaluate the new G1 GC available in Java 6 update 14? It looks like it addresses the issues you’re seeing with heap fragmentation:
    http://goo.gl/05Nq6

  • Todd Lipcon / March 09, 2011 / 2:28 PM

    Tom: Yes, we have spent some effort tuning the young generation size. However, the memstores represent several GB of live data for a large heap, and each memstore should live for at least a couple of minutes. So, to fit all of memstore in young gen, we’d need a multi-GB generation, which results in ParNew pauses lasting several seconds. This is unacceptable for latency-sensitive workloads.

    Adrian: yes, I spent several days working with G1 and ran into a couple of bugs. You can read the gory details in the thread here: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2011-January/thread.html

  • Paul Smith / March 30, 2011 / 9:04 PM

    Todd, you wouldn’t be able to leave for posterity the full JVM options you used for this?

  • Todd Lipcon / March 31, 2011 / 10:13 AM

    Hi Paul,

    The JVM options used were:
    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=65 -Xms8g -Xmx8g

    In this particular test, the new size was not constrained, but it auto-tuned itself to 256MB (equivalent to -Xmn256m).

    Thanks
    -Todd

  • Cosmin Ene / April 06, 2011 / 5:07 AM

    Hi Todd,

    I’m trying to do the test for my project/data and I’m somehow stuck using the R graphing library, ggplot2, that you mention here. Can you share some details (scripts maybe) of how you did that?

    Thanks,
    Cosmin

  • Todd Lipcon / April 06, 2011 / 11:03 AM

    Hey Cosmin,

    You can find a description of the commands here:

    https://issues.apache.org/jira/browse/HBASE-3455?focusedCommentId=12984324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12984324

    -Todd
