This case study is an instructive example of how performance analysis is a multi-faceted process that often leads one in surprising directions.
Apache Solr Near Real Time (NRT) Search allows Solr users to search documents indexed just seconds ago. It’s a critical feature in many real-time analytics applications. As Solr indexes more and more documents in near real time, end-user expectations for performance get higher and higher.
However, recently the Cloudera Search team found that Solr NRT indexing throughput often hit a bottleneck even when there are plenty of CPU, disk, and network resources available. Latency was average, in the hundreds of milliseconds range. Considering that Solr NRT indexing is a mainly machine-to-machine operation, without a human waiting for indexing to complete, that latency range was actually fairly good.
Furthermore, some customers reported other issues under heavy Solr NRT indexing workloads, such as connection resets, that could be “cascading” performance effects as a result of the throughput bottleneck. This behavior changed little even with different indexing batch sizes, numbers of indexing threads, or numbers of shards and replicas involved.
In the remainder of this post, I’ll describe how we identified the cause of this bottleneck via scientific method for performance analysis and custom tools, designed a permanent fix in partnership with Solr committers, and the lessons learned for doing performance analysis overall.
Problem Statement and Initial Approach
In addition to the throughput bottleneck, in some quick experiments by the Cloudera Search team in response to customer reports, Solr NRT showed seemingly random fluctuation in indexing throughput and latency under stable load (see below)—whereas when the load was not evenly distributed across all nodes, a sharding algorithm worked just as well. Based on this testing, it appeared that the Solr node with higher load was more likely to break before the system reached its limit.
Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.
In my own testing, the profiler data indeed revealed some contention particularly related to
HdfsUpdateLog locks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.
Next, with some help from Solr committers, I tried to find any code blocks that could be optimized to minimize the problem. Unfortunately, this path was also a dead end due to the complexity of the code around these locks.
Another typical approach I took was to scan all logs for any explicit errors or warning messages. Unfortunately, any errors or warnings found were either irrelevant or likely cascading effects.
It was beginning to look like we faced a case of implicit failure, which is a failure without explicit data points that suggest its cause. I was possibly on the wrong path.
Performance Analysis in Complex Systems
Although these initial results were frustrating, performance analysis typically begins this way because the way forward is often unclear.
At this point, it’s helpful to review some fundamentals:
- As many performance issues are implicit failures, which can’t be associated with an explicit data point such as an error or warning message, it can be hard to pick the right direction in which to start. In practice, engineers often have to make several “false starts.”
- Many performance issues are non-fatal failures that have no single static state in which the failure is represented. Thus, engineers have to analyze the system dynamically, while it’s running.
- When cascading effects are present—meaning a non-performant component causes other components to work in non-performant way—they can lead to dead ends. Furthermore, identifying root causes from cascading effects is often not trivial; it requires engineers to untangle complex relationships across components.
- Multiple performance issues can be present in a running system, and it can be difficult to prioritize them. (“The real task isn’t finding an issue, it’s identifying the issue or issues that matter the most,” says Brendan Gregg in his book Systems Performance.) Therefore, in addition to exploration, engineers have to evaluate the impact of every issue found and the potential cost of the fix to decide if it is the issue “that matters.”
To summarize, as DTrace co-inventor (and Joyent CTO) Bryan Cantrill has pointed out, “Problems that are both implicit and non-fatal represent the most time-consuming, most difficult problems to debug because the system must be understood against its will.”
In addition to these general performance challenges, lock-contention optimization requires a thorough understanding of the code involved and close collaboration with the upstream community to design and deploy a permanent fix.
So, I had my work cut out for me. It was time to explore some other options.
Beginning the Exploration
As I explained previously, implicit failures often lead to false starts. In my case, they included (just a partial list):
- Inspecting Linux metrics collected by SAR, such as CPU, disk, network, paging. Nothing jumped out except for high context switching during experiments.
- Inspecting HDFS logs for HDFS I/O write latency and HDFS I/O load distribution across all nodes. (As Cloudera Search is built on top of HDFS, any I/O performance issue in HDFS can impact Solr performance.)
- Observing the GC. During experiments a small GC spike was observed; apparently a large amount of memory recycled during some short period of time. However, GC overhead and pause time were low, so GC was unlikely to be the cause of the bottleneck.
Despite these false starts, I still had a feeling that some piece of the Solr indexing code path needed improvement. The little GC spike stood out; it was true that the memory unhelpfully recycling wasted some CPU, but that alone should not result in a performance problem because there were still plenty of CPU cycles available when Solr hit the bottleneck. Could this memory recycling be causing problems other than burning some CPU cycles, maybe in combination with something else we didn’t know yet?
I decided to explore memory recycling more deeply. In particular, memory allocation is interesting because that’s where most CPU cycles are spent. The JVM exposes a few metrics about memory usage through JMX, which is tracked by Cloudera Manager. The closest metric is JVM heap usage; tracking heap usage can provide a sense of how memory is allocated. However, in this case, memory was recycled at a very high rate so the high amount of memory released cancelled out the memory allocated, thus implying roughly stable numbers during the memory recycle period.
Thus, the team decided to define a new metric that would better quantify the rate of memory allocation. We call it the Solr memory allocation estimation:
Solr memory allocation est. = sum of memory allocated in every second
Unlike JVM heap usage, this metric doesn’t take any memory released into account. It’s calculated based on the JVM GC log, and although it’s just an estimate (the log doesn’t provide all the data required to calculate the metric precisely), the result is pretty good in practice. The metric is implemented via a custom ETL pipeline built on top of a data processing API used internally by the Cloudera Search team; after data processing, the metric is visualized in a custom Grafana dashboard. (Because off-the-shelf performance tools or logs often don’t give us the exact information we need, we’ve built and routinely use a suite of custom-built ones.)
The following is a visualization of Solr memory allocation (in KB/second) observed during testing. As you can see, during a short period of time, Solr allocated as much as 9GB of memory per second. And from the GC log, we knew that most memory allocated was released almost immediately.
The second observation from the chart is how dramatically memory allocation can change in a short period of time. That’s an example of the dynamic nature of non-fatal failures.
What could cause such a huge memory allocation spike? The initial hypothesis was that the Apache Lucene library inside Solr was doing a segment merge. However, it was unclear how a segment merge could result in such massive memory allocation.
Although this direction turned out to be a dead end, the observation was interesting. I felt there could be a better chance for a breakthrough if we did more observations from different angles.
The visualization of Solr memory recycling not only told us how much memory was recycled, but also when it happened. It became clear that if we used the timing information to narrow the profiler data obtained previously, we might get a breakthrough.
Indeed, we did: after further testing, we learned that during the short period of time when Solr recycled that huge amount of memory, most of it went to Solr’s Block Cache.
Cloudera Search uses HDFS for more robust, scalable index file storage than local disk, and Block Cache is used to cache HDFS blocks when accessing indexes to improve performance. (It provides functionality similar to the Linux filesystem cache for HDFS.) In a typical setting, Block Cache uses off-heap memory to store the cached blocks. In JVM heap, it stores only metadata used to track the blocks stored in off-heap memory.
Two questions come out of this observation:
- Why was Block Cache recycling such a huge amount of memory at some seemingly random point? Block Cache only stores metadata on heap. Therefore, one would expect a small amount of heap memory allocation events for Block Cache.
- Was memory recycling related to the performance bottleneck?
While a Solr committer was trying to figure out Question 1, I focused on Question 2. Looking at some early analysis that implied lock contention in indexing, another hypothesis came to me: that the memory recycling might be happening inside the critical section of a lock. The exact combination of lock contention, and CPU cycles wasted on memory recycling, could be slowing down Solr significantly.
If that were true, the stack of these memory-allocation events would be inside one of the lock critical sections. To verify, we needed to collect the stack of each suspicious memory allocation event, aggregate the stacks, and compare the source code to see if it matched any lock critical section. The answer was yes; the profiler revealed that heavy lock contention in Solr NRT indexing caused by excessive Block Cache-related memory allocation in a critical section (see profiler screenshot below)!
Solution & Results
Now that the root cause of the lock contention was clear, it was time to collaborate with the Solr committers on the Cloudera Search team for the solution. Soon, Mark Miller proposed the idea that there is no need to use Block Cache to cache data used only once or if data is used just for merging; preventing Block Cache usage in those situations would significantly reduce Block Cache-related memory allocations. Fortunately, Lucene is designed with performance in mind and provides a hint indicating when data is read once or used for merging. This work resulted in SOLR-8856.
In addition, other issues, such as SOLR-8857 and LUCENE-7115, were uncovered and fixed during this analysis. Some Linux and Tomcat network tuning suggestions were also communicated to Cloudera Support engineers and will be published in a future release of the Cloudera Search user guide.
As shown below, with these changes in place, a ~50% gain in throughput and ~50% drop in latency are now observed in our nightly performance testing. Although those results don’t include the gain from cascading effects because a system under test is carefully isolated in the performance lab, good isolation is also rare in real production systems. Therefore, customers are likely to observe even more performance improvement in their production systems due to a reduction in cascading effects.
We’ve also found that fluctuations in throughput, latency, and unevenness in load distribution are also now greatly reduced; system reliability is also improved because a weak point has been removed. So far, customer experiences have been consistent with these results.
I hope that this post serves as a good example of some of the nuances involved in performance analysis, including:
- The difficulties in diagnosing implicit failures
- Analysis of a non-fatal failure that is highly transitional
- How the use of scientific method, off-the-shelf tools, and custom-built tools are complementary
- How false starts can spark the imagination and lead to new productive directions
- The value of persistence, especially in the face of frustration
- How collaboration with upstream committers is necessary for permanent results
Furthermore, at Cloudera, the efforts to improve Solr performance never end. Although this particular issue was resolved, it has revealed more opportunities for optimization that have a place on the Cloudera roadmap. We’ll describe some of those in future posts as the work is completed.
Michael Sun is a Software Engineer at Cloudera, working on the Cloudera Search team.