Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
Defining Realistic Test Scenarios
Cloudera’s broad and diverse customer base makes testing against real-world scenarios a top concern. Realistic tests based on common use cases offer meaningful guidance, whereas guidance based on contrived, unrealistic testing often fails to translate to real-life deployments.
For Impala, our primary test workload directly replicates the queries, schema, and data selectivity of Impala customers. This approach allows us to test query and schema structure common to several classes of use cases. (We often collaborate with customers to obtain similar realistic test scenarios for all CDH components, as well.) Customers benefit by getting direct performance guidance for their specific use cases.
Because online analytical processing (OLAP) is one of many possible Impala use cases, we also run TPC-H, an established benchmark for OLAP, as a secondary workload for Impala. Although TPC-H provides a large, public repository of independently audited results, those results are based on a 1990s survey of OLAP users — so we run a subset of TPC-H queries that represent prominent use cases today.
For MapReduce, we use a collection of benchmarks designed to use different data formats and stress different stages of the MapReduce compute pipeline. Some examples include the TeraSort Suite (includes TeraGen, TeraSort, and TeraValidate), data shuffling jobs, and jobs that process varying amounts of randomness in the data. These tests are intentionally different from other MapReduce tests that use open source tools like SWIM to directly replay customer workloads of thousands of MapReduce jobs. (Such workloads capture complexity and diversity and will be added in the near future.) In contrast, for our initial multi-tenant tests, we emphasize precisely controlled tests that create easily repeatable, well-defined loads.
The multi-tenant tests combine our Impala queries, run concurrently, with our MapReduce benchmarks. We iterate through the combinations, pairing each Impala query with each MapReduce job in turn. Multiply this by the range of data sizes, data formats, Impala query concurrency levels, and MapReduce job concurrency levels, and the result is a test matrix of hundreds of settings. This broad test matrix builds confidence that our findings are meaningful and that Impala can share cluster resources with MapReduce as expected.
Modeling Multi-tenant Performance Expectations
Multi-tenant resource management involves assigning different resources to workloads from different computation frameworks. Modeling those assignments for the full matrix of hundreds of tests is intractable. Instead, for simplicity, we constrain resource assignments to statements like this: “When cluster resources are under contention, Impala gets fraction x of all resources, and MapReduce gets the rest.”
If Impala and MapReduce share resources well, having a fraction of all resources should slow down performance proportionally when the resources in question are under contention. In other words, Impala multi-tenant performance should drop to no less than fraction x of its stand-alone performance, and MapReduce multi-tenant performance should drop to no less than fraction 1-x of its stand-alone performance. Our goal is to find and validate a set of resource management controls such that the observed performance meets or exceeds these expectations.
This is the conceptually simplest model; more elaborate ones are possible. For example, if a MapReduce job runs longer than an Impala query, the multi-tenant slowdown would cover only the period during which both are running. The Impala query would see fraction x of stand-alone performance, as in the previous model. The MapReduce job would see fraction 1-x of stand-alone performance for the initial part of the job, and full stand-alone performance for the rest.
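The extended model can be sketched numerically. The following Python is an illustrative sketch only (the function name and the exactly-proportional-slowdown assumption are ours, not a Cloudera tool): it estimates multi-tenant runtimes from stand-alone runtimes when the MapReduce job outlasts the Impala query.

```python
def multitenant_runtimes(t_impala, t_mapreduce, x):
    """Estimate multi-tenant runtimes from stand-alone runtimes.

    t_impala, t_mapreduce: stand-alone runtimes (same time units).
    x: fraction of cluster resources assigned to Impala (0 < x < 1).
    Assumes slowdown is exactly proportional to the resource fraction
    and that the MapReduce job outlasts the Impala query.
    """
    # Impala runs at fraction x of its stand-alone speed throughout.
    impala_mt = t_impala / x
    # While the query runs, MapReduce proceeds at rate (1 - x)...
    overlap_work = (1 - x) * impala_mt
    if overlap_work >= t_mapreduce:
        raise ValueError("job finishes before the query; model does not apply")
    # ...then finishes its remaining work at full speed.
    mapreduce_mt = impala_mt + (t_mapreduce - overlap_work)
    return impala_mt, mapreduce_mt

# Example: a 10-minute query and a 60-minute job, 25% of resources to Impala.
impala_mt, mapreduce_mt = multitenant_runtimes(10, 60, 0.25)
# impala_mt = 40.0 minutes; mapreduce_mt = 70.0 minutes
```

Note that under these assumptions the job’s total runtime works out to t_mapreduce + t_impala: the job loses exactly as much time as the query consumed in total.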
Another extension of the model would be to cover non-uniform resource assignments. For example, most Impala queries have heavy memory demands and most MapReduce jobs have heavy disk IO demands. If we assign Impala 50% of CPU, 60% of memory, and 40% of disk, all the resource changes could simultaneously affect performance and we expect queries in general to slow by some range within 40-60%. This more complex style of resource allocations along multiple dimensions is supported in CDH but not evaluated here.
Controlling Shared Resources
The full set of multi-tenant resource management controls is documented in “Setting up a Multi-tenant Cluster for Impala and MapReduce” within the Cloudera Manager installation guide. The controls manage memory, CPU, and disk IO for each active compute framework. The documentation there contains numerical examples and step-by-step guides on how to configure each of the following resources.
- Memory. The controls include Impala Daemon Memory Limit, a direct setting for Impala memory consumption, and Maximum Number of Simultaneous Map Tasks and Maximum Number of Simultaneous Reduce Tasks, indirect memory controls for MapReduce. For memory management, our primary concern is to prevent memory thrashing, job/query failures, or one framework taking all the resources and preventing workloads in the other framework from making progress.
To give Impala a fraction x of all memory, we set Impala Daemon Memory Limit to (RAM size / accounting factor) * x. The “accounting factor” is a numerical constant to adjust for OS overhead in translating resident set size to actual RAM use. We’ve found that 1.5 is a relatively safe value, while 1.3 is a more realistic but also more risky value.
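As a concrete sketch of this arithmetic (the helper name is ours for illustration, not a Cloudera Manager setting):

```python
def impala_mem_limit_gb(ram_gb, x, accounting_factor=1.5):
    """Impala Daemon Memory Limit = (RAM size / accounting factor) * x.

    accounting_factor adjusts for OS overhead in translating resident
    set size to actual RAM use: 1.5 is the safer value, 1.3 the more
    realistic but riskier one.
    """
    return (ram_gb / accounting_factor) * x

# 64 GB nodes (as in our test cluster) with 25% of resources to Impala:
limit = impala_mem_limit_gb(64, 0.25)  # ~10.7 GB
```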
For MapReduce, we cut the default values for Maximum Number of Simultaneous Map and Reduce Tasks to fraction 1-x. For example, suppose we give Impala 25% of cluster resources (x = 25%); we would then cut Maximum Number of Simultaneous Map and Reduce Tasks to 75% of their default values. If the defaults are a maximum of 16 simultaneous map tasks and 8 simultaneous reduce tasks, we would set the new values to a maximum of 12 simultaneous map tasks and 6 simultaneous reduce tasks. This effectively gives MapReduce the remaining fraction 1-x of the cluster’s memory, as each task carries its own JVM container that holds the data and code needed for the map() and reduce() functions. With this change, any per-job tuning related to slot capacity should also be adjusted. (Note that a slot is the smallest unit of adjustment in MR1. In CDH 5 with YARN, we would instead specify the quantity of RAM needed; YARN hands out resources to ApplicationMasters, one of which happens to run MapReduce jobs.)
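The slot arithmetic can be sketched the same way (again an illustrative helper of our own, rounding down so the memory budget is never exceeded):

```python
import math

def scaled_slots(default_map, default_reduce, x):
    """Cut Maximum Number of Simultaneous Map/Reduce Tasks to
    fraction (1 - x) of their defaults, rounding down."""
    return (math.floor(default_map * (1 - x)),
            math.floor(default_reduce * (1 - x)))

# x = 25% to Impala, with defaults of 16 map slots and 8 reduce slots:
scaled_slots(16, 8, 0.25)  # (12, 6)
```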
- CPU. We control CPU use through Linux Control Groups (Cgroups). Cgroups is a Linux kernel feature that users can configure via Cloudera Manager 4.5.x and onward. The more Cgroup CPU Shares given to a role, the larger its share of the CPU when under contention. For example, a process with 4 CPU shares will be given roughly twice as much CPU time as a process with 2 CPU shares. We use these controls as “soft limits” — i.e., the settings have no effect until processes on the host (including both roles managed by Cloudera Manager and other system processes) are contending for all the CPUs.
In a multi-tenant workload, all Impala and MapReduce processes – Impala daemons, DataNodes, and TaskTrackers – may be simultaneously consuming CPU. To assign fraction x of CPU to Impala and the rest to MapReduce, we set fraction x of CPU shares to the Impala daemon, and half of the remainder each to the DataNode and the TaskTracker.
This is a conservative setting for MapReduce. It assumes that DataNode and task computational activity can simultaneously saturate the CPUs. If your MapReduce workload is far below cluster capacity, this is unlikely to be true. For example, with a single MapReduce job, DataNode (job input/output) and TaskTracker (task) CPU activity usually will not overlap. Only under a very large workload will DataNode and TaskTracker activity, in aggregate, be simultaneous.
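The share split described above amounts to the following arithmetic (a sketch; the role names and the 1000-share budget are illustrative, and only the ratios matter to Cgroups):

```python
def cgroup_cpu_shares(x, total=1000):
    """Give fraction x of CPU shares to the Impala daemon, and half
    the remainder each to the DataNode and the TaskTracker."""
    impala = round(total * x)
    rest = total - impala
    return {"impalad": impala,
            "datanode": rest // 2,
            "tasktracker": rest - rest // 2}

cgroup_cpu_shares(0.25)
# {'impalad': 250, 'datanode': 375, 'tasktracker': 375}
```

The same split applies to the Cgroup I/O weights described next.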
- Disk. Disk IO is also controlled by Cgroups. The more Cgroup I/O Weight given to a role, the higher the priority given to that role’s I/O requests when I/O is under contention. The effect is nearly identical to that of the CPU controls.
We also choose a conservative setting to assign disk resources: The Impala daemon gets fraction x of IO weights; half of the remainder goes to the DataNode, and the other half goes to the TaskTracker.
Multi-tenant Performance Results
Our tests have these goals:
- Measure uncontrolled multi-tenant behavior
- Measure multi-tenant behavior using controls described so far
- Understand resource contention in both sets of tests above
- Understand stand-alone behavior subject to resource management controls
Our test cluster machines have 64GB memory, 12 cores of 2.0GHz Intel Xeon, 12x 2TB disks, and 10Gbps Ethernet. The baseline behavior is MapReduce running by itself, and Impala running by itself, on the cluster without any resource management controls. We express multi-tenant performance as a fraction of this uncontrolled, stand-alone performance. This is a good baseline because the uncontrolled, stand-alone setup is equivalent to dedicated MapReduce or Impala clusters.
Uncontrolled Multi-tenant Behavior
We measured multi-tenant performance without the resource management controls described previously. In our tests, MapReduce consumed the majority of cluster resources, which heavily affected Impala performance. Specifically, uncontrolled multi-tenant MapReduce dropped to a median of ~90% of its uncontrolled stand-alone performance, while uncontrolled multi-tenant Impala dropped to a median of ~50% of its uncontrolled stand-alone performance.
These numbers reflect higher than expected uncontrolled multi-tenant performance. The suspected reason is that Impala and MapReduce resource demands are not always in conflict. For our future tests, we will explore how to control Impala and MapReduce workload intensity to create sustained conflicts in resource demands.
Our resource management mechanisms improve on this uncontrolled behavior in several ways. First, the cluster resources are less likely to become overcommitted. Specifically, the controlled setup is more “memory safe,” so memory-heavy workloads have a lower risk of causing swapping/thrashing. Second, the user gains the ability to dial up or dial down the resources consumed by each framework. For example, if the uncontrolled setup results in Impala performance below production SLOs, the user can turn on resource controls and shift resources from MapReduce to Impala so that both frameworks meet the desired SLOs.
Multi-tenant Behavior under Resource Management
The resource management controls allow us to assign fraction x of all cluster resources to, say, Impala, and the rest to MapReduce. We measured performance for 25-75, 40-60, 50-50, 60-40, and 75-25 resource splits between Impala and MapReduce. The test results demonstrate that we can indeed control multi-tenant performance by assigning more resources to a compute framework.
The graphs below show Impala and MapReduce multi-tenant performance. Each data point on the graph shows the median slowdown for that setting while the error bars show the 25-75 percentile slowdown across the test matrix. They also show the expected performance according to the previous models.
The graphs indicate that Impala and MapReduce multi-tenant performance both meet or exceed predicted performance according to the previous models. There are some Impala query and MapReduce job combinations where performance is barely affected in a multi-tenant environment (error bars close to fraction 1 of stand-alone performance). There are also some combinations where performance is affected beyond that predicted by earlier models (error bars below the “Modeled” line). Most combinations meet or exceed our expectations. As before, the suspected reason is that Impala and MapReduce resource demands are not always in conflict.
Verifying Resource Contention
We used Cloudera Manager to monitor physical resource utilization during both the uncontrolled and controlled multi-tenant tests. Memory is almost fully used all the time. CPU and disk loads are bursty and contention occurs only some of the time. This behavior likely explains our test results where Impala and MapReduce both exceed the performance model.
Stand-alone Behavior with Resource Management
Multi-tenant resource management should behave as soft limits. In other words, when there is only one framework active, it should behave as if it has the entire cluster to itself. The CPU and disk controls above are soft limits by design while the memory controls are hard limits. Hence, we felt it necessary to measure stand-alone performance subject to resource management controls. Note this is different from the previous baseline of stand-alone performance without resource management controls.
Impala behaves exactly as if it had the entire cluster to itself. This is the intended behavior for the Impala daemon memory limit: either the query runs unaffected, or, upon hitting the memory limit, it is aborted.
For MapReduce, the behavior is more complex. Cutting the available slots places an upper bound on the amount of parallelism. For modest cuts, there would still be enough parallelism to drive resources to full utilization, and the performance impact would be small. For severe cuts, the remaining slots would not be enough to drive resources to full utilization, and the performance impact would be large.
What counts as a “modest” or “severe” cut depends on the cluster hardware. For our test cluster, cutting the slots to 50% results in only a 15% performance drop. However, cutting the slots to 25% gives stand-alone performance roughly the same as multi-tenant performance, meaning that the available slots, not resource contention, become the performance constraint for MapReduce. Hence, depending on cluster hardware, there may be room to relax the resource management settings we described earlier.
Big data multi-tenant performance represents a cutting-edge engineering problem with immediate impact for many users. Our test methods, models, and results carry a high level of real-world relevance, mathematical rigor, and empirical repeatability.
One immediate follow up effort is to expand our test scenarios to cover multi-query Impala workloads running concurrently with multi-job MapReduce workloads, each with a controlled level of workload intensity.
Overall, we hope this post can help users tune their multi-tenant clusters to meet production SLOs, and help the community at large understand how to manage resources in a shared big data platform.
Yanpei, Prashant, and Arun are members of the Performance Team at Cloudera.