Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning

Categories: CDH Testing

In this installment of our series about how quality assurance is done at Cloudera, learn about the important role of fault injection in the overall process.

Apache Hadoop is the consummate example of a scalable distributed system (SDS); such systems are designed to provide 24/7 services reliably and to scale elastically and cost-effectively with the addition of industry-standard hardware. They must also be resilient and fault-tolerant in the face of various environmental anomalies.

As you would expect, the software layer in Hadoop that provides risk-aware management is quite complex. Thus, for customers running Hadoop in production, severe system-level design flaws can potentially stay hidden even after deployment—risking the derailment of services, loss of data, and possibly even loss of revenue or customer satisfaction. Testing of SDSs is not only costly but also technically challenging.

As described in the first installment of this multipart series, Cloudera pursues continuous improvement and verification of our distribution of the Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle. Subjecting these releases to fault injection is one step in that cycle, which also includes unit testing, static/dynamic code analysis, integration testing, system and scale/endurance testing, and finally, validation of real workloads (including customer and partner ones) on internal systems prior to release.

In this post, I will describe the homegrown fault-injection tools and elastic-partitioning techniques used internally against most of the services shipped in CDH, which have already proven useful for finding unknown bugs in releases before shipment to customers.

Tooling

Chaos Monkey, which was invented and open-sourced by Netflix, is perhaps the best-known example of a fault-injection tool for testing an SDS. Although Cloudera Engineering strongly considered using “plain-vanilla” Chaos Monkey when this effort began, we eventually determined that writing our own similar tools would better meet our long-term requirements, including the abilities to:

  • Do systematic exploration (in conjunction with random testing)
  • Introduce a wider variety of faults
  • Have finer-grained control over which faults to introduce, and when and where
  • Minimize integration with the system under test during the configuration phase. (Our implementation works at a service/role level, whereas Chaos Monkey works only at a node level.)

These internally developed tools, called AgenTEST and Sapper, in combination serve as our “Swiss army knife” for fault injections: AgenTEST knows how to inject (does the dirty job of starting and stopping the injections), while Sapper is in charge of when and what to inject. The following figure illustrates how AgenTEST and Sapper work together and in combination with elastic partitioning (more about that later):


[Figure: AgenTEST and Sapper tools overview]

Next, I’ll describe how these tools work.

AgenTEST

AgenTEST runs as a daemon service on each node of the cluster. It monitors a pre-configured folder, waiting for instructions about when and what to inject; those directions are fed to it using files. Every time a new file is added to the folder, an injection is started; whenever the file is removed, the injection is stopped. The injection to apply is encoded in the name of the file. For example, if the file 10.0.0.1:3060~REJECT1~I is created, AgenTEST will send a “port-not-reachable” message for every packet sent to the address 10.0.0.1 at port 3060.
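As a concrete illustration, the following minimal Python sketch drives this file-based protocol from a test script: it starts an injection by creating a marker file in the watched folder and stops it by deleting that file. The watch-folder path matches the /tmp/AgenTEST-inj folder used in the Sapper example later in this post; everything else (function names, timings) is illustrative only.

    # Minimal sketch of driving AgenTEST's file-based protocol from a test script.
    # Assumption: the daemon watches /tmp/AgenTEST-inj, the folder used in the
    # Sapper example later in this post.
    import time
    from pathlib import Path

    WATCH_DIR = Path("/tmp/AgenTEST-inj")  # folder monitored by the AgenTEST daemon

    def inject(injection_name: str, duration_s: float) -> None:
        """Start an injection by creating a file; stop it by removing the file."""
        marker = WATCH_DIR / injection_name
        marker.touch()                       # new file appears -> injection starts
        try:
            time.sleep(duration_s)
        finally:
            marker.unlink(missing_ok=True)   # file disappears -> injection stops

    if __name__ == "__main__":
        # Reject traffic to 10.0.0.1:3060 for 30 seconds.
        inject("10.0.0.1:3060~REJECT1~I", duration_s=30)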

AgenTEST can perform different types of injections, ranging from network to disk, CPU, memory, and so on. Currently, it supports the following fault-injection operations (a sketch of how one of the network faults could be implemented follows the list):

  • Packets reject. Drop all the packets from/to the specified address and port and reply with the tcp-reset or port-unreachable error codes.
  • Packets drop. Quietly drop all the packets from/to the specified address and port.
  • Packets loss. Drop a configurable % of packets to the specified address and port.
  • Packets corrupt. Corrupt a configurable % of packets to the specified address and port.
  • Packets reorder. Reorder a configurable % of packets to the specified address and port.
  • Packets duplicate. Duplicate a configurable % of packets to the specified address and port.
  • Packets delay. Introduce a configurable delay for packets sent to the specified address and port.
  • Bandwidth limit. Limit the network bandwidth for the specified address and port.
  • DNSFail. Inject a failure that makes DNS resolution fail.
  • FLOOD. Start a DoS attack on the specified port.
  • BLOCK. Blocks all the packets directed to 10.0.0.0/8 (used internally by EC2).
  • SIGSTOP. Pause a given process in its current state.
  • BurnCPU. Run CPU-intensive processes, simulating a noisy neighbor or a faulty CPU. It is possible to specify the number of cores to burn.
  • BurnIO. Run disk-intensive processes, simulating a noisy neighbor or a faulty disk. It is possible to specify the amount of IOPS to burn and the mount point to be affected.
  • FillDISK. Write a huge file to the specified root device, filling up the root disk. It is possible to specify the percentage of free disk space to consume.
  • CorruptHDFS. Corrupt one HDFS file using the size and the offset provided in input.
  • RONLY. Remount a device's mount point as read-only.
  • FillMEM. Consume memory. It is possible to specify the percentage of free memory to consume.
  • HANG. Hang a host by running a fork bomb.
  • PANIC. Force a kernel panic.
  • Suicide. Shut down the machine.
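For illustration, the sketch below shows how one of these network faults, packet delay, could be implemented on Linux with the standard tc/netem tooling. It is an example of the general technique only, not AgenTEST's internal implementation; the interface name and delay values are arbitrary.

    # Illustrative only: adding and removing an outgoing packet delay with tc/netem.
    # This is not AgenTEST's internal implementation; interface and values are examples.
    import subprocess

    def start_packet_delay(interface: str = "eth0", delay_ms: int = 200, jitter_ms: int = 50) -> None:
        # Attach a netem qdisc that delays every outgoing packet on the interface.
        subprocess.run(
            ["tc", "qdisc", "add", "dev", interface, "root", "netem",
             "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
            check=True,
        )

    def stop_packet_delay(interface: str = "eth0") -> None:
        # Remove the netem qdisc, restoring normal latency.
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root", "netem"],
                       check=True)

    if __name__ == "__main__":
        start_packet_delay()
        # ... run the workload under test ...
        stop_packet_delay()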

Moreover, we are currently testing new injections such as “commission nodes” and “de-commission nodes.”

Sapper

Sapper (a “sapper” is the colloquial term for a combat engineer) is the only tool we have that has awareness of the service (e.g. HDFS) and role (e.g. NameNode, DataNode) being tested.

To use Sapper, the tester has to provide a “policy” and a CDH cluster (not necessarily driven by Cloudera Manager). An example of such a policy is described below.

This policy tells Sapper to target Apache Solr, and in particular all nodes running Solr Server roles. It also says to traverse those nodes in round-robin fashion, waiting a randomly selected interval of 5-10 seconds between them. Once a node is selected, the injection to apply is BurnCPU (75% of the available cores, for a duration of 60 seconds).
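The exact policy syntax is Sapper-specific, but as a rough sketch, the fields just described can be captured in a structure like the following; the layout and field names here are hypothetical, chosen only for illustration.

    # Hypothetical representation of the policy described above; the field names
    # are illustrative only and do not reflect Sapper's actual policy format.
    policy = {
        "service": "SOLR",
        "role": "SOLR_SERVER",
        "traversal": "round-robin",      # or "random"
        "interval_seconds": (5, 10),     # wait a random 5-10 s between targets
        "injection": {
            "type": "BurnCPU",
            "cpu_percentage": 75,        # burn 75% of the available cores
            "duration_seconds": 60,
        },
    }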

As previously stated, the notification to AgenTEST is done using files. In this case, Sapper will add a file named BURNCPU~75 to the folder /tmp/AgenTEST-inj of the target node. After 60 seconds, Sapper will delete the previously created file from the target node.
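Putting the pieces together, a stripped-down version of this loop might look like the sketch below. Writing the marker file on the remote target node (for example, over SSH) is reduced here to a local write for brevity; node names and anything beyond the /tmp/AgenTEST-inj folder are assumptions.

    # Minimal sketch of Sapper's round-robin traversal and file-based notification.
    # In reality the marker file is created on the *target* node; here it is a
    # local write for brevity.
    import random
    import time
    from pathlib import Path

    AGENTEST_DIR = Path("/tmp/AgenTEST-inj")   # folder watched by AgenTEST on each node

    def run_policy(target_nodes, cpu_percentage=75, duration_s=60, interval=(5, 10)):
        marker_name = f"BURNCPU~{cpu_percentage}"
        i = 0
        while True:
            node = target_nodes[i % len(target_nodes)]   # round-robin traversal
            i += 1
            print(f"Injecting {marker_name} on {node}")
            marker = AGENTEST_DIR / marker_name          # in reality: written on `node`
            marker.touch()                               # start the BurnCPU injection
            time.sleep(duration_s)                       # let it run for 60 seconds
            marker.unlink(missing_ok=True)               # stop the injection
            time.sleep(random.uniform(*interval))        # wait 5-10 s before the next node

    if __name__ == "__main__":
        run_policy(["solr-node-1", "solr-node-2", "solr-node-3"])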

Currently, Sapper supports two different traversing strategies: round-robin and random. Given this deeper knowledge about services and roles, Sapper can also apply some injections itself without relying on AgenTEST—for example, to stop, start, re-start, and kill a specific service/role.

We’ll provide much more detail about Sapper in a future post.

Elastic Partitioning

As a complement to the use of AgenTEST and Sapper, we implement elastic partitioning as a fault injection in its own right behind the scenes, enforcing specific communication patterns across the cluster while a workload is running. In particular, it explores different partitioning scenarios on the cluster to test reliability and high availability, and to expose bugs. (Note that elastic partitioning has no awareness of what is actually running on the cluster; it only knows hostnames. Thus, it requires minimal, if any, configuration.)

Depending on the context (e.g. size of the cluster, type of system under test, and so on), different types of exploration can be preferred over others at a given time. Elastic-partitioning techniques can be classified with respect to the number of partitions they generate (seen also as the time required to explore them) but, more importantly, with respect to the types of communication they inhibit.

In the following overview of techniques, we use the term "partitioning" whenever we enforce specific patterns of communication (as in circular partitioning, for example), not only when we completely disconnect nodes from the cluster. Note that given a cluster of size n, there are 2^(n*(n-1)) possible communication scenarios; for a five-node cluster that is already 2^20, more than a million, making exhaustive coverage prohibitively expensive. Instead, we focus on a combination of certain communication patterns.

Bipartite partitioning enforces systematic bipartition of the cluster; that is, it considers all the possible ways in which the n nodes of the cluster can be divided into two disjoint sets. Nodes that are partitioned away from the cluster are completely isolated, in contrast to the other techniques (presented below), where connectivity is only reduced.

[Figure: bipartite partitioning]

The figure above illustrates the set of possible partitions given a cluster with n = 4 nodes. On the first level is the entire cluster (no partitions). On the second level are the possible (3, 1) partitions. (That is, there are two sub-partitions: one containing three nodes and one containing a single node.) Finally, on the third level are the (2, 2) partitions.
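For illustration, this enumeration can be written down in a few lines; the sketch below (not the production implementation) yields exactly the seven splits shown in the figure for n = 4: four (3, 1) splits and three (2, 2) splits.

    # Enumerate all ways to split a cluster into two non-empty, disjoint groups.
    # Illustrative sketch only; for n = 4 it yields 7 bipartitions.
    from itertools import combinations

    def bipartitions(nodes):
        nodes = list(nodes)
        first, rest = nodes[0], nodes[1:]    # fix one node so each split is emitted once
        for r in range(len(rest) + 1):
            for group in combinations(rest, r):
                side_a = {first, *group}
                side_b = set(nodes) - side_a
                if side_b:                   # skip the trivial "no partition" case
                    yield side_a, side_b

    if __name__ == "__main__":
        for a, b in bipartitions(["Node1", "Node2", "Node3", "Node4"]):
            print(sorted(a), "<-X->", sorted(b))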

This technique is particularly suited for testing leader elections and the “split-brain” phenomenon. At the same time, however, it is not a great fit for systems with a single point of failure: isolating the single node running such a service will certainly generate failures. To address such a scenario, more careful thought is required; for example, the node running the service can be excluded from the partitioning, or HA services or different partitioning strategies can be considered.

Circular partitioning forces ring-based communication across the nodes in the cluster. Given a set of nodes forming a cluster, we find all distinct cyclic orders on such a set. (Cyclic orders obtainable through circular shifts are considered redundant and then discarded.) This technique forces each node to communicate exclusively with its predecessor and successor in the ring.

[Figure: circular partitioning]

The figure above shows one of the possible partitions given a cluster with n = 5 nodes. The red dotted arrows represent inhibited communications (for example, Node5 cannot communicate with Node2).
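The rings themselves can be enumerated by fixing one node and permuting the rest, which discards orderings that differ only by a circular shift. A small illustrative sketch:

    # Enumerate the distinct rings (cyclic orders up to circular shift) of a cluster
    # and the communication links each ring allows. Illustrative sketch only.
    from itertools import permutations

    def rings(nodes):
        nodes = list(nodes)
        first, rest = nodes[0], nodes[1:]
        # Fixing the first node removes orderings that are mere circular shifts.
        for perm in permutations(rest):
            yield (first, *perm)

    def allowed_links(ring):
        """Pairs of nodes allowed to communicate: each node with its two neighbors."""
        n = len(ring)
        return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}

    if __name__ == "__main__":
        for ring in rings(["Node1", "Node2", "Node3", "Node4", "Node5"]):
            print(ring, [tuple(sorted(link)) for link in allowed_links(ring)])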

Bridge partitioning is inspired by Kyle Kingsbury’s blog posts about partitioning experiments. He describes this kind of partition as "A grudge which cuts the network in half, but preserves a node in the middle which has uninterrupted bi-directional connectivity to both components."

Given a cluster of size n = 5, using bridge partitioning, we iteratively select one node and bipartition (in all possible ways) all the remaining nodes.

[Figure: bridge partitioning]

The figure above shows one of the partitions generated by this technique. In particular, there are two equally sized sub-clusters, {Node1, Node2} and {Node4, Node5}, and Node3 can communicate with every other node regardless of the sub-cluster to which it belongs. As with bipartite partitioning, this technique is likely to expose the split-brain phenomenon.
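The construction itself is a small extension of bipartite partitioning: pick each node in turn as the bridge, then bipartition the remaining nodes in all possible ways while the bridge keeps connectivity to both sides. An illustrative sketch:

    # Enumerate bridge partitions: one bridge node keeps full connectivity while the
    # remaining nodes are split into two disjoint, non-empty groups in every possible way.
    from itertools import combinations

    def bridge_partitions(nodes):
        nodes = list(nodes)
        for bridge in nodes:
            rest = [n for n in nodes if n != bridge]
            anchor, others = rest[0], rest[1:]   # fix one node so each split is emitted once
            for r in range(len(others) + 1):
                for group in combinations(others, r):
                    side_a = {anchor, *group}
                    side_b = set(rest) - side_a
                    if side_b:
                        yield bridge, side_a, side_b

    if __name__ == "__main__":
        for bridge, a, b in bridge_partitions(["Node1", "Node2", "Node3", "Node4", "Node5"]):
            print(f"bridge={bridge}  {sorted(a)} <-X-> {sorted(b)}")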

Star partitioning relies on a classification of the nodes of the cluster to generate a partition. In this technique, we classify the nodes of the cluster as super and regular: super-nodes are those running dedicated roles or exposing single points of failure; regular nodes are everything else.

We iterate over the set of super-nodes. Given a super-node, we allow it to communicate with all the other nodes in the cluster, and we partition the remaining nodes in all the possible ways. For example, given a cluster with n = 5 nodes, let’s assume that Node2 and Node3 are super-nodes and the remaining nodes are regular ones. The figure below shows one possible partition when we are iterating on the super-node Node3.

[Figure: star partitioning]

Node1 and Node2 can communicate exclusively with Node3. Node4 and Node5 can communicate with each other (and with Node3). The inference is that at least one of the super-nodes needs to be constantly available to maintain the functionality of the system under test.

Asymmetric partitioning, which is particularly esoteric, increases the exploration space. With this technique, we add a new dimension to all the partitioning techniques presented so far.

In the techniques above, the communication between two nodes is treated as a single entity. Here, we introduce the concept of direction: that is, it becomes possible for Nodei to send packets to Nodej but not vice versa. Due to the high cost and the particularity of this exploration, we gave up on systematic exploration in favor of random exploration.
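A minimal sketch of this random exploration: each directed link Nodei -> Nodej is blocked independently with some probability (the probability value below is an arbitrary example).

    # Random exploration of asymmetric partitions: each directed link src -> dst is
    # independently blocked with probability p. Illustrative sketch only.
    import random
    from itertools import permutations

    def random_asymmetric_partition(nodes, p=0.3, seed=None):
        rng = random.Random(seed)
        blocked = set()
        for src, dst in permutations(nodes, 2):   # ordered pairs, so direction matters
            if rng.random() < p:
                blocked.add((src, dst))           # src can no longer send packets to dst
        return blocked

    if __name__ == "__main__":
        partition = random_asymmetric_partition(
            ["Node1", "Node2", "Node3", "Node4", "Node5"], p=0.3, seed=42)
        for src, dst in sorted(partition):
            print(f"block {src} -> {dst}")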

Finally, custom partitioning is nothing more than a user-friendly interface, layered on top of native tools provided by the OS, for partitioning the cluster. Use cases for this technique relate to testing specific features (HA services, heartbeat, and so on). The user specifies which nodes belong to which partition, and the elastic-partitioning tooling then enforces the corresponding communication pattern.

Use Cases for Elastic Partitioning

Bipartite and custom explorations are ideal for testing high availability and split-brain phenomenon because they divide the cluster into subgroups and surgically isolate the nodes. In contrast, the other techniques do not cut nodes from the cluster but rather force particular communication patterns. As stated previously, the goal is to reduce communication capacity over the cluster in order to test replication and consistency. Good candidates for these explorations are those systems that have proxy functionalities (such as forwarding requests and/or updates); for example, we force Solr instances to communicate only with ZooKeeper followers, thereby avoiding direct communication with the leader.

Conclusion

By now, you should have a good understanding of what role fault-injection tools and elastic partitioning techniques play in Cloudera’s software-development life cycle. Using these tools, we have already uncovered issues involving inconsistent behavior that would otherwise have gone unnoticed without doing fault-injection testing under multiple scenarios. (It should be noted that although we intend to file and fix any bugs we find upstream, we do not currently plan to open-source the tools themselves, pending feedback from the community.) Thanks to the multi-stage QA process during that life cycle, stability and reliability are assured.

In the next installment, we’ll cover how running/upgrading to new releases on our own internal EDH cluster contributes to product stability and reliability.

Francesco Sorrentino is a Software Engineer at Cloudera. Prior to joining Cloudera, he worked at RuntimeVerification applying the research done in predictive testing while earning his Ph.D. in computer science.

