One of the key challenges of building an enterprise-class, robust, scalable storage system is validating it under duress and with failing system components. This includes, but is not limited to, failed networks, failed or failing disks, arbitrary delays in the network or I/O path, network partitions, and unresponsive systems.
The Apache Ozone fault injection framework is designed to validate Ozone under heavy stress and with failed or failing system components. Specifically, it enables injecting different types of failures into an Ozone cluster and validating system behavior in the presence of those failures. The framework is generic and extensible enough to allow new classes of failures to be added over time, along with a suite of automated test cases that validate system behavior against each newly defined failure class.
Although we have designed this fault injection framework for Ozone, it is generic enough to be used for validating any other distributed and scalable system.
What can the Apache Ozone fault injection framework do?
This framework is designed to simulate failure of a variety of system components, specifically:
- Disk Failures: It can simulate disk failures and validate Ozone behavior and correctness in the presence of failed or misbehaving disks.
- Disk Slowness: It can simulate failing disks with slow response times and validate Ozone behavior and correctness in the presence of such disks.
- Disk block corruption: It can simulate bad blocks on disk and validate Ozone behavior and correctness in the presence of bad blocks and corrupted chunks. It is also possible to simulate transient bad blocks that can return correct data after a while or after a restart.
- Network failure: It can simulate lost network packets and validate Ozone behavior and correctness in the presence of a misbehaving network.
- Network Delays: It can inject arbitrary delays in the network and validate Ozone behavior and correctness in the presence of network delays.
- Node Failures: It can inject node failures and validate Ozone behavior and correctness in the presence of unresponsive nodes.
- Node reboots: It can simulate node reboots at specific points and validate Ozone behavior and correctness with nodes crashing at different points.
- Network partitions: It can simulate network partitions and validate Ozone behavior and correctness in the presence of network partitions in an Ozone cluster.
- CPU starvation: It can simulate CPU starvation and validate Ozone behavior in the presence of high CPU contention.
How is this any different from existing systems?
Randomly injecting failures and hoping to catch race conditions and possible data corruption is not always fruitful. Analyzing the results of random failure injection also requires manual work to rule out false alarms. We considered the Namazu framework but decided not to use it for very similar reasons.
We also noticed that the timing of failure injection plays a critical role in the outcome. While existing fault injection frameworks offer random error injection, they lack the ability to control the timing and placement of an error relative to test case execution. Given these shortcomings, we developed an Ozone fault injection framework that allows for random error injection as well as precise, targeted error injection. This lets us create targeted test cases in which we inject and control failures within well-defined time windows while Ozone is serving a given request. Such a test case should also have a well-defined outcome that the test can validate without manual analysis.
This framework does not require any code changes to the system under test. It simulates failures directly at the file system layer or the network layer.
Introducing Fault Injection Service
The fault injection service runs on every node that hosts one or more system components under test. It provides REST APIs to inject and reset various types of failures, and it has one or more plugin extensions for injecting different types of failures.
FUSE Extension
One key part of the fault injection service is a very lightweight passthrough FUSE file system that Ozone uses for storing all its persistent data and metadata. The service provides APIs to control how and when this file system misbehaves, including injecting delays and failures on the read/write path. The APIs are generic enough to target both Ozone data and metadata for failures, corruption, and delays.
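As a rough illustration of the passthrough idea, the Python sketch below shows the kind of decision such a layer could make on each read: serve the request from the real file unless an injected rule says to delay it, fail it, or corrupt the result. It is not the framework's code and does not use FUSE at all; `FaultRule`, `FaultyReader`, and the rule fields are hypothetical names introduced only for this example.

```python
import errno
import os
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultRule:
    """Hypothetical rule: which path prefix to target and what to do on read."""
    path_prefix: str
    delay_s: float = 0.0            # sleep this long before serving the read
    error: Optional[int] = None     # errno to raise instead of reading
    corrupt: bool = False           # invert every byte of the returned data


class FaultyReader:
    """Passes reads through to the real file unless a rule matches the path."""

    def __init__(self) -> None:
        self.rules: List[FaultRule] = []

    def inject(self, rule: FaultRule) -> None:
        self.rules.append(rule)

    def reset(self) -> None:
        self.rules.clear()

    def read(self, path: str, size: int, offset: int) -> bytes:
        # The first rule whose prefix matches the path wins.
        rule = next((r for r in self.rules if path.startswith(r.path_prefix)), None)
        if rule and rule.delay_s:
            time.sleep(rule.delay_s)
        if rule and rule.error is not None:
            raise OSError(rule.error, os.strerror(rule.error), path)
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        if rule and rule.corrupt:
            data = bytes(b ^ 0xFF for b in data)
        return data
```

Because the rules live outside the application, the process doing the reads needs no changes; calling `reset()` restores plain passthrough behavior.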
NetFilter Extension
Another key part of the service is the ability to filter network packets and return failures or introduce delays in the network. This filter can also be used to create network partitions. This is done with a custom netfilter module that can use libnetfilter_queue.
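The filtering logic can be pictured as a per-packet verdict function. The sketch below is a simplified model in plain Python, not the netfilter module itself and not a use of libnetfilter_queue; `NetRule`, `verdict`, and `partition` are hypothetical names, and packets are reduced to a source and a destination string.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NetRule:
    """Hypothetical rule matched against a packet's source and destination."""
    src: Optional[str] = None   # None matches any source
    dst: Optional[str] = None   # None matches any destination
    drop_prob: float = 0.0      # 1.0 drops every matching packet
    delay_ms: int = 0           # hold the packet this long before accepting


def verdict(src: str, dst: str, rules: List[NetRule],
            rng: random.Random) -> Tuple[str, int]:
    """Return ("DROP", 0) or ("ACCEPT", delay_ms) for one packet."""
    for rule in rules:
        if rule.src not in (None, src) or rule.dst not in (None, dst):
            continue  # rule does not apply to this packet
        if rng.random() < rule.drop_prob:
            return ("DROP", 0)
        return ("ACCEPT", rule.delay_ms)
    return ("ACCEPT", 0)  # no rule matched: pass through untouched


def partition(group_a: List[str], group_b: List[str]) -> List[NetRule]:
    """Rules that drop all traffic between two groups, in both directions."""
    rules = []
    for a in group_a:
        for b in group_b:
            rules.append(NetRule(src=a, dst=b, drop_prob=1.0))
            rules.append(NetRule(src=b, dst=a, drop_prob=1.0))
    return rules
```

In this model a network partition is just a set of drop rules between two groups of nodes, which is why the same filter can serve packet loss, delay, and partition tests.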
Ozone Extension
Initially, we plan to use this framework for injecting failures in system components such as the file system or the network. Over time, we can do more intrusive white-box testing by enabling and disabling various join points and delay points within the Ozone code. We could then provide APIs to enable or disable a crash or delay behavior for a specific action.
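One way to picture such delay points is a named hook that does nothing until a test enables it. The sketch below is a generic Python illustration of that pattern under our own assumptions; it is not Ozone code, and `fault_point`, `enable`, `disable`, and the point name are all hypothetical.

```python
import time
from typing import Dict

# Hypothetical registry: point name -> action. Nothing here is Ozone code.
_actions: Dict[str, dict] = {}


class InjectedCrash(RuntimeError):
    """Raised by an enabled crash point to simulate a failure at that spot."""


def enable(point: str, delay_s: float = 0.0, crash: bool = False) -> None:
    _actions[point] = {"delay_s": delay_s, "crash": crash}


def disable(point: str) -> None:
    _actions.pop(point, None)


def fault_point(point: str) -> None:
    """Call at an interesting spot in the code; a no-op unless enabled."""
    action = _actions.get(point)
    if action is None:
        return
    if action["delay_s"]:
        time.sleep(action["delay_s"])
    if action["crash"]:
        raise InjectedCrash(point)


def write_chunk(data: bytes) -> int:
    """Toy operation with a fault point between staging and commit."""
    staged = bytes(data)
    fault_point("write_chunk.before_commit")  # hypothetical point name
    return len(staged)
```

Unlike the file system and network extensions, this approach requires changes to the code under test, which is why it is described as more intrusive.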
How does it work?
The figure below depicts the overall setup required to test Ozone.
- Each node that is running an Ozone service also runs an instance of the failure injection service.
- The failure injection service exposes a REST API endpoint to dynamically configure, inject, and reset a failure.
- Failure injection is carried out by intercepting file system I/Os and network packets. No changes to Ozone code are required to simulate failures. In fact, Ozone is not even aware of the failure injection service.
- A typical control flow for testing Apache Ozone with this fault injection framework looks like this:
  - Query OM/SCM/DataNodes to identify the target for failure injection. The target could be a particular node (network endpoint), a file system, a directory, a data file, or a byte-offset range within a given data file.
  - Use the REST API endpoint on a given node to inject the failure. A failure action is a delay, an error code, or corrupted data chunks.
  - Issue the Ozone request after injecting the failure.
  - Validate the expected output using the Ozone API, CLI, or log files.
  - Reset the failure and move on to the next test.
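The inject, request, validate, and reset steps above map naturally onto a context manager that guarantees the reset. The following is a minimal sketch under stated assumptions: `FakeFaultService` is an in-memory stand-in for a node's REST endpoint (the real endpoint paths and payloads are not described here), and `injected` and `run_case` are hypothetical helper names. Identifying the target through OM/SCM/DataNodes is left out.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class FakeFaultService:
    """In-memory stand-in for one node's fault injection endpoint."""

    def __init__(self) -> None:
        self.active: Dict[str, dict] = {}

    def inject(self, target: str, action: dict) -> None:
        self.active[target] = action

    def reset(self, target: str) -> None:
        self.active.pop(target, None)


@contextmanager
def injected(service: FakeFaultService, target: str,
             action: dict) -> Iterator[None]:
    """Inject a fault for the duration of the block, then always reset it."""
    service.inject(target, action)
    try:
        yield
    finally:
        service.reset(target)  # runs even if the request raises


def run_case(service: FakeFaultService, target: str, action: dict,
             request: Callable[[], str], expected: str) -> bool:
    """Inject the fault, issue the request, reset, then validate the outcome."""
    with injected(service, target, action):
        outcome = request()
    return outcome == expected
```

In a real test, `request` would issue an Ozone read or write, and the stand-in would be replaced by a client for the node's REST endpoint. Resetting in a `finally` block keeps one failed test from leaking its fault into the next.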
Apache Ozone Bugs Found With Fault Injection Service
| Issue | Summary |
| --- | --- |
| HDDS-3064 | Get Key is hung when READ delay is injected in chunk file path |
| HDDS-3136 | Retry timeout is large while writing key |
| HDDS-3163 | Write Key is hung when write delay is injected in datanode dir |
| HDDS-3214 | Unhealthy datanodes repeatedly participate in pipeline creation |
Kraken with Apache Ozone Fault Injection Framework
Kraken is a unified fault injection framework developed by Cloudera for resiliency testing. It provides a language-agnostic, cloud-agnostic, and deployment-agnostic framework for wrapping existing fault injection implementations, and it exposes a simple, unified interface for users to consume. Because it is a hosted fault injection framework, setup and installation effort is reduced to nearly zero.
Ozone fault injection is now integrated with the Kraken framework to inject errors at the system level and enhance its capability to validate system robustness.
Kraken users do not need to perform any complicated setup or installation. A single command or an HTTP POST request sets up the fault agent on the machines where the system under test is running. After this step, users can immediately start resiliency testing through the GUI, or automate tests through the simple APIs of the Kraken client SDK. With the Kraken integration, the faults of the Ozone fault injection framework can be consumed through the same Kraken client SDK APIs. Kraken also provides many built-in fault implementations covering targeted resources: CPU, memory, disk, network, and processes.
With Kraken’s BYOF (Bring Your Own Fault) principle, any other fault injection implementation can be integrated with Kraken easily. And with Kraken’s unified interface (using SDKs auto-generated from the Swagger JSON of the fault services), all of these faults can be used in resiliency tests with simple, uniform code.
The nodes of the system or application under test (SUT) need not open an SSH port for communication from test automation or any fault injection triggers. The Kraken fault agent installed on the SUT nodes registers itself with Kraken’s nodes service and listens for incoming messages on a RabbitMQ queue.
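The reason no inbound port is needed is that the agent pulls work from a queue instead of accepting connections. The toy Python model below illustrates that pattern only; it uses the standard library's `queue.Queue` in place of RabbitMQ and a plain list in place of the nodes service, and `FaultAgent`, its methods, and the message fields are hypothetical, not Kraken's API.

```python
import queue
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class FaultAgent:
    """Toy agent: registers with a node registry, then serves fault requests."""

    def __init__(self, node: str, inbox: "queue.Queue[dict]") -> None:
        self.node = node
        self.inbox = inbox
        self.handlers: Dict[str, Handler] = {}

    def register(self, registry: List[str]) -> None:
        registry.append(self.node)  # stand-in for the nodes service

    def add_fault(self, name: str, handler: Handler) -> None:
        self.handlers[name] = handler

    def drain(self) -> List[str]:
        """Handle every queued message and return one result per message."""
        results: List[str] = []
        while True:
            try:
                msg = self.inbox.get_nowait()
            except queue.Empty:
                return results
            handler = self.handlers.get(msg.get("fault", ""))
            results.append(handler(msg) if handler else "unknown fault")
```

Registering handlers by name also gives a feel for how additional fault implementations could be plugged in behind one uniform message format.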
The client layer of Kraken currently consists of the auto-generated SDKs, the Kraken client SDK (a wrapper library over the auto-generated SDKs), and a GUI. Kraken’s roadmap includes building a Random Disruptor (similar to Chaos Monkey, but with Kraken’s advantages) in this layer using its built-in fault implementations. The Random Disruptor would be a client, configured with a randomized fault injection algorithm and policies, that consumes Kraken’s fault injection services.
Kraken: Built-in Faults Supported
The faults Kraken currently supports, including the disk failures from the Apache Ozone fault injection framework, are:
- Process Faults:
  - Kill a process by name or process ID.
  - Hang a process by name or process ID.
  - Send any signal to a process by name or process ID.
- CPU:
  - Create CPU consumption spikes for a given period of time.
- Memory:
  - Fill the main memory to a given percentage for a given period of time.
- Disk:
  - Fill the disk to a given percentage.
  - Integrated from the Apache Ozone fault injection framework:
    - Slow read
    - Slow write
    - Corrupt read operations
    - Corrupt write operations
    - Fail read operations
    - Fail write operations
- Network:
  - Block a port for ingress, egress, or both.
  - Introduce latency with a given delay (in seconds) for a given period.
  - Introduce packet loss.
- Recovery:
  - All faults support on-demand recovery. For example, a CPU spike fault requested for an hour can be stopped after a few minutes.
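On-demand recovery amounts to a fault that runs for a requested duration unless it is told to stop sooner. The sketch below shows one way to model that in Python with a `threading.Event`; `TimedFault` is a hypothetical name, the fault itself is a placeholder that only waits, and nothing here reflects how Kraken implements recovery.

```python
import threading
import time


class TimedFault:
    """Runs a fault for at most `duration_s`, or until `recover()` is called."""

    def __init__(self, duration_s: float) -> None:
        self.duration_s = duration_s
        self.elapsed_s = 0.0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        start = time.monotonic()
        # A real fault would burn CPU, hold memory, etc. during this window.
        # wait() returns early as soon as recover() sets the event.
        self._stop.wait(timeout=self.duration_s)
        self.elapsed_s = time.monotonic() - start

    def start(self) -> None:
        self._thread.start()

    def recover(self) -> None:
        """On-demand recovery: end the fault now and wait for it to finish."""
        self._stop.set()
        self._thread.join()
```

A fault created with `TimedFault(duration_s=3600)` and recovered a few minutes later would end at the moment of the `recover()` call instead of running for the full hour.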