Quality Assurance at Cloudera: Highly-Controlled Disk Injection

Categories: CDH Testing Tools

Recently added fault-injection techniques are making quality-assurance processes even more rigorous.

In a previous installment of our series about quality assurance inside Cloudera, we described the fault-injection frameworks (AgenTEST and Sapper) that Cloudera Engineering has devised. These frameworks determine when and how injections should occur, and start and stop them accordingly.

On that occasion, we presented a number of disk-related injections implemented in AgenTEST, including:

  • BurnIO: Runs disk-intensive processes, simulating a noisy neighbor or a faulty disk. It is possible to specify the number of IOPS to burn and the mount point to be affected.
  • FillDISK: Writes a huge file to the specified root device, filling up the root disk. It is possible to specify the percentage of free disk space to be consumed.
  • CorruptHDFS: Corrupts one HDFS file, using the size and offset specified as input.
  • UNMOUNT: Unmounts a device's mount point.
  • RONLY: Re-mounts the specified mount point of a device as read-only.

Although these injections are useful, they act only at the mount-point level and, because the control they offer is low-fidelity (simulating a noisy neighbor or a faulty disk, for example), they provide no guarantee of actually interfering with the application under test.

In this post, we will present a new set of low-level, highly-controlled disk injections (HCDI), recently added to Cloudera’s fault-injection portfolio, that do guarantee such interference. With these new injections, we are now able to target specific files and/or folders (not possible previously). Moreover, we can decide when injections should occur in a fine-grained way: while reading, writing, opening, closing, or any combination of these. Finally, the expressivity of the injection parameters is greatly improved; for example, we are able to introduce a specific delay (in ms) or simulate a specific error when accessing a file, and we can even specify the probability of such injections.

New Injections and Configuration

These new HCDIs include:

  1. DDELAY (disk access delay). Introduces a configurable latency for accesses to a specific file (or folder). It is possible to specify the access mode to intercept (e.g. while opening, reading, writing, or closing the file), the probability that the injection will occur, and the actual delay in ms.
  2. DCORRUPT (disk data corruption). Corrupts a configurable percentage of the data read from or written to a file (or folder). It is possible to specify the access mode to target (reading and/or writing), the probability of hitting the injection, and the percentage of bytes to corrupt during each access.
  3. DFAIL (disk access failure). Simulates failures while accessing a specific file (or folder). As with the other injections, it is possible to specify the access mode to target (open, read, write, close), the probability that the injection will occur, and finally, the error code to return when the injection is hit.

AgenTEST activates/deactivates these injections following the mechanism described in the previous post: the injection to apply is encoded in the name of a file dropped into a watched folder. For example, let’s assume that AgenTEST is watching the folder /tmp/AgenTEST-inj and that we run:
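A sketch of what such a command might look like, assuming a hypothetical file-name encoding of injection@target@modes@probability@parameter (the real AgenTEST naming scheme may differ; "/" in the target path is escaped here as ":"):

```shell
# Hypothetical encoding -- the exact AgenTEST naming scheme may differ.
# DDELAY on /foo/bar, read/write accesses, 100% probability, 5 ms delay.
mkdir -p /tmp/AgenTEST-inj
touch '/tmp/AgenTEST-inj/DDELAY@:foo:bar@RW@100@5'
```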

AgenTEST will introduce a delay of 5ms on every read/write operation (100% probability) on the files in /foo/bar. The injection will stay in place until we delete the file:
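Assuming the same hypothetical file-name encoding (injection@target@modes@probability@parameter, with "/" escaped as ":"), the deletion might look like:

```shell
# Hypothetical encoding; removing the marker file deactivates the injection.
rm -f '/tmp/AgenTEST-inj/DDELAY@:foo:bar@RW@100@5'
```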

Particularly interesting is the DFAIL injection:
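Again assuming a hypothetical file-name encoding of injection@target@modes@probability@error (the real AgenTEST syntax may differ; "/" in the target path is escaped here as ":"):

```shell
# Hypothetical encoding: DFAIL on /foo/bar, read/write accesses,
# 50% probability, returning EIO (I/O error) when the injection is hit.
mkdir -p /tmp/AgenTEST-inj
touch '/tmp/AgenTEST-inj/DFAIL@:foo:bar@RW@50@EIO'
```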

The last parameter is the error code to return when the injection is hit. In this particular case, the injection will generate an I/O error on half of the accesses (50% probability).

The table below lists all possible reported codes:

[Table: error codes that DFAIL can return]

AgenTEST using HCDI

There are two requirements for using HCDIs:

  1. Setting the variable LD_PRELOAD (explained in more detail below)
  2. Providing the configuration file, HCDI_CONFIG, which defaults to ~/.hcdi

The HCDI_CONFIG file (see example below) contains all the parameters needed to determine what and when to inject. This configuration file can change dynamically, and the injection will adjust accordingly.

Essentially, AgenTEST serves as a “front-end” for these HCDIs: every time a new injection is required, it updates the configuration file. However, modifications can also be made by hand, e.g.:
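As a sketch, a hand-written configuration might look like the following (the syntax is purely illustrative; the real HCDI_CONFIG format may differ):

```
# one injection per line: TYPE  target  access-modes  probability%  parameter
DDELAY   /foo/bar   RW   100   5
DFAIL    /foo/bar   RW   50    EIO
```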

How Does It Work?

Executable programs depend on a number of shared libraries (unless they are statically linked). To see which libraries a program depends on, you can list them with the Linux ldd command.

For example, /bin/date depends on the following libraries:
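On a typical x86-64 Linux system the output looks something like this (exact paths, versions, and load addresses vary by distribution):

```shell
ldd /bin/date
#   linux-vdso.so.1 (0x00007ffd...)
#   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)
#   /lib64/ld-linux-x86-64.so.2 (0x00007f...)
```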

When a program is executed, the dynamic linker looks at this list of libraries. It locates the libraries on the filesystem based on configuration files and environment variables, loads them into memory, and links the pieces together into a working process before executing the application.

[Figure: the dynamic linker loading shared libraries at program start]

The dynamic linker also provides a way to pre-load a library that gets inserted first into the chain to selectively override functions provided by other shared libraries. On Linux, this feature is available via the LD_PRELOAD environment variable.

With HCDIs, the idea is to intercept function calls, messages, or events passed between software components and use a custom implementation, or hook, to manipulate them.

[Figure: intercepting library calls with a pre-loaded hook]

This “hooking” approach has three main advantages:

  • There is no need to search for the function definition in the library, such as libc, and change it.
  • There is no need to recompile the library’s source code.
  • The application itself doesn’t know that the calls are being intercepted.

The libc.so.6 in the ldd output is the C runtime library; it provides standard functions such as malloc(), printf(), and localtime(). To override a particular function, we simply build a shared library that exports that function. Inside it, we resolve the original definition of the function using dlsym and delegate to it when needed.

For example:
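A minimal reconstruction of such a hook, based on the description in this post (we use GCC's constructor attribute to resolve the original symbol at load time, playing the role of the _init() function mentioned below; the one-hour shift and the datehack name come from the surrounding text):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <time.h>

/* Pointer to the real libc localtime(), resolved when the library loads. */
static struct tm *(*orig_localtime)(const time_t *timep);

/* Runs automatically at load time (the role played by _init in the post). */
__attribute__((constructor))
static void init(void)
{
    orig_localtime = (struct tm *(*)(const time_t *))
        dlsym(RTLD_NEXT, "localtime");
}

/* Our override: shift the requested time back one hour, then delegate
 * to the real implementation. */
struct tm *localtime(const time_t *timep)
{
    time_t shifted = *timep - 3600;
    return orig_localtime(&shifted);
}
```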

We build this into a shared library:

And then it’s ready to go:

When datehack.so is loaded by an application (in this case, date), the _init() function is automatically invoked, the real C runtime localtime is resolved, and its function pointer is stored away. When the application calls localtime, the supplied timep is shifted back one hour. The date application does not know that the call is being intercepted and that the original parameters are modified by the datehack library. We have used variants of this method to test the vulnerability of CDH and Cloudera Manager to clock skew and leap-second events.

With this approach, it would be sensible to point out issues such as method-signature compatibility, the inability to support statically-linked applications, and the need for configuration at load time. Fortunately, the few stubs that we “hijacked” are very mature (open, fopen, read, write, and so on), so there is little risk that their signatures will change. We are simply injecting ourselves between the OS and the JVM (or any other application that uses the standard POSIX API).

Conclusion

We are confident that the addition of HCDI functionality to our fault-injection suite will amplify our ability to catch critical bugs early in CDH releases before they are shipped. (And as previously described, upstream JIRAs are an inherent part of this process.) In the future, possible extensions to this functionality could involve testing and monitoring of thread creation, mutex access, sockets, and memory allocation.

Francesco Sorrentino and Charlie Helin are Software Engineers at Cloudera.

