One of the key principles behind Apache Hadoop is the idea that moving computation is cheaper than moving data — we prefer to move the computation to the data whenever possible, rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads” reads where the reader is on the same node as the data:
Initially, local reads in HDFS were handled the same way as remote reads: the client connected to the DataNode via a TCP socket and transferred the data via DataTransferProtocol. This approach was simple, but it had some downsides. For example, the DataNode had to keep a thread around and a TCP socket for each client that was reading a block. There was the overhead of the TCP protocol in the kernel, as well as the overhead of DataTransferProtocol itself. There was room to optimize.
In this post, you’ll learn about an important new optimization for HDFS called secure short-circuit local reads, the benefits of its implementation, and how it can speed up your applications.
Short-Circuit Local Reads with HDFS-2246
In HDFS-2246, Andrew Purtell, Suresh Srinivas, Jitendra Nath Pandey, and Benoy Antony added an optimization called “short-circuit local reads”.
The key idea behind short-circuit local reads is this: because the client and the data are on the same node, there is no need for the DataNode to be in the data path. Rather, the client itself can simply read the data from the local disk. This performance optimization made it into CDH, Cloudera’s distribution of Hadoop and related projects, in CDH3u3.
The implementation of short-circuit local reads found in HDFS-2246, although a good start, came with a number of configuration headaches. System administrators had to change the permissions on the DataNode’s data directories to allow the clients to open the relevant files. They had to specifically whitelist the users who were able to use short-circuit local reads — no other users would be allowed. Typically, those users also had to be placed in a special UNIX group.
The main problem with HDFS-2246 was that it opened up the DataNode’s data directories to the client.
Unfortunately, those permission changes opened up a security hole: Users with the permissions necessary to read the DataNode’s files could simply browse through everything, not just things that they were supposed to have access to. This was a little bit like making the user a super-user! This might be acceptable for a few users — such as the “hbase” user — but in general, it presented problems. So although a few dedicated administrators enabled short-circuit local reads, it was not a common choice.
HDFS-347: Making Short-Circuit Local Reads Secure
The main problem with HDFS-2246 was that it opened up all of the DataNode’s data directories to the client. Instead, what we really want is to share only a few carefully chosen files.
Luckily, UNIX has a mechanism for doing just that, called “file descriptor passing.” HDFS-347 uses this mechanism to implement secure short-circuit local reads. Instead of passing the directory name to the client, the DataNode opens the block file and metadata file and passes them directly to the client. Because the file descriptors are read-only, the client cannot modify the files it was passed. And because it has no access to the block directories itself, it cannot read anything to which it is not supposed to have access.
Windows has a similar mechanism for passing file descriptors between processes. Although Cloudera doesn’t support this yet in Hadoop, in the meantime, Windows users can use the legacy block reader by setting dfs.client.use.legacy.blockreader.local to true.
Caching File Descriptors
HDFS clients often read the same block file many times. (This is particularly true for HBase.) To speed up this case, the old short-circuit local read implementation, HDFS-2246, had a block path cache. This cache allowed the client to reopen a block file that it had already read recently without asking the DataNode for its path.
Instead of a path cache, the new-style short-circuit implementation includes a file descriptor cache named FileInputStreamCache. This is better than a path cache, since it doesn’t require the client to re-open the file to re-read the block. We found that this approach improved performance over the old short-circuit local read implementation.
The size of the cache can be tuned with dfs.client.read.shortcircuit.streams.cache.size, whereas cache timeout is controlled by dfs.client.read.shortcircuit.streams.cache.expiry.ms. The cache can also be turned off by setting its size to 0. Most of the time, the defaults are a good choice. However, if you have an unusually large working set and a high file descriptor limit, you could try increasing it.
With the new-style short-circuit local reads introduced in HDFS-347, any HDFS user can make use of short-circuit reads, not just specifically configured ones. There is also no need to modify which UNIX group the users are in or change the ownership of the DataNode directories. However, because the Java standard library does not include facilities for file descriptor passing, HDFS-347 requires a JNI component in order to function. You will need to have libhadoop.so installed to use it.
In testing, HDFS-347 was the fastest implementation in all cases.
HDFS-347 also requires a UNIX domain socket path to be configured via dfs.domain.socket.path. This path must be secure to prevent unprivileged processes from performing a man-in-the-middle attack. Every path component of the socket path must be owned either by root or by the user who started the DataNode; world-writable or group-writable paths cannot be used.
Luckily, if you install a Cloudera parcel, RPM, or deb, it will create a secure UNIX domain socket path for you in the default location. It will also install libhadoop.so in the right place.
For more information about configuring short-circuit local reads, see the upstream documentation.
So, how fast is this new implementation? I used a program called hio_bench to get some performance statistics. The code for hio_bench is available at https://github.com/cmccabe/hio_test.
These tests were run on an 8-core Intel Xeon 2.13 with 12 hard drives. I used CDH 4.3.1 with an underlying filesystem of ext4. Each number is the average of three runs. Error bars are provided.
HDFS-347 is the fastest implementation in all cases, probably due to the FileInputStreamCache. In contrast, the HDFS-2246 implementation ends up re-opening the ext4 block file many times, and open is an expensive operation.
The short-circuit implementations have a bigger relative advantage in the random read test than in the sequential read test. This is partly because readahead has not been implemented for short-circuit local reads yet. (See HDFS-4697 for a discussion.)
Short-circuit local reads are a great example of an optimization enabled by Hadoop’s model of bringing the computation to the data. They’re also a good example of how, having tackled the challenges of scaling head-on, Cloudera is now tackling the challenges of getting more performance out of each node in the cluster.
If you are using CDH 4.2 or later, give the new implementation a try!
Colin McCabe is a Software Engineer on the Platform team, and a Hadoop Committer.