Hadoop network encryption is a feature introduced in Apache Hadoop 2.0.2-alpha and in CDH4.1.
In this blog post, we’ll first cover Hadoop’s pre-existing security capabilities. Then, we’ll explain why network encryption may be required. We’ll also provide some details on how it has been implemented. At the end of this blog post, you’ll get step-by-step instructions to help you set up a Hadoop cluster with network encryption.
A Bit of History on Hadoop Security
Starting with Apache Hadoop 0.20.20x and available in Hadoop 1 and Hadoop 2 releases (as well as CDH3 and CDH4 releases), Hadoop supports Kerberos-based authentication. This is commonly referred to as Hadoop Security. When Hadoop Security is enabled it requires users to authenticate (using Kerberos) in order to read and write data in HDFS or to submit and manage MapReduce jobs. In addition, all Hadoop services authenticate with each other using Kerberos.
Kerberos authentication is available for Hadoop’s RPC protocol as well as the HTTP protocol. The latter is done using the Kerberos HTTP SPNEGO authentication standard, and it can be used to protect Hadoop’s Web UIs, WebHDFS, and HttpFS.
Why Network Encryption?
While Hadoop Security provides Kerberos authentication, it does not protect data as it travels through the network; all network traffic still goes on the clear. Hadoop is a distributed system typically running on several machines, which means that data must be transmitted over the network on a regular basis.
If your Hadoop cluster holds sensitive information (financial data, credit card transactions, healthcare information, etc.), it may be required to ensure that data is also protected while in transit through the network (to avoid eavesdropping and man-in-the-middle attacks). This is no different than accessing your bank’s website using a secure connection (using HTTPS) when you connect to it.
To address these kinds of use cases, network encryption was added to Hadoop.
Securing Hadoop’s Network Interactions
First, let’s review the different types of network interactions found in Hadoop:
- Hadoop RPC calls – These are performed by clients using the Hadoop API, by MapReduce jobs, and among Hadoop services (JobTracker, TaskTrackers, NameNodes, DataNodes).
- HDFS data transfer – Done when reading or writing data to HDFS, by clients using Hadoop API, by MapReduce jobs and among Hadoop services. HDFS data transfers are done using TCP/IP sockets directly.
- MapReduce Shuffle – The shuffle part of a MapReduce job is the process of transferring data from the Map tasks to Reducer tasks. As this transfer is typically between different nodes in the cluster, the shuffle is done using the HTTP protocol.
- Web UIs – The Hadoop daemons provide Web UIs for users and administrators to monitor jobs and the status of the cluster. The Web UIs use the HTTP protocol.
- FSImage operations – These are metadata transfers between the NameNode and the Secondary NameNode. They are done using the HTTP protocol.
This means that Hadoop uses three different network communication protocols:
- Hadoop RPC
- Direct TCP/IP
Hadoop RPC already had support for SASL for network encryption. Thus, we only needed to worry about securing HDFS data transfers and securing HTTP.
Hadoop RPC Encryption
When authentication support was added to Hadoop’s RPC protocol, SASL was used as the security protocol. SASL, or the Simple Authentication and Security Layer, is a framework which abstracts away the actual security implementation details for those higher-level protocols which want to support authentication, message integrity verification, or encryption. SASL does not specify a wire format or the protocols that implement it, but rather it specifies a handshake system whereby the parties involved in a SASL connection (the client and server) will iteratively exchange messages when a connection is first established.
SASL allows for the use of several different security mechanisms for different contexts, e.g. MD5-DIGEST, GSSAPI, or the SASL PLAIN mechanism. For protocols that properly implement the SASL framework, any of these mechanisms can be used interchangeably. Separately from the details of how a given SASL mechanism implements its security constructs, most SASL mechanisms are capable of providing several different levels of quality of protection, or QoP. This allows a single SASL mechanism, e.g. MD5-DIGEST, to optionally provide only authentication (auth), message integrity verification (auth-int), or full message confidentiality/encryption (auth-conf).
What does all of this mean in the context of Hadoop RPC? Since the Hadoop RPC system implements the SASL framework, we can utilize all of these features without any additional implementation complexity. When a Hadoop client that has Kerberos credentials available (e.g. a user running
hadoop fs -ls ...) connects to a Hadoop daemon, the SASL GSSAPI mechanism will be used for authentication. Not all clients have direct access to Kerberos credentials; however, and in these cases, those clients will use Delegation Tokens issued by either the JobTracker or NameNode. When a Hadoop client which has Hadoop token credentials available (e.g. a MapReduce task reading/writing to HDFS) connects to a Hadoop daemon, the SASL MD5-DIGEST mechanism will be used for authentication. Though the default is only to authenticate the connection, in either of these cases the QoP can be optionally set via the
hadoop.rpc.protection configuration property to additionally cause the RPCs to have their integrity verified, or fully encrypted while being transmitted over the wire.
Direct TCP/IP (HDFS Data Transfer Encryption)
Though Hadoop RPC has always supported encryption since the introduction of Hadoop Security, the actual reading and writing of file data between clients and DataNodes does not utilize the Hadoop RPC protocol. Instead, file data sent between clients and DNs is transmitted using the Hadoop Data Transfer Protocol, which does not utilize the SASL framework for authentication. Instead, clients are issued Block Tokens by the NameNode when requesting block locations for a given file, and the client will then present these tokens to the DN when connecting to read or write data. The DN is capable of verifying the authenticity of these Block Tokens by utilizing a secret key shared between the NameNode and DataNodes in the cluster.
Though this system is perfectly sufficient to authenticate clients connecting to DataNodes, the fact that the Data Transfer Protocol does not use SASL means that we do not get the side benefit of being able to set the QoP and trivially add integrity or confidentiality to the protocol.
To address this deficiency of the Data Transfer Protocol, we chose to wrap the existing Data Transfer Protocol with a SASL handshake. We wrapped the existing protocol, instead of wholesale replacing the existing protocol with SASL, in order to maintain wire compatibility in the normal case where wire encryption is disabled. This optional SASL wrapper can be enabled by setting
true in the NameNode and DN configurations and restarting the daemons. When this is enabled, the NameNode will generate and return to clients Data Encryption Keys which can be used as credentials for the MD5-DIGEST SASL mechanism and whose authenticity can be verified by DataNodes based on a shared key between the NameNode and DataNodes, similarly to the way Block Tokens are when only authentication is enabled. Note that, since the Data Encryption Keys are themselves sent to clients over Hadoop RPC, it is necessary to also set the
hadoop.rpc.protection setting to
privacy in order for the system to actually be secure.
For the full details of the HDFS encrypted transport implementation, check out HDFS-3637.
The solution for HTTP encryption was clear: HTTPS. HTTPS is a proven and widely adopted standard for HTTP encryption. Java supports HTTPS, browsers support HTTPS, and many libraries and tools in most operating systems have built in support for HTTPS.
As mentioned before, Hadoop uses HTTP for its Web UIs, for the MapReduce shuffle phase and for FSImage operations between the NameNode and the Secondary Name node. Because the Hadoop HttpServer (based on Jetty) already supports HTTPS, the required changes were minimal. We just had to ensure the HttpServer used by Hadoop is started using HTTPS.
Nodes in a cluster are added and removed dynamically. For each node that is added to the cluster the certificate public key has be added to all other nodes. Similarly, for each node that is removed from the cluster its certificate public key has to be removed from all other nodes. This addition and removal of public key certificates must be done in running tasks without any disruption. Because the default Java keystore only loads the certificates at initialization time, we needed to implement a custom Java keystore to reload certificate public keys if the certificate’s keystore file changed.
In addition, we had to modify the Hadoop HttpServer, the MapReduce shuffle HTTP client, and the FSImage HTTP client code to use the new custom Java keystore.
Setting up Network Encryption with CDH4
Now we’ll explain how to create HTTPS certificates using the Java
Keytool for a Hadoop cluster. Once the certificates have been created and made available to all nodes in the Hadoop cluster, refer to CDH4 documentation to complete Hadoop configuration for network encryption.
The following steps explain how to create the HTTPS certificates and the required keystore and truststore files, assuming we have a cluster consisting of two machines — h001.foo.com and h002.foo.com — and that all Hadoop services are started by a user belonging to the hadoop group.
Each keystore file contains the private key for each certificate, the single truststore file contains all the keys of all certificates. The keystore file is used by the Hadoop HttpServer while the truststore file is used by the client HTTPS connections.
For each node create a certificate for the corresponding host in a different keystore file:
$ keytool -genkey -keystore h001.keystore -keyalg RSA -alias h001 -dname "CN=h001.foo.com,O=Hadoop" -keypass pepepe -storepass pepepe $ keytool -genkey -keystore h002.keystore -keyalg RSA -alias h002 > -dname "CN=h002.foo.com,O=Hadoop" -keypass pepepe -storepass pepepe
For each node export the certificate’s public key in a different certificate file:
$ keytool -exportcert -keystore h001.keystore -alias h001 -file h001.cert -storepass pepepe $ keytool -exportcert -keystore h002.keystore -alias h002 -file h002.cert -storepass pepepe
Create a single truststore containing the public key from all certificates:
$ keytool -import -keystore hadoop.truststore -alias h001 -file h001.cert -noprompt -storepass pepepe $ keytool -import -keystore hadoop.truststore -alias h002 -file h002.cert -noprompt -storepass pepepe
Copy the keystore and trutstore files to the corresponding nodes:
$ scp h001.keystore hadoop.truststore email@example.com:/etc/hadoop/conf/ $ scp h002.keystore hadoop.truststore firstname.lastname@example.org:/etc/hadoop/conf/
Change the permissions of the keystore files to be read only by user and group. Make their group the hadoop group; the truststore file should be made readable by everybody:
$ ssh email@example.com "cd /etc/hadoop/conf;chgrp hadoop h001.keystore; chmod 0440 h001.keystore;chmod 0444 hadoop.truststore" $ ssh firstname.lastname@example.org "cd /etc/hadoop/conf;chgrp hadoop h002.keystore; chmod 0440 h002.keystore;chmod 0444 hadoop.truststore"
Generate public key certificates to install in your browser:
$ openssl x509 -inform der -in h001.cert >> hadoop.PEM $ openssl x509 -inform der -in h002.cert >> hadoop.PEM
Follow the instructions of your browser/OS to install the hadoop.pem certificates file.
Now that the HTTPS certificates have been created and distributed to the machines in the cluster, you’ll need to configure the Hadoop cluster with network encryption. Please refer to the CDH4 documentation, Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport, for detailed instructions on how to configure network encryption in Hadoop.
You have now set-up your cluster to use network encryption!
Alejandro Abdelnur is a Software Engineer on the Platform team, and a PMC member of the Apache Oozie, Hadoop, and Bigtop projects.
Aaron T. Myers (ATM) is also a Software Engineer on the Platform team and a PMC member for Hadoop.