New in CDH 5.3: Transparent Encryption in HDFS


Support for transparent, end-to-end encryption in HDFS is now available and production-ready (and shipping inside CDH 5.3 and later). Here’s how it works.

Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.

Cloudera’s HDFS and Cloudera Navigator Key Trustee (formerly Gazzang zTrustee) engineering teams did this work under HDFS-6134 in collaboration with engineers at Intel as an extension of earlier Project Rhino work. In this post, we’ll explain how it works, and how to use it.

Background

In a traditional data management software/hardware stack, encryption can be performed at different layers, each with different pros and cons:

  • Application-level encryption is the most secure and flexible approach. The application has ultimate control over what is encrypted and can precisely reflect the requirements of the user. However, writing applications to handle encryption is difficult. It also relies on the application supporting encryption, which may rule this approach out for many applications an organization already uses. If encryption isn’t integrated into the application well, security can be compromised (keys or credentials can be exposed).
  • Database-level encryption is similar to application-level encryption. Most database vendors offer some form of encryption; however, database encryption often comes with performance trade-offs. One example is that indexes cannot be encrypted.
  • Filesystem-level encryption offers high performance, application transparency, and is typically easy to deploy. However, it can’t model some application-level policies. For instance, multi-tenant applications might require per-user encryption. A database might require different encryption settings for each column stored within a single file.
  • Disk-level encryption is easy to deploy and fast but also quite inflexible. In practice, it protects only against physical theft.

HDFS transparent encryption sits between database- and filesystem-level encryption in this stack. This approach has multiple benefits:

  • HDFS encryption can provide good performance and existing Hadoop applications can run transparently on encrypted data.
  • HDFS encryption prevents attacks at the filesystem-level and below (so-called “OS-level attacks”). The operating system and disk only interact with encrypted bytes because the data is already encrypted by HDFS.
  • Data is encrypted all the way to the client, securing data both when it is at rest and in transit.
  • Key management is handled externally from HDFS, with its own set of per-key ACLs controlling access. Crucially, this approach allows a separation of duties: the key administrator does not have to be the HDFS administrator, and acts as another policy evaluation point.

This type of encryption can be an important part of a certification process for industrial or governmental regulatory compliance, a requirement in industries like healthcare (HIPAA), card payment (PCI DSS), and the US government (FISMA).

Design

Securely implementing encryption in HDFS presents some unique challenges. As a distributed filesystem, performance and scalability are primary concerns. Data also needs to be encrypted while being transferred over the network. Finally, as HDFS is typically run as a multi-user system, we need to be careful not to expose sensitive information to other users of the system, particularly admins who might have HDFS superuser access or root shell access to cluster machines.

With transparent encryption being a primary goal, all the above needs to happen without requiring any changes to user application code. Encryption also needs to support the standard HDFS access methods, including WebHDFS, HttpFS, FUSE, and NFS.

Key Management via the Key Management Server

Integrating HDFS with an enterprise keystore, such as Cloudera Navigator Key Trustee, was an important design goal. However, most keystores are not designed for the request rates driven by Hadoop workloads, and also do not expose a standardized API. For these reasons, we developed the Hadoop Key Management Server (KMS).

Figure 1: High-level architecture of how HDFS clients and the NameNode interact with an enterprise keystore through the Hadoop Key Management Server

The KMS acts as a proxy between clients on the cluster and a backing keystore, exposing Hadoop’s KeyProvider interface to clients via a REST API. The KMS was designed such that any keystore with the required functionality can be plugged in with a little integration effort.
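As a rough illustration of that REST interface, here are two read-only KMS calls one might issue with curl. The hostname and key name are made up, and the port shown (16000, with endpoint root /kms/v1) was the KMS default in this release; check the KMS documentation for your deployment:

```shell
# List the names of the keys the caller is allowed to see
# (host, port, and key name below are illustrative assumptions).
curl 'http://kms.example.com:16000/kms/v1/keys/names'

# Fetch the metadata (cipher, length, version count) for one key.
curl 'http://kms.example.com:16000/kms/v1/key/mykey/_metadata'
```

Authentication (Kerberos/SPNEGO in a secured cluster) is omitted here for brevity.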

Note that the KMS doesn’t itself store keys (other than temporarily in its cache). It’s up to the enterprise keystore to be the authoritative storage for keys, and to ensure that keys can never be lost—as a lost key is equivalent to destruction of data. For production use, two or more redundant enterprise keystores should be deployed.

The KMS also supports a range of ACLs that control access to keys and key operations on a granular basis. This feature can be used, for instance, to only grant users access to certain keys, or to restrict the NameNode and DataNode from accessing key material entirely. Information on how to configure the KMS is available in our documentation.
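As a sketch of what such a policy might look like, here is a hypothetical excerpt from the KMS ACL configuration file, kms-acls.xml (the key name and user names are made up; see the KMS documentation for the full list of supported properties):

```xml
<!-- Only alice and bob may decrypt EDEKs for the key "mykey". -->
<property>
  <name>key.acl.mykey.DECRYPT_EEK</name>
  <value>alice,bob</value>
</property>

<!-- Prevent the HDFS superuser from ever obtaining DEKs,
     even though it can read the EDEKs stored in HDFS. -->
<property>
  <name>hadoop.kms.blacklist.DECRYPT_EEK</name>
  <value>hdfs</value>
</property>
```

The second property is what enables the separation of duties described above: the HDFS administrator can manage the filesystem without ever being able to read plaintext.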

Accessing Data Within an Encryption Zone

This new architecture introduces the concept of an encryption zone (EZ), which is a directory in HDFS whose contents will be automatically encrypted on write and decrypted on read. Encryption zones always start off empty, and renames are not supported into or out of EZs. Consequently, an EZ’s entire contents are always encrypted.

Figure 2: Interaction among encryption zone keys (EZ keys), data encryption keys (DEKs), encrypted data encryption keys (EDEKs), files, and encrypted files

When an EZ is created, the administrator specifies an encryption zone key (EZ Key) that is already stored in the backing keystore. The EZ Key encrypts the data encryption keys (DEKs) that are used in turn to encrypt each file. DEKs are encrypted with the EZ key to form an encrypted data encryption key (EDEK), which is stored on the NameNode via an extended attribute on the file (1).

To encrypt a file, the client retrieves a new EDEK from the NameNode, and then asks the KMS to decrypt it with the corresponding EZ key. This step results in a DEK (2), which the client can use to encrypt their data (3).

To decrypt a file, the client needs to again decrypt the file’s EDEK with the EZ key to get the DEK (2). Then, the client reads the encrypted data and decrypts it with the DEK (4).
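The two-level (envelope) key scheme above can be mimicked outside of Hadoop. The following is a minimal sketch using openssl on the command line; the file names and the use of AES-CTR mirror the HDFS design, but everything here is illustrative—in a real cluster the EZ key never leaves the KMS/keystore, and HDFS performs these steps for you:

```shell
# The EZ key, which in HDFS is held only by the KMS/keystore (128-bit, hex).
EZ_KEY=$(openssl rand -hex 16)
IV_WRAP=$(openssl rand -hex 16)
IV_DATA=$(openssl rand -hex 16)

# 1. Generate a fresh DEK for the file.
DEK=$(openssl rand -hex 16)

# 2. Wrap the DEK with the EZ key, producing the EDEK
#    (this is what the NameNode stores as a file xattr).
echo -n "$DEK" | openssl enc -aes-128-ctr -K "$EZ_KEY" -iv "$IV_WRAP" -out edek.bin

# 3. The client encrypts the file contents with the DEK.
echo 'hello, world' > plaintext.txt
openssl enc -aes-128-ctr -K "$DEK" -iv "$IV_DATA" -in plaintext.txt -out ciphertext.bin

# 4. To read: unwrap the EDEK back into the DEK, then decrypt the data.
DEK2=$(openssl enc -d -aes-128-ctr -K "$EZ_KEY" -iv "$IV_WRAP" -in edek.bin)
openssl enc -d -aes-128-ctr -K "$DEK2" -iv "$IV_DATA" -in ciphertext.bin -out decrypted.txt
```

Note that an attacker who obtains edek.bin and ciphertext.bin alone learns nothing useful: recovering the plaintext requires the EZ key, which only the KMS can use.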

Figures 3 & 4: The flow of events required to write a new file to an encryption zone

The above diagrams describe the process of writing a new encrypted file in more detail. One important detail is the per-EZ EDEK cache on the NameNode, which is populated in the background. This approach obviates having to call the KMS to create a new EDEK on each create call. Furthermore, note that the EZ key is never directly handled by HDFS, as generating and decrypting EDEKs happens on the KMS.

For more in-depth implementation details, please refer to the design document posted on HDFS-6134 and follow-on work at HDFS-6891.

With regard to security, there are a few things worth mentioning:

  • First, the encryption is end-to-end. Encryption and decryption happens on the client, so unencrypted data is never available to HDFS and data is encrypted when it is at-rest as well as in-flight. This approach limits the need for the user to trust the system, and precludes threats such as an attacker walking away with DataNode hard drives or setting up a network sniffer.
  • Second, HDFS never directly handles or stores sensitive key material (DEKs or EZ keys)—thus compromising the HDFS daemons themselves does not leak any sensitive material. Encryption keys are stored separately on the keystore (persistently) and KMS (cached in-memory). A secure environment will also typically separate the roles of the KMS/keystore administrator and the HDFS administrator, meaning that no single person will have access to all the encrypted data and also all the encryption keys. KMS ACLs will also be configured to only allow end-users access to key material.
  • Finally, since each file is encrypted with a unique DEK and each EZ can have a different key, the potential damage from a single rogue user is limited. A rogue user can only access EDEKs and ciphertext of files for which they have HDFS permissions, and can only decrypt EDEKs for which they have KMS permissions. Their ability to access plaintext is limited to the intersection of the two. In a secure setup, both sets of permissions will be heavily restricted.

Configuration and New APIs

Interacting with the KMS and creating encryption zones requires the use of two new CLI commands: hadoop key and hdfs crypto. A fuller explanation is available in our documentation, but here’s a short example of how you might quickly get started on a dev cluster.

As a normal user, create a new encryption key:
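For example (the key name “mykey” is illustrative):

```shell
hadoop key create mykey
```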

As the superuser, create a new empty directory anywhere in the HDFS namespace and make it an encryption zone:
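For example (the path “/zone” and key name “mykey” are illustrative):

```shell
hadoop fs -mkdir /zone
hdfs crypto -createZone -keyName mykey -path /zone
```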

Change its ownership to the normal user:
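For example (the user name is illustrative):

```shell
hadoop fs -chown myuser:myuser /zone
```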

As the normal user, put a file in, read it out:
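For example (the file name is illustrative; the cat returns plaintext because decryption happens transparently on the client):

```shell
hadoop fs -put helloWorld /zone
hadoop fs -cat /zone/helloWorld
```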

Performance

Currently, AES-CTR is the only supported encryption algorithm and can be used either with a 128- or 256-bit encryption key (when the unlimited strength JCE is installed). A very important optimization was making use of hardware acceleration in OpenSSL 1.0.1e using the AES-NI instruction set, which can be an order of magnitude faster than software implementations of AES. With AES-NI, our preliminary performance evaluation with TestDFSIO shows negligible overhead for writes and only a minor impact on reads (~7.5%) with datasets larger than memory.

The cipher suite was designed to be pluggable. Adding support for additional cipher suites like AES-GCM that provide authenticated encryption is future work.

Conclusion

Transparent encryption in HDFS enables new use cases for Hadoop, particularly in high-security environments with regulatory requirements. This encryption is end-to-end, meaning that data is encrypted both at-rest and in-flight; encryption and decryption can only be done by the client. HDFS and HDFS administrators never have access to sensitive key material or unencrypted plaintext, further enhancing security. Furthermore, when using AES-NI optimizations, encryption imposes only a minor performance overhead on read and write throughput.

Acknowledgements

Transparent encryption for HDFS was developed upstream as part of a community effort involving contributors from multiple companies. The HDFS aspects of this feature were developed primarily by the authors of this blog post: Charles Lamb, Yi Liu, and Andrew Wang. Alejandro Abdelnur was instrumental in overseeing the overall design. Alejandro and Arun Suresh were responsible for implementation of the KMS, as well as encryption-related work within MapReduce. Mat Crocker and Anthony Young-Garner were responsible for integrating Navigator Key Trustee with the KMS.

Charles Lamb is a Software Engineer at Cloudera, primarily working on HDFS.

Yi Liu is a Senior Software Engineer at Intel and a Hadoop committer.

Andrew Wang is a Software Engineer at Cloudera, and a Hadoop committer/PMC member.


Learn more about Project Rhino—work done and work still to be completed—in this live Webinar on Thurs., Jan. 29, at 10am PT.


16 responses on “New in CDH 5.3: Transparent Encryption in HDFS”

  1. Hari Sekhon

    The only other way I can think of to solve this portability issue without spending a fortune on a migration cluster is to decrypt all data in-place, migrate the HDFS software, and re-encrypt all data afterwards using another KMS if migrating off Navigator KeyTrustee. This would likely take weeks, during which you’d be in breach of compliance.

    Is decrypting the data on disk simply a case of distcp’ing every file to a non-encrypted directory on the same HDFS cluster? There doesn’t seem to be any explicit documentation on the process of removing encryption from files, although that seems implied given that encryption happens via client-side writes in the first place.

    Regards,

    Hari Sekhon

  2. Charles Lamb

    Yes, that’s correct. To un-encrypt a file, you need to copy it from the Encryption Zone to a non-Encryption Zone. DistCp is one way. Of course, any copy utility can be used.

  3. Aarush R

    Thanks Justin. I was looking more for the functional and performance differences, primarily to better understand how secure one is over the other for encrypting PII data. Being free and open source definitely helps, but as you can imagine, PII data being compromised is a huge risk, so I am trying to better understand the technical capabilities of one solution over the other.

    1. Justin Kestelyn (@kestelyn) Post author

      Aarush,

      In summary:

      Vormetric encryption happens at the Linux filesystem level, where it is not “aware” of the workload running on top of it. In contrast, HDFS encryption, being tightly integrated with HDFS itself, is able to perform tasks in parallel with existing bulk read/write operations in a way that results in highly optimized performance.

  4. Rajat Sharma

    I just didn’t understand how exactly the data is decrypted when a piece of code wants to perform an operation on data present on a datanode in a Hadoop cluster.

    1. Robert Marshall

      HDFS encryption is on a directory-by-directory basis (encryption zone = HDFS directory). After you create a directory as an EZ, any file stored in that directory (on any node) will be encrypted.

  5. Naval

    Can someone provide an estimate of the memory consumption of the encryption and decryption steps compared to non-encrypted I/O? I am facing Java heap space out-of-memory issues while reading files from an encrypted HDFS.

  6. SAM

    How will a program that uses the Hadoop FS API get decrypted data from a file that’s encrypted this way? Does the program have to interact with the Key Manager in some manner?

  7. David L

    Following up on the Vormetric discussion: if Vormetric encryption happens at the Linux filesystem level, that seems to imply it is not end-to-end like HDFS encryption, i.e., it only addresses data-at-rest encryption but not data in-flight. Is that correct?

    1. Justin Kestelyn (@kestelyn) Post author

      You should probably check with Vormetric directly about that. Don’t want to provide inaccurate info.

  8. Manohar

    A client reading or writing data in an EZ on an HDFS cluster will have access to the DEK to decrypt or encrypt, respectively. Does the client holding the DEK constitute a vulnerability, since clients often act as gateways to the HDFS cluster?

  9. Rachit

    Can I encrypt the full HDFS namespace?
    I want to create an encryption zone at the root path “/”. Is that possible?
    Also, if I want to turn an encryption zone back into a non-encrypted zone, how can I do that without moving the files to a different folder?