Best Practices for Enterprise Data Hub Encryption


Encryption is a key security feature in Cloudera-powered enterprise data hubs (EDHs). This post explains some best practices for deployment of Cloudera Navigator Encrypt for that purpose.

For those unfamiliar with the product, Cloudera Navigator Encrypt provides scalable, high-performance encryption for critical Apache Hadoop data. It utilizes industry-standard AES-256 encryption and provides a transparent layer between the application and filesystem. Cloudera Navigator Encrypt also includes process-based access controls, allowing authorized processes to access encrypted data while simultaneously preventing admins or super-users like root from accessing data that they don’t need to see. Cloudera Navigator Encrypt is part of Cloudera’s overall encryption-at-rest solution, along with HDFS transparent data encryption, which operates at the HDFS level, and Cloudera Navigator Key Trustee, which is a virtual “safe-deposit box” for managing encryption keys.

In this post, we’ll provide a description of the best practices as well as some information about how to customize Cloudera Navigator Encrypt users.

Master Key

The master key is the key the user knows and types when registering Cloudera Navigator Encrypt with Cloudera Navigator Key Trustee. The master key can be in the form of a single passphrase, dual-passphrase, or RSA file. It is uploaded to Navigator Key Trustee.

Deciding which key type you should use is based on your internal security practices and procedures:

  • Single-passphrase: Avoids having a physical key that can be compromised, such as an RSA key. The single passphrase is known by the administrator and can be shared with trusted peers; its length must be between 15 and 32 bytes.
  • Dual-passphrase (passphrase1 + passphrase2): Use this when you want to share the responsibility of encrypting, decrypting, or executing Cloudera Navigator Encrypt operations among two or more people. For example, to encrypt some data on a host, one or more administrators/engineers will be required to perform the operation. One or more administrators/engineers should know passphrase1 while a different set of administrators/engineers should know passphrase2.
  • RSA: Use an RSA file when you need to use a stronger key (one that is longer than 32 bytes). You should use an RSA key that has its own passphrase to help reduce the risk associated with retention of this key, such as theft or compromise. If you select this route, you should ensure that you have strong security policies centered around the storage and backup of your RSA keys.
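
As a minimal sketch of the single-passphrase length rule above, the following shell function checks that a candidate passphrase is between 15 and 32 bytes long. The function name is hypothetical and is not part of Cloudera Navigator Encrypt; it only illustrates the stated range.

```shell
# Hypothetical helper (not part of Cloudera Navigator Encrypt): checks that a
# candidate passphrase is between 15 and 32 bytes long, the range given above.
passphrase_length_ok() {
  local bytes
  # wc -c counts bytes rather than characters; the arithmetic expansion
  # strips any padding that some wc implementations add.
  bytes=$(( $(printf '%s' "$1" | wc -c) ))
  [ "$bytes" -ge 15 ] && [ "$bytes" -le 32 ]
}

if passphrase_length_ok "correct horse battery"; then
  echo "length ok"
else
  echo "length out of range"
fi
```

Counting bytes rather than characters matters when a passphrase contains multi-byte characters, since the limit above is expressed in bytes.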

Cloudera Navigator Encrypt securely stores your master key as a Navigator Key Trustee deposit. This separates the key from the host for non-interactive operations (for example, when you restart a host and mount the disks automatically on startup). Cloudera Navigator Encrypt can perform the required operations on its own without requesting the master key from an administrator. While non-interactive operations are handled automatically, administrative actions such as adding ACL rules, decrypting data, and encrypting new data directories still require an administrator to enter the master key.

Naming the Mount Points

Although there is no standard naming convention for mount points, what follows is Cloudera’s recommendation for a Cloudera Navigator Encrypt deployment.

Generally, Cloudera recommends that mount points use a short name with a descriptive noun; most are named according to their intended usage. For example, /backup or /mirror might be used for a mount point intended to store backup data. Some administrators also like to add a descriptive _mnt suffix to make explicit the difference between normal system mount points and additional mount points for selective use. Using _mnt makes it easier to identify the parameters that were used when a data set was encrypted, and it makes an issue easier to support when it is escalated to the Cloudera Support team. Furthermore, when you read the encryption path from the symlink created while preparing an encrypted data space, you will be able to easily identify its components.

For example:

/encrypted_mnt/hdfs_cat/directory_to_encrypt/file.txt

Avoid mixing the name of the mount point with the category you intend to use later. (More about that in the next section.)

For example, let’s assume that we would like to encrypt /var/lib/mysql. It would not be desirable to name the mount point as mysql_backup while also using the category @mysql. The symlink that would be generated on the system would be similar to what you will find below.

mysql -> /mysql_backup/mysql/var/lib/mysql

While this name will certainly work, it is neither easy to read nor intuitive for anyone attempting to assist you. A better choice is a mount point named db_backup and a category of @secure_db for the path /var/lib/mysql. The symlink generated on the system is then far more readable and intuitive, as you can see below.

mysql -> /db_backup/secure_db/var/lib/mysql
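
The symlink targets in the two examples above follow the pattern /<mount point>/<category>/<original path>. As a rough sketch, the hypothetical helper below composes that target so you can preview how a given combination of names will read before committing to it. The function name is an assumption for illustration only.

```shell
# Hypothetical helper: composes the location that the symlink points to,
# following the pattern shown above: /<mount point>/<category>/<original path>.
encrypted_target() {
  local mount_point="$1" category="$2" original_path="$3"
  # Drop a leading "@" from the category and surrounding slashes from the
  # mount point so the pieces join cleanly.
  category=${category#@}
  mount_point=${mount_point#/}
  mount_point=${mount_point%/}
  printf '/%s/%s%s\n' "$mount_point" "$category" "$original_path"
}

encrypted_target /db_backup @secure_db /var/lib/mysql
```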

Naming the Categories

Categories are an important part of Cloudera Navigator Encrypt. They act as containers or anchors to which you can attach rules. The rules themselves are created to authorize individual processes for access to specific encrypted data in a category.

For example, in:

navencrypt acl --add --rule="ALLOW @secure_db * /usr/sbin/mysqld"

and

navencrypt-move encrypt @secure_db /var/lib/mysql /db_backup

These two commands allow the process /usr/sbin/mysqld to access the data that will be encrypted in /db_backup/secure_db/var/lib/mysql. In this case, you can see that the category in use is @secure_db. This category links the allowed process to the encrypted data path.
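
Because the category is what ties the ACL rule to the encrypted path, a mistyped category in either command silently breaks that link. The dry-run sketch below prints both commands from one shared variable rather than executing them. It assumes the tools are invoked as navencrypt and navencrypt-move; the variable and function names are illustrative.

```shell
# Dry-run sketch: prints the two commands instead of executing them.
# Assumes the tools are invoked as navencrypt and navencrypt-move; the
# variable names are illustrative.
CATEGORY="@secure_db"
PROCESS="/usr/sbin/mysqld"
DATA_PATH="/var/lib/mysql"
MOUNT_POINT="/db_backup"

print_encrypt_commands() {
  # The same $CATEGORY is used in both commands so the ACL rule and the
  # encrypted data cannot end up in different categories.
  printf 'navencrypt acl --add --rule="ALLOW %s * %s"\n' "$CATEGORY" "$PROCESS"
  printf 'navencrypt-move encrypt %s %s %s\n' "$CATEGORY" "$DATA_PATH" "$MOUNT_POINT"
}

print_encrypt_commands
```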

Cloudera recommends that you use an action+noun when naming a particular category. For example:

@guard_hive_metastore
@protect_db_logs
@secure_tmp
@encapsulate_db

Once you have determined how to name the categories and mount points, you have to make a few additional decisions:

  • What devices, and for which mount points, will be used?
  • Where do you want to store encrypted data?
  • What type of device will you use (loop devices, virtual devices or physical devices)? There are advantages and disadvantages to each device type.

What to Encrypt on Loop Devices

A loop device is a pseudo-device that makes a file accessible as a block device. Cloudera recommends that this method be used to encrypt temporary files, log files, or small files in place instead of using eCryptfs. (See why below.) We do not recommend using loop devices to encrypt databases, HDFS blocks, Apache Cassandra data directories, or any other large storage spaces. For these use cases, we instead recommend full block-device encryption (Physical Device Encryption).

Why Loop Devices Instead of eCryptfs?

eCryptfs, a cryptographic filesystem introduced in 2009, was a successful attempt at file-level encryption, but it is nearing the end of its life cycle. Linux distributions like RHEL and CentOS 7.X have deprecated it, and other distributions are sure to follow soon.

Testing reveals that the performance of eCryptfs is not as good as that of DM-Crypt for a number of reasons, most of them related to the fact that it operates at a level above DM-Crypt.

If you start using eCryptfs, and you plan to eventually migrate to a distribution that no longer provides support for it, you will likely have to take additional steps before you complete the migration. You will not only have to migrate all your data, but depending on your use case, you may need to decrypt all your data as well—which often requires extensive resources and time.

What to Encrypt on Physical Devices

In the Apache Hadoop ecosystem, the Apache Hive Metastore (also used by Apache Impala) is a candidate for encryption. These services also produce log files and temporary files that might contain sensitive data. Cloudera doesn’t recommend running any other service in the cluster that might affect its performance, but if there are systems outside the cluster that might save sensitive data, you can encrypt that data with Cloudera Navigator Encrypt. Any sensitive data that your company might handle (such as personal information, credit-card numbers, taxpayer identification numbers, social security numbers, fingerprints, financial information, bank statements, and medical records) is a strong candidate for such encryption.

In general, Cloudera recommends that data inside the cluster storage layer (HDFS) be encrypted using HDFS encryption instead of Cloudera Navigator Encrypt. (More on that later.)

Why DM-Crypt and AES-NI For Physical Disk Encryption?

In one word: performance.

On-chip AES-NI acceleration improves the I/O performance of data encrypted using AES ciphers by implementing some sub-steps of the AES algorithm directly in hardware. According to Intel, this performance improvement removes one of the main objections to using encryption to protect data: the performance penalty.

DM-Crypt at its core has been designed to take advantage of this hardware acceleration automatically when using AES ciphers. AES-NI was and is still actively developed by Intel, and as such, Cloudera Engineering has direct access to the development roadmap.
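
To see whether a host can benefit from this acceleration, one option on x86 Linux is to look for the aes CPU flag in /proc/cpuinfo. The following is a minimal sketch under that assumption; the function name is hypothetical, and the optional argument exists only so the check can be run against a saved copy of the file.

```shell
# Hypothetical helper: reports whether the CPU advertises the "aes" flag,
# which Linux lists in /proc/cpuinfo when AES-NI is available on x86.
# The optional argument lets you point it at a saved copy of cpuinfo.
has_aes_flag() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # Scan every "flags" line and succeed only if one field is exactly "aes".
  awk '/^flags[[:space:]]*:/ { for (i = 1; i <= NF; i++) if ($i == "aes") found = 1 }
       END { exit !found }' "$cpuinfo" 2>/dev/null
}

if has_aes_flag; then
  echo "CPU reports AES instructions"
else
  echo "no aes flag found"
fi
```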

Available Encryption Algorithms

Cloudera Navigator Encrypt supports different encryption algorithms when using DM-Crypt based on what is provided by your host operating system and physical hardware; it does not provide any encryption algorithms on its own. When you are preparing a device for encryption, you can review the available algorithms on your system by inspecting the crypto information provided by the proc filesystem (/proc/crypto) or through the DM-Crypt tool sets. For additional details about algorithms and their implementation, consult the man pages for cryptsetup.

For example:
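
On a typical Linux host, the kernel lists its registered algorithms in /proc/crypto, one entry per implementation. The sketch below prints the distinct algorithm names from that file. The function name is hypothetical, and the optional argument is there only so it can be pointed at a saved copy.

```shell
# Hypothetical helper: prints the distinct algorithm names registered with
# the kernel crypto API, read from /proc/crypto (or a saved copy of it).
list_kernel_ciphers() {
  local crypto_file="${1:-/proc/crypto}"
  # Each entry in /proc/crypto starts with a line such as "name : aes".
  awk -F': *' '/^name[[:space:]]*:/ { print $2 }' "$crypto_file" | sort -u
}

list_kernel_ciphers 2>/dev/null || true
```

The same algorithm often appears several times in /proc/crypto because more than one driver implements it, which is why the output is de-duplicated.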

Cloudera Navigator Encrypt will select an AES encryption algorithm by default, using a key size of 256 bits, when preparing a device for eCryptfs or DM-Crypt use. The AES encryption algorithms are by far the most widely accepted and deployed algorithms.

OS-level Encryption versus HDFS Encryption

As noted previously, HDFS transparent data encryption (TDE) is the Cloudera-recommended encryption-at-rest option inside of the EDH storage layer. TDE was designed and developed to take advantage of the many facets of HDFS, which today make it a scalable and reliable storage layer for an EDH. HDFS TDE is generally easier to deploy, is fully integrated with EDH, and implements key management through a robust key provider that extends core native components.

While Cloudera Navigator Encrypt can also encrypt the HDFS blocks inside of the JVM, it does have some disadvantages versus in-cluster HDFS encryption. For that reason, Cloudera Navigator Encrypt is primarily recommended as a solution to secure sensitive data outside the EDH storage layer (that is, HDFS).

Cloudera Navigator Encrypt relies on a kernel module that is built against a system’s local kernel using DKMS. As the module is built against the system’s active kernel and loaded into the kernel space, it performs quite well. Operating at a lower level allows it to hook into the filesystem in ways that userspace tools can’t. However, users have to maintain development packages such as kernel-devel, kernel-headers, and gcc on their systems. Since the module is rebuilt against each kernel release that is active on the host, significant changes in the kernel architecture can break the module and prevent you from accessing data you have previously encrypted until the module is successfully rebuilt. This is a problem that does not occur when using HDFS encryption; rather, HDFS encryption occurs at the Hadoop filesystem level inside of individual client JVMs.
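
Since the module has to be rebuilt for each kernel, it can help to know up front which development packages a given kernel release calls for. The sketch below prints those package names. It assumes an RPM-based host where kernel-devel and kernel-headers are versioned to match the kernel release, which may not hold on every distribution; the function name is hypothetical.

```shell
# Hypothetical helper: prints the packages named above for a given kernel
# release. Assumes an RPM-based host where kernel-devel and kernel-headers
# packages are versioned to match the kernel release.
module_build_packages() {
  local release="${1:-$(uname -r)}"
  printf '%s\n' "kernel-devel-$release" "kernel-headers-$release" "gcc"
}

module_build_packages
```

On such a host, each printed name could then be passed to rpm -q to see whether it is installed before a kernel upgrade is rolled out.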

Another advantage of HDFS encryption is that the KMI (Key Management Infrastructure) is now integrated with Cloudera Manager. This integration makes it easier to manage, monitor, and deploy HDFS encryption. While Cloudera Navigator Encrypt can still use the Key Trustee Server, it is not integrated with Cloudera Manager. If a node fails for any reason, there is nothing that will notify you or otherwise indicate a problem.

Cloudera Navigator Encrypt and Navigator Key Trustee HA

The Cloudera Navigator Key Trustee Server is an opaque object store that provides centralized key management inside as well as outside of an EDH. In production deployments, it is critical that this part of your infrastructure be deployed in an isolated and highly available (HA) configuration. During deployments of the Key Trustee Server, you should take steps to ensure the following criteria are met:

  • Isolated and discrete cluster
  • Hardened physical and network security architectures
  • Controlled user access
  • Physical hardware isolation to ensure performance, reliability, and resiliency

Navigator Key Trustee HA today implements a hot-standby failover system that allows Key Trustee clients, such as those found inside an EDH (KMS) and outside an EDH (Cloudera Navigator Encrypt), to continue operating in a read-only mode until full service availability is restored.

As a Key Trustee client, Cloudera Navigator Encrypt can take advantage of automatic failover for read-only operations in new releases of this client. When using the --passive switch during client registration, the client will become HA aware and have the ability to operate in a fashion similar to the KMS.

In this mode, Key Trustee clients can read existing key material from the Key Management Infrastructure (KMI) but cannot submit new deposits for storage.

  • For previously registered clients inside of EDH, such as the Key Trustee KMS, your system will continue to be able to perform reads and writes to existing encryption zones.
  • For previously registered clients outside of EDH, such as Cloudera Navigator Encrypt, your systems will continue to be able to mount existing block devices and operate normally on those mounted devices.
  • In both cases, new client registrations, master key submissions, and new encryption zone key submissions will not be permitted until full service availability is restored.
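
The three cases above amount to a simple rule: operations that only need key material already on deposit keep working, while anything that needs a new deposit is blocked. The sketch below models that rule purely for illustration. The operation names are hypothetical labels, not real commands or API calls.

```shell
# Illustrative only: models which client operations the text above says
# remain available while Key Trustee is in read-only failover mode.
# The operation names are hypothetical labels, not real commands.
allowed_during_failover() {
  case "$1" in
    read-existing-key|mount-existing-device|read-write-existing-zone)
      return 0 ;;   # relies only on key material already deposited
    register-client|submit-master-key|create-zone-key)
      return 1 ;;   # would require a new deposit
    *)
      return 2 ;;   # operation not covered by this sketch
  esac
}

for op in mount-existing-device register-client; do
  if allowed_during_failover "$op"; then
    echo "$op: allowed"
  else
    echo "$op: blocked until full service is restored"
  fi
done
```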

Conclusion

After reading this post, you should have a good understanding of the basic best practices for implementing encryption across a Cloudera-powered enterprise data hub. For more information about security generally, consult the Cloudera Security Guide.

Alex Gonzalez is a Software Engineer at Cloudera.

Luke Hebert is a Backline Customer Operations Engineer and Security SME at Cloudera.
