Learn how replication of Apache Hive metadata and the consistency benefits of automated HDFS snapshots can improve production environments.
A robust backup solution that is both correct and efficient is necessary for all production data management systems. Backup and Disaster Recovery (BDR) is a Cloudera Manager feature in Cloudera Enterprise that allows for consistent, efficient replication and version management of data in CDH clusters. Cloudera Enterprise BDR can be used for creating efficient incremental backups of HDFS and Hive data from multiple clusters, within or between data centers and across wide geographic regions. (One customer successfully used Cloudera Enterprise BDR to replicate 500TB of production data in a single replication task, and keeps that replica consistent with incremental replication of over 50 million files spread across over 500 nodes in a production cluster running mission-critical analytic workloads.)
Cloudera Enterprise BDR supports two replication mechanisms, one for HDFS and one for Hive. In addition, Cloudera Enterprise BDR facilitates creating automated snapshots of HDFS via snapshot policies, which can be used to version and protect your data from accidental deletion with seamless point-in-time recovery.
In the remainder of this post, we will describe those features and other differentiating functionality in Cloudera Enterprise BDR, and characterize aspects of Cloudera’s approach that affect its performance in production. We will also offer recommendations on the efficacy of the different mechanisms and when to use them. Finally, you’ll learn about features and options that can help you accelerate Cloudera Enterprise BDR performance for the most demanding use cases to meet your production requirements.
Considerations for Production Deployments
First, let’s review some of the fundamental concepts behind replication and snapshotting, and how they apply when using Cloudera Enterprise BDR in production deployments.
Continuous and Scheduled Replication
Continuous/log-based replication systems can replicate synchronously or asynchronously with the source of changes, and replication is immediate with little or no latency under most circumstances (when implemented correctly). These systems replicate data in the order of changes on the source cluster; if changes on the destination introduce inconsistency, at best they invalidate and replay all data from the source.
In contrast, scheduled/state-based replication systems operate on a fixed schedule, with a deterministic time-of-day at which replication happens, controllable directly by an operator. These replicate incremental data based on the current state of the source and destination clusters, always bringing the destination to a state consistent with the source cluster’s data at a point in time.
Cloudera Enterprise BDR is an example of the latter: the data moves from your source cluster to the destination cluster when you want it, consuming cluster resources only when your workload profile allows for it. The table below summarizes the main practical differences:
HDFS Replication
HDFS replication replicates HDFS data from a source CDH cluster to a destination CDH cluster, and performs incremental copies of these data in a bandwidth-efficient manner.
This process involves launching a replication process on the cluster, which performs the following steps:
- Copy listing: Generating a list of candidate file paths on the source HDFS that need to be copied to the destination. This step runs on a single host (where the process was launched).
- Copy mapper: After copy listing, an MR job is launched which copies the files from the list generated in the copy listing phase. This step runs across multiple hosts in the cluster as a MapReduce application, using a specified resource pool.
- Reducer: This process runs at the end of the copy mapper phase, and generates aggregate statistics. It also deletes files on the destination cluster that were already deleted on the source.
- Copy committer: Runs after the reducer phase completes and copies additional information related to HDFS permissions, ACLs, and extended attributes.
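As a conceptual illustration of the state-based approach underlying these steps, the copy-listing phase can be modeled as a diff between source and destination listings. This is a toy sketch, not Cloudera’s actual implementation:

```python
# Toy model of state-based copy listing: compare source and destination
# listings and emit what must be copied or deleted. Real BDR works on
# HDFS metadata, not in-memory dicts.

def plan_replication(source, destination):
    """source/destination: dicts mapping file path -> (size, mtime)."""
    to_copy = [p for p, meta in source.items()
               if destination.get(p) != meta]             # new or changed files
    to_delete = [p for p in destination if p not in source]  # removed on source
    return sorted(to_copy), sorted(to_delete)

src = {"/data/a.parquet": (1024, 100), "/data/b.parquet": (2048, 200)}
dst = {"/data/a.parquet": (1024, 100), "/data/old.parquet": (512, 50)}
copy, delete = plan_replication(src, dst)
```

Note that the copy list depends only on the current states, not on a log of operations, which is what makes the mechanism robust to missed runs or out-of-band changes on the destination.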
Hive Replication
Hive and Apache Impala (incubating) do not natively have a replication mechanism to transfer data consistently across clusters. Cloudera Enterprise BDR addresses this gap by providing a solution for consistently replicating data and metadata between clusters.
Hive replication consists of a sequence of steps that perform consistent replication of Hive and Impala metadata and data from the source cluster to the destination, including:
- Export metadata from the source cluster (databases, tables, partitions, indexes, Impala user-defined function definitions).
- Validate metadata for compatibility against the destination cluster.
- Replicate the corresponding HDFS data from source cluster to destination cluster.
- Import metadata into the destination cluster.
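The validation step deserves a closer look: before the import, exported metadata is checked for compatibility with the destination. A minimal, hypothetical sketch of such a check (the function name, table representation, and format list here are illustrative, not Cloudera APIs):

```python
# Hypothetical sketch of the metadata-validation step: verify that every
# source table's storage format is one the destination cluster understands.
# The supported-format set below is an example, not an authoritative list.
SUPPORTED_FORMATS = {"TEXTFILE", "PARQUET", "AVRO", "SEQUENCEFILE"}

def validate_metadata(tables):
    """tables: list of dicts like {"name": ..., "format": ...}.
    Returns the names of tables that would fail to import."""
    return [t["name"] for t in tables
            if t["format"].upper() not in SUPPORTED_FORMATS]

tables = [{"name": "sales", "format": "PARQUET"},
          {"name": "legacy", "format": "RCFILE"}]
bad = validate_metadata(tables)
```

Catching incompatibilities at this stage means the (much more expensive) HDFS data copy and metadata import never start for a replication that cannot succeed.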
Improving Cluster Utilization with Bidirectional Replication
Some other vendors allow you to replicate all of your data only in a single direction between multiple Hadoop-compatible filesystems.
However, with HDFS replication, different files and directories can be replicated independently between clusters, allowing two HDFS clusters to effectively replicate each other’s data. Thus, you can use both clusters in production (with each backing the other up), while still achieving the redundancy of having copies of all data in two or more data centers.
Similarly, with Hive replication, different tables and databases can be independently replicated. This allows two production CDH clusters running Hive to effectively back each other up, as long as the table/database spaces on each cluster are uniquely sectioned off (disjoint sets).
Furthermore, you may replicate at the cadence and throughput you desire, with full API control over when, and at what rate, replication is performed (and which CPU resources are used).
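As an illustration of that API control, a replication schedule can be triggered programmatically against the Cloudera Manager REST API. The sketch below only builds the request URL; the endpoint path, API version, and port are assumptions to verify against the API documentation for your Cloudera Manager release:

```python
# Hedged sketch: construct the Cloudera Manager REST API URL for triggering
# a replication schedule run. Host, cluster, service, schedule id, API
# version, and port 7180 are all example values / assumptions.
import urllib.request

def trigger_url(cm_host, cluster, service, schedule_id, api_version="v11"):
    return (f"http://{cm_host}:7180/api/{api_version}/clusters/{cluster}"
            f"/services/{service}/replications/{schedule_id}/run")

url = trigger_url("cm.example.com", "cluster1", "hdfs", 42)
# To actually start the run (requires CM credentials):
# req = urllib.request.Request(url, method="POST")
# urllib.request.urlopen(req)
```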
HDFS Snapshots and Snapshot Policies
HDFS provides a mechanism for versioning content through HDFS snapshots. BDR allows you to generate snapshots at a frequency of your choosing via snapshot policies.
HDFS snapshots must first be enabled for a directory and its descendants via the HDFS File Browser. (For more information, consult the Cloudera documentation.)
As mentioned in the introduction, snapshot policies provide a fully automated way to version and store history of your HDFS data. This feature allows you to store versions of your most important data over a period of time and to control retention policies of those versions, to help with rapid recovery, and to quickly revert harmful or accidental changes to your data. Snapshot policies allow for creating snapshots at the directory where snapshots have been enabled, at a frequency defined by the user.
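Under the hood, a snapshot policy automates the standard HDFS snapshot operations. The sketch below builds the equivalent CLI invocations as argument lists (run them with `subprocess.run` on a cluster host); the path and snapshot-name prefix are examples, and in practice the policy is configured in Cloudera Manager rather than scripted by hand:

```python
# Illustrative sketch of what a snapshot policy automates, expressed as the
# equivalent HDFS CLI invocations. Path and naming scheme are examples.
from datetime import datetime

def snapshot_commands(path, prefix="daily"):
    name = f"{prefix}-{datetime.now():%Y%m%d}"
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", path],    # one-time: enable snapshots
        ["hdfs", "dfs", "-createSnapshot", path, name],  # what the policy runs on schedule
    ]

cmds = snapshot_commands("/user/hive/warehouse")
```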
Using HDFS Snapshots for Data Versioning
HDFS snapshots provide a convenient way to version and retain the data critical for your Hadoop applications as well as for legal compliance.
- Cloudera recommends configuring snapshot policies for all data directories, including directories where your databases (Hive, Impala, Apache HBase, and so on) are presently managing their data, as well as any user-level data (home directories, outputs of production MR jobs that are critical for business continuity) that needs to be retained in the event of accidental deletion.
- Many organizations have data retention policies that govern the cadence at which snapshots should be taken; Cloudera Manager can take them automatically on that schedule.
- Cloudera Manager provides a convenient mechanism for restoring data from snapshots through the HDFS File Browser UI.
When to Use Snapshots
As explained in the table below, HDFS snapshots and HDFS replication differ significantly with respect to use case and expense.
How Cloudera Enterprise BDR Complements Native Hadoop Tooling
Cloudera has introduced quite a few enhancements on top of Hadoop’s native tooling (such as Distcp) to make Cloudera Enterprise BDR’s HDFS replication suitable for enterprise-class applications. As an added bonus, Cloudera Enterprise BDR development directly influences the feature roadmaps of Hive, HDFS, and Impala by providing an end-to-end integrated experience (for example, by replicating Impala writes seamlessly).
- Consistency guarantees via HDFS snapshots
HDFS snapshots are taken automatically during the copy listing step: snapshottable paths are discovered on the source cluster, and the snapshotted paths then replace the original source paths when copies are performed, ensuring a consistent, stable source for the copy.
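A toy sketch of that path substitution, relying on the fact that HDFS exposes a snapshot’s contents under the read-only `.snapshot` subtree (the snapshot name below is hypothetical; BDR manages real names internally):

```python
# Minimal sketch: rewrite a live HDFS path to its immutable snapshot view,
# so the copy reads from a frozen, consistent source.

def to_snapshot_path(path, snapshot_root, snapshot_name):
    """Rewrite a live path under snapshot_root to its .snapshot view."""
    assert path.startswith(snapshot_root)
    rel = path[len(snapshot_root):].lstrip("/")
    return f"{snapshot_root}/.snapshot/{snapshot_name}/{rel}".rstrip("/")

p = to_snapshot_path("/data/sales/part-0001", "/data/sales", "bdr-run-17")
```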
- Scheduled execution support
Cloudera Manager provides a convenient replication schedule mechanism that models many of the common useful options for replication. There’s no need to configure or manage your own cron-job or use a scheduler.
- Support for multiple Kerberos realms
In secure clusters, Cloudera Enterprise BDR allows the source and destination clusters to be in different, mutually untrusted Kerberos realms. It achieves that by enabling explicit trust to be granted between the peer Cloudera Manager instances managing each cluster, transparently caching the Kerberos credentials from both clusters as part of the replication process. (For more details, see the docs.)
- Constant-space parallelized copy listing
Cloudera supports using up to 128 concurrent threads (defaulting to 20) to perform the copy listing step in parallel, utilizing the availability of multiple cores on modern Hadoop hardware. This significantly reduces the run time for the copy listing phase. (Distcp has a similar enhancement but does not optimize for disk space or memory consumption.)
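The idea behind parallelized copy listing can be sketched with a work queue: worker threads pull directories from a shared queue and push back any subdirectories they discover, so the listing fans out across cores. This is a simplified model demonstrated on the local filesystem, not BDR’s implementation (which additionally bounds disk and memory use):

```python
# Sketch of a parallelized directory listing with a shared work queue.
# A local temporary directory stands in for HDFS here.
import os, queue, tempfile, threading

def parallel_listing(root, workers=4):
    dirs, files, lock = queue.Queue(), [], threading.Lock()
    dirs.put(root)
    pending = [1]  # directories discovered but not yet fully processed

    def worker():
        while True:
            try:
                d = dirs.get(timeout=0.1)
            except queue.Empty:
                if pending[0] == 0:   # nothing in flight anywhere: done
                    return
                continue
            for entry in os.scandir(d):
                if entry.is_dir():
                    with lock:
                        pending[0] += 1
                    dirs.put(entry.path)
                else:
                    with lock:
                        files.append(entry.path)
            with lock:
                pending[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sorted(files)

# Demo on a local temporary directory:
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
for name in ("a.txt", os.path.join("sub", "b.txt")):
    open(os.path.join(root, name), "w").close()
listed = parallel_listing(root)
```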
- Replication between clusters on different Hadoop versions (including different major versions i.e. CDH4/CDH5)
Cloudera Enterprise BDR enables many complex replication scenarios across disparate Hadoop clusters running in a variety of environments with different CDH + Cloudera Manager versions. (Note: restrictions apply, including API-based usage with different CDH versions. Please consult the full list of supported scenarios here.)
- Selective path exclusion
You can easily eliminate copies of low-value data, including temporary data (such as .Trash), from your replica with regular expression-based path exclusions. Copying this data can adversely affect storage and replication performance and affect your RPO/RTO guarantees.
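A sketch of what regex-based exclusion looks like; the patterns below are examples in the spirit of the .Trash case above, and the exact exclusion syntax BDR accepts may differ from plain Python regexes:

```python
# Sketch of regex-based path exclusion. Patterns are illustrative examples.
import re

EXCLUDES = [re.compile(r".*/\.Trash(/.*)?$"),    # per-user trash directories
            re.compile(r".*/_temporary(/.*)?$")]  # in-flight MR job output

def keep(path):
    return not any(p.match(path) for p in EXCLUDES)

paths = ["/user/alice/.Trash/old.csv",
         "/data/sales/part-0001",
         "/data/job/_temporary/0/part-0000"]
kept = [p for p in paths if keep(p)]
```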
New Experimental Features
In addition to the enhancements detailed above, Cloudera Enterprise BDR also contains additional features that provide significant performance benefits in many environments, but are not generally applicable in all contexts. (Important note: Please contact your Cloudera support representative or solution architect prior to using any experimental features!)
- Snapshot Diff-based copy listing
This feature utilizes the difference between two HDFS snapshots (the current and the one created after the last run) to reduce the number of files scanned by the copy listing phase. This feature is useful when a significant portion of the files to be replicated do not change between scheduled replication runs.
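HDFS exposes the same primitive natively through the `hdfs snapshotDiff` command, which reports only the paths that changed between two snapshots. A sketch of the invocation (path and snapshot names are examples):

```python
# Sketch: build the `hdfs snapshotDiff` invocation that reports changes
# between two snapshots of a snapshottable directory.

def snapshot_diff_cmd(path, from_snap, to_snap):
    return ["hdfs", "snapshotDiff", path, from_snap, to_snap]

cmd = snapshot_diff_cmd("/data/sales", "bdr-run-16", "bdr-run-17")
# Output lines are prefixed with +, -, M, or R for
# created / deleted / modified / renamed paths, respectively.
```

Because the diff is computed from NameNode metadata, it avoids rescanning an entire namespace in which most files are unchanged.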
- Dynamic chunking enhancements
Dynamic chunking is used by Cloudera Enterprise BDR to create “chunks” of copy tasks for different mappers to copy data. Two chunking enhancements can be useful in production environments; both allow dynamically creating chunks of different sizes.
- Chunking by size
Files are combined into chunks by size, allowing Cloudera Enterprise BDR to assign chunks to mappers based on the total size of each chunk, which leads to more uniform distribution of work. This feature is useful when a large number of files (more than 1 million, for example) need to be copied and the variation in file sizes is significant.
- Chunking by file count
Cloudera Enterprise BDR allows you to specify files-per-chunk directly, as opposed to pre-tuning a replication to a certain static configuration. This is useful when a large number of files (more than 1 million, for example) need to be copied and you are observing long-tail behavior, with a few mappers in your BDR replication MapReduce job acting as stragglers that delay the entire job.
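The two strategies can be sketched as simple packing functions; this is a toy model of the chunking logic, not BDR’s implementation:

```python
# Toy model of the two dynamic-chunking strategies: pack files into chunks
# either up to a target byte size or up to a fixed file count, so each
# mapper receives comparable work.

def chunk_by_size(files, target_bytes):
    """files: list of (path, size). Greedy packing into size-bounded chunks."""
    chunks, current, current_bytes = [], [], 0
    for path, size in files:
        if current and current_bytes + size > target_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

def chunk_by_count(files, files_per_chunk):
    paths = [p for p, _ in files]
    return [paths[i:i + files_per_chunk]
            for i in range(0, len(paths), files_per_chunk)]

files = [("/d/a", 700), ("/d/b", 300), ("/d/c", 900), ("/d/d", 100)]
by_size = chunk_by_size(files, target_bytes=1000)
by_count = chunk_by_count(files, files_per_chunk=2)
```

With size-based chunks, a directory mixing many tiny files with a few huge ones no longer produces one straggler mapper holding all the bytes.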
- Append-only copies
This feature detects files with appended blocks and copies only the “new” blocks from the source to the destination cluster. It is useful when a significant percentage of files are appended at the tail, and for very large files that receive small amounts of appended data. Note that this feature will not function correctly if blocks were appended using a CRC algorithm that does not match the one used when the file was created.
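A heavily simplified, hypothetical sketch of the detection logic: a destination file is treated as a prefix of the source file when it is shorter and a checksum of the common prefix matches, in which case only the tail needs copying. In-memory byte strings and a whole-prefix hash stand in here for HDFS files and the block-level CRCs that real BDR uses:

```python
# Hedged sketch of append-only copy detection (toy model, not BDR's code).
import hashlib

def append_plan(src_bytes, dst_bytes):
    """Return (offset, tail) if dst is a verified prefix of src, else None."""
    if len(dst_bytes) >= len(src_bytes):
        return None  # nothing appended (or destination diverged in length)
    prefix = src_bytes[:len(dst_bytes)]
    if hashlib.md5(prefix).digest() != hashlib.md5(dst_bytes).digest():
        return None  # checksum mismatch: fall back to a full copy
    return len(dst_bytes), src_bytes[len(dst_bytes):]

plan = append_plan(b"hello world, appended!", b"hello world")
```

The checksum comparison is exactly why mismatched CRC algorithms break the feature, as noted above: a valid prefix would fail verification and force full copies.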
Conclusion
After reading this post, you should have a good understanding of the Cloudera Enterprise BDR design principles, its differentiating functionality for enterprise production environments, and how that functionality complements the native open source tooling in Hadoop. To summarize that differentiating functionality:
- With Cloudera Enterprise BDR, you have full API control over when, and at what rate, replication is performed. You also have tight control over network and CPU resources utilized for replication through the MapReduce framework.
- Cloudera Enterprise BDR is the only Cloudera-supported solution for Hive metadata replication, and the ability to replicate Hive metadata consistently across clusters is critical for most Cloudera customers.
- Cloudera Enterprise BDR supports seamless integration with HDFS transparent encryption with support for replicating data in encryption zones. Additionally, configuring HDFS RPC encryption guarantees that your data is never in cleartext either at the source cluster, or the destination cluster, or while in transfer.
- Cloudera Enterprise BDR supports replication at scale: at hundreds of terabytes under real production workloads.
Jayesh Seshadri is the Technical Lead for Cloudera Backup and Disaster Recovery; he joined Cloudera in 2014.