In this article we discuss the various methods to replicate HBase data and explore why Replication Manager is the best choice for the job with the help of a use case.
Cloudera Replication Manager is a key Cloudera Data Platform (CDP) service, designed to copy and migrate data between environments and infrastructures across hybrid clouds. The service provides simple, easy-to-use, and feature-rich data movement capability to deliver data and metadata where it is needed, and has secure data backup and disaster recovery functionality.
Apache HBase is a scalable, distributed, column-oriented data store that provides real-time read/write random access to very large datasets hosted on Hadoop Distributed File System (HDFS). In CDP’s Operational Database (COD) you use HBase as a data store with HDFS and/or Amazon S3/Azure Blob Filesystem (ABFS) providing the storage infrastructure.
What are the different methods available to replicate HBase data?
You can use one of the following methods to replicate HBase data based on your requirements:
| Method | Description | When to use |
|---|---|---|
| HBase replication policies | In this method, you create HBase replication policies in Replication Manager to migrate HBase data. See the support matrix for the minimum supported versions of source and target cluster combinations. | When the source cluster and target cluster meet the requirements of the supported use cases. See caveats, and see the support matrix for more information. |
| Operational Database Replication plugin | The plugin allows you to migrate your HBase data from CDH or HDP to COD in CDP Public Cloud. In this method, you prepare the data for migration, and then set up the replication plugin to use a snapshot to migrate your data. See the support matrix for the minimum supported versions of source and target cluster combinations. | For cluster versions that Replication Manager does not support. For information about use cases that are not supported by Replication Manager, see the support matrix. |
| Replication-related HBase commands | High-level steps include running the replication-related HBase commands and, optionally, verifying whether the replication operation is successful and the validity of the replicated data. Important: It is recommended that you use Replication Manager, and the replication plugin for unsupported cluster versions, to replicate HBase data. | When the HBase data is in an HBase cluster and you want to move it to another HBase cluster. |
HBase is used across domains and enterprises for a wide variety of business use cases, including disaster recovery, which makes it an important part of maintaining business continuity. Replication Manager provides HBase replication policies that help with disaster recovery, so you can be assured that your data is backed up as it is generated and that your business analytics and other use cases run on the latest data. Although you can use HBase commands or the Operational Database Replication plugin to replicate data, those are not feasible solutions in the long run.
HBase replication policies also provide an option called Perform Initial Snapshot. When you choose this option, the existing data and the data generated after policy creation get replicated. Otherwise, the policy replicates only the data generated after policy creation. You can skip the initial snapshot when there is a space crunch on your backup cluster, or when you have already backed up the existing data.
You can replicate HBase data from a source classic cluster (CDH or CDP Private Cloud Base cluster), COD, or Data Hub to a target Data Hub or COD cluster using Replication Manager.
Example use case
This use case discusses how using Replication Manager to replicate HBase data from a CDH cluster to a CDP Operational Database (COD) cluster assures a low-cost and low-maintenance strategy in the long run as compared to the other methods. It also captures some observations and key takeaways that might help you while implementing similar scenarios.
For example: You are using a CDH cluster as the disaster recovery (DR) cluster for HBase data. You now want to use COD service on CDP as your DR cluster and want to migrate the data to it. You have around 6,000 tables to migrate from the CDH cluster to the COD cluster.
Before you initiate this task, you want to understand the best approach that will assure you a low cost and low maintenance implementation of this use case in the long run. You also want to understand the estimated time to complete this task, and the benefits of using COD.
The following issues might appear if you try to migrate all 6,000 tables using a single HBase replication policy:

- If the replication of one table in the policy fails, you might have to create another policy and start the process all over again, because the previously copied files get overwritten, resulting in lost time and network bandwidth.
- Replication can take a significant amount of time to complete, potentially weeks depending on the amount of data.
- Replicating the accumulated data might consume additional time. Accumulated data is the new or changed data generated on the source cluster after the replication policy starts. For example, HBase replication policies use HBase snapshots to replicate HBase data, so a policy created at timestamp T1 replicates using the snapshot taken at T1; any data generated in the source cluster after T1 is accumulated data.
The best approach to resolve these issues is to replicate the data incrementally, in batches of, for example, 500 tables at a time. Replicating in small batches ensures that the source cluster stays healthy. COD uses S3, which is a cost-saving option compared to other storage available on the cloud. Replication Manager not only ensures that all the existing HBase data in a cluster is replicated, but also that accumulated data is replicated automatically without user intervention. This yields reliable data replication and lowers maintenance requirements.
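The batching arithmetic behind this approach can be sketched as follows. The table names here are hypothetical placeholders; only the 6,000-table total and the 500-table batch size come from this use case:

```python
def plan_batches(tables, batch_size=500):
    """Split a list of HBase table names into fixed-size batches.

    Each batch is replicated by its own HBase replication policy,
    one batch at a time, rather than by a single policy for all tables.
    """
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

# Hypothetical names standing in for the ~6,000 tables in the use case.
tables = [f"table_{n:04d}" for n in range(6000)]
batches = plan_batches(tables)
print(len(batches), len(batches[0]))  # 12 500
```

Keeping each batch the same size makes the per-batch replication time predictable, which matters when you schedule the next batch only after the previous one has caught up.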
The following steps explain the incremental approach in detail:
1- You create an HBase replication policy for the first 500 tables.
   Internally, Replication Manager performs the following steps:
   - Adds the HBase peer to the source cluster at T1 in a disabled state.
   - Simultaneously creates a snapshot at T1 and copies it to the target cluster. HBase replication policies use snapshots to replicate HBase data; this step ensures that all data existing prior to T1 is replicated.
   - Restores the snapshot as the table on the target cluster. This step ensures that the data up to T1 is replicated to the target cluster.
   - Deletes the snapshot. Replication Manager performs this step after the replication completes successfully.
   - Enables the tables' replication scope for replication.
   - Enables the peer. This step ensures that the data accumulated after T1 is completely replicated.
   Important: After all the accumulated data is migrated, Replication Manager continues to replicate new/changed data in this batch of tables automatically.
2- After all the existing data and accumulated data of the first batch of tables is migrated successfully, create another HBase replication policy to replicate the next batch of 500 tables.
3- Continue this process until all the tables are replicated successfully.
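The per-batch sequence above can be summarized as an ordered list. Everything in this sketch is illustrative; the strings describe what Replication Manager does internally and are not a real Replication Manager or HBase API:

```python
# Ordered internal steps for one batch, per the description above.
STEPS = [
    "add HBase peer on the source in a disabled state (T1)",
    "create a snapshot at T1 and export it to the target",
    "restore the snapshot as tables on the target",   # data up to T1
    "delete the snapshot after a successful restore",
    "enable the tables' replication scope",
    "enable the peer",                                # post-T1 (accumulated) data flows
]

# The key invariant: the peer is enabled only after the T1 snapshot is
# restored, so accumulated edits are never applied before the baseline
# data exists on the target.
assert STEPS.index("restore the snapshot as tables on the target") \
     < STEPS.index("enable the peer")
print(len(STEPS))  # 6
```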
In an ideal scenario, replicating a 500-table batch of 6 TB might take around four to five hours, and replicating the accumulated data might take another 30 minutes to one and a half hours, depending on the rate at which data is being generated on the source cluster. Therefore, this approach uses 12 batches and around four to five days to replicate all 6,000 tables to COD.
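Putting numbers on that estimate, as a back-of-the-envelope sketch: the per-batch durations are the ranges quoted above, and the four-to-five-day total additionally reflects time spent between batches (monitoring one batch before starting the next).

```python
batches = 12
snapshot_hours = (4.0, 5.0)      # per-batch replication of existing data
accumulated_hours = (0.5, 1.5)   # per-batch catch-up of accumulated data

# Pure replication time if the batches ran strictly back to back.
low = batches * (snapshot_hours[0] + accumulated_hours[0])
high = batches * (snapshot_hours[1] + accumulated_hours[1])
print(f"{low:.0f}-{high:.0f} hours of replication time")  # 54-78 hours
```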
The following cluster specifications were used for this use case:

- Primary cluster: CDH 5.16.2 using CM 7.4.3, located in an on-premises Cloudera data center, with:
  - A 10-node cluster (a maximum of 10 workers)
  - 6 TB of disk per node
  - 1,000 tables (12.5 TB in size, 18,000 regions)
- Disaster recovery (DR) cluster: CDP Operational Database (COD) 7.2.14 using CM 7.5.3 on Amazon S3, with:
  - 5 workers (m5.2xlarge Amazon EC2 instances)
  - 0.5 TB of disk per node
  - us-west region
  - No multi-AZ deployment
  - No ephemeral storage
Perform the following steps to complete the replication job for this use case:

1- In the Management Console, add the CDH cluster as a classic cluster. This step assumes that you have a valid registered AWS environment in CDP Public Cloud.
2- In Operational Database, create a COD cluster. The cluster uses Amazon S3 as its cloud object storage.
3- In Replication Manager, create an HBase replication policy and specify the CDH cluster and the COD cluster as the source and destination clusters, respectively.
The observed time to complete replication was approximately four hours for 500 tables, with 6 TB of data in each batch. The job used a parallelism factor of 100 and 1,800 YARN containers.
The estimated time taken by Replication Manager to complete its internal tasks while replicating a batch of 500 tables in this use case was:
- ~160 minutes to complete tasks on the source cluster, which includes creating and exporting snapshots (tasks run in parallel) and altering table column families.
- ~77 minutes to complete the tasks on the target cluster, which includes creating, restoring, and deleting snapshots (tasks run in parallel).
Note that these statistics are not visible or available to a Replication Manager user. You can only view the overall total time spent by the replication policy on the Replication Policies page.
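These two figures roughly add up to the observed four-hour wall-clock time, and imply an approximate per-batch data rate. This is a rough sketch; 1 TB is taken as 10^6 MB for simplicity:

```python
source_minutes = 160   # snapshot create/export and column-family alters
target_minutes = 77    # snapshot restore/delete on the target
total_minutes = source_minutes + target_minutes
print(total_minutes)   # 237 minutes, i.e. roughly the observed ~4 hours

batch_tb = 6
mb_per_tb = 1_000_000  # decimal TB for a rough estimate
throughput_mb_per_sec = batch_tb * mb_per_tb / (total_minutes * 60)
print(round(throughput_mb_per_sec))  # ~422, under the 700-800 MB/sec network ceiling
```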
The following table lists the record size in the replicated HBase table, the COD size in nodes, the projected write throughput (rows/second) of COD, the data written per day, and the replication throughput (rows/second) of Replication Manager for a full-scale COD DR cluster:

| Record size | COD size in nodes | Write throughput (rows/sec) | Data written/day | Replication throughput (rows/sec) |
|---|---|---|---|---|
Observations and key takeaways
- SSDs (gp2) did not have much impact on write workload performance as compared to HDDs (standard magnetic).
- The network/S3 throughput peaked at 700-800 MB/sec even with increased parallelism, which could be a bottleneck for overall throughput.
- Replication Manager works efficiently to set up replication of 6,000 tables using the incremental approach.
- In this use case, 125 nodes wrote approximately 70 TB of data in a day. The write throughput of the COD cluster was not affected by the latency of S3 (the cloud object storage of COD), and using S3 resulted in at least 30% cost savings by avoiding instances that require a large number of disks.
- The time to operationalize the database in another form factor, such as high-performance storage instead of S3, was approximately four and a half hours. This includes setting up the new COD cluster with high-performance storage and copying 60 TB of data from S3 to HDFS.
With the right strategy, Replication Manager ensures that data replication is efficient and reliable across multiple use cases. This use case shows how replicating data in smaller batches with Replication Manager saves time and resources, and makes troubleshooting faster if any issue crops up. Using COD on S3 also led to higher cost savings, and Replication Manager takes care of the initial setup with a few clicks and ensures that new/changed data is replicated automatically without any user intervention. This is not feasible with the Operational Database Replication plugin or the other methods, because they involve several manual steps to migrate HBase data and do not replicate accumulated data automatically.
Therefore, Replication Manager can be your go-to replication tool whenever you need to replicate or migrate data in your CDH or CDP environments: it is not only easy to use, it also ensures efficiency and significantly lowers operational costs.
If you have more questions, visit our documentation portal for information. If you need help to get started, contact our Cloudera Support team.
Special Acknowledgements: Asha Kadam, Andras Piros