Cloudera Enterprise Backup and Disaster Recovery (BDR) enables you to replicate data across data centers for disaster recovery scenarios. As a lower-cost alternative to geographic redundancy, or as a means of performing an on-premises-to-cloud migration, BDR can also replicate HDFS and Hive data to and from Amazon S3 or Microsoft Azure Data Lake Store.
Many customers require an automated solution for creating, running, and managing replication schedules, whether to minimize Recovery Point Objectives (RPOs) for late-arriving data or to automate recovery after a disaster. This blog post walks through an example of automating HDFS replication by creating, running, and managing BDR replication schedules with the Cloudera Manager (CM) API.
Prerequisites
We strongly recommend reading the blog post How-to: Automate Your Cluster with Cloudera Manager API first to learn the basics of the Cloudera Manager API.
Additionally, familiarize yourself with the Python client and the full API documentation.
Setting up an HDFS replication schedule
In this blog post, we demonstrate use of the API via a Python script that creates, runs, and manages an HDFS replication schedule. Cloudera BDR replications are pull-based; that is, replication schedules are created on the target environment.
Step 1: Creating a peer
Before creating a replication schedule, you need a peer: a source CM from which data will be pulled. See the documentation for more information.
To create a peer:
```python
#!/usr/bin/env python
from cm_api.api_client import ApiResource
from cm_api.endpoints.types import *

TARGET_CM_HOST = "tgt.cm.cloudera.com"
SOURCE_CM_URL = "http://src.cm.cloudera.com:7180/"

api_root = ApiResource(TARGET_CM_HOST, username="username", password="password")
cm = api_root.get_cloudera_manager()
cm.create_peer("peer1", SOURCE_CM_URL, 'username', 'password')
```
The above sample creates an API root handle, gets a Cloudera Manager instance from it, and then uses that instance to create a peer. To create a peer, you provide the source CM URL and the username/password of an admin user on the source CM.
Step 2: Creating an HDFS replication schedule
Now, you are ready to create an HDFS replication schedule.
To create an HDFS replication schedule:
```python
import datetime

PEER_NAME = 'peer1'
SOURCE_CLUSTER_NAME = 'Cluster-src-1'
SOURCE_HDFS_NAME = 'HDFS-src-1'
TARGET_CLUSTER_NAME = 'Cluster-tgt-1'
TARGET_HDFS_NAME = 'HDFS-tgt-1'
TARGET_YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(TARGET_CLUSTER_NAME).get_service(TARGET_HDFS_NAME)

hdfs_args = ApiHdfsReplicationArguments(None)
hdfs_args.sourceService = ApiServiceRef(None,
                                        peerName=PEER_NAME,
                                        clusterName=SOURCE_CLUSTER_NAME,
                                        serviceName=SOURCE_HDFS_NAME)
hdfs_args.sourcePath = '/src/path/'
hdfs_args.destinationPath = '/target/path'
hdfs_args.mapreduceServiceName = TARGET_YARN_SERVICE

# Creating a schedule with daily frequency.
start = datetime.datetime.now()  # time at which the schedule first triggers
end = start + datetime.timedelta(days=365)  # no triggers after this time
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_args)
```
We create an ApiHdfsReplicationArguments object and populate its important attributes, such as the source path, destination path, and the MapReduce service to use. For the source service, you need to provide the HDFS service name and cluster name on the source CM. See the API documentation for the complete list of ApiHdfsReplicationArguments attributes. We then use hdfs_args to create an HDFS replication schedule.
Step 3: Running an HDFS replication schedule
The replication schedule created in step 2 has a frequency of one day, so it runs at the initial start time every day. You can also run the schedule manually:
```python
cmd = hdfs.trigger_replication_schedule(schedule.id)
```
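The "DAY" frequency with an interval of 1 means each automatic run fires at the start time plus a whole number of days. As a small illustration of that semantics (this helper is hypothetical, not part of the cm_api client), you can compute when a schedule will fire next:

```python
import datetime

def next_trigger(start, interval_days, now):
    """Return the next trigger time at or after `now` for a schedule that
    first fires at `start` and repeats every `interval_days` days."""
    if now <= start:
        return start
    elapsed = now - start
    period = datetime.timedelta(days=interval_days)
    periods = elapsed.days // interval_days
    if elapsed % period:  # partway through an interval: round up
        periods += 1
    return start + periods * period
```

For example, a daily schedule that first fired on January 1 at midnight will, at noon on January 3, next fire at midnight on January 4.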
Step 4: Monitoring the schedule
Once you have a command (cmd), you can wait for it to finish and then fetch the results:
```python
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult
```
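The hdfsResult object carries per-run counters such as numFilesCopied, numFilesSkipped, and numFilesCopyFailed. As a sketch, assuming those attribute names, a small helper can turn a result into a one-line status suitable for logging or alerting:

```python
def summarize_hdfs_result(result):
    """Build a one-line summary from an HDFS replication result.
    `result` is any object exposing the counter attributes below,
    e.g. the hdfsResult of a schedule's most recent run."""
    copied = getattr(result, "numFilesCopied", 0) or 0
    skipped = getattr(result, "numFilesSkipped", 0) or 0
    failed = getattr(result, "numFilesCopyFailed", 0) or 0
    status = "OK" if failed == 0 else "FAILED"
    return "%s: %d copied, %d skipped, %d failed" % (status, copied, skipped, failed)
```

For example, `summarize_hdfs_result(result)` on a clean run might return a string like "OK: 10 copied, 2 skipped, 0 failed".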
Managing the schedules
Once your replication schedules are set up, you can also manage them with the Cloudera Manager API.
Get all replication schedules for a given service:
```python
schs = hdfs.get_replication_schedules()
```
Get a given replication schedule by schedule id for a given service:
```python
sch = hdfs.get_replication_schedule(schedule_id)
```
Delete a given replication schedule by schedule id for a given service:
```python
sch = hdfs.delete_replication_schedule(schedule_id)
```
Update a given replication schedule by schedule id for a given service:
```python
sch.hdfsArguments.removeMissingFiles = True
sch = hdfs.update_replication_schedule(sch.id, sch)
```
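When you manage many schedules, it helps to find the failing ones quickly. A minimal sketch, assuming each schedule exposes a `history` list (newest run first) whose entries have a boolean `success` attribute:

```python
def failed_schedules(schedules):
    """Return the schedules whose most recent run did not succeed.
    Each schedule is expected to expose `history`, a list of past runs
    ordered newest-first, each with a boolean `success` attribute."""
    failed = []
    for sch in schedules:
        history = getattr(sch, "history", None) or []
        if history and not history[0].success:
            failed.append(sch)
    return failed
```

You could feed it the output of `hdfs.get_replication_schedules()` and page an operator only when the returned list is non-empty.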
Setting up cloud replications
BDR also supports HDFS to/from S3 or ADLS replication using the Cloudera Manager API.
To use this functionality, you must first specify the AWS or Azure account. To add the account, do the following:
```python
ACCESS_KEY = "...."
SECRET_KEY = "...."
TYPE_NAME = 'AWS_ACCESS_KEY_AUTH'
account_configs = {'aws_access_key': ACCESS_KEY,
                   'aws_secret_key': SECRET_KEY}
api_root.create_external_account("cloudAccount1", "cloudAccount1", TYPE_NAME,
                                 account_configs=account_configs)
```
When creating a cloud (S3/ADLS) replication schedule, instead of a peer, you need to specify the cloud account that the replication schedule will use.
```python
CLUSTER_NAME = 'Cluster-tgt-1'
HDFS_NAME = 'HDFS-tgt-1'
CLOUD_ACCOUNT = 'cloudAccount1'
YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(CLUSTER_NAME).get_service(HDFS_NAME)

hdfs_cloud_args = ApiHdfsCloudReplicationArguments(None)
hdfs_cloud_args.sourceService = ApiServiceRef(None,
                                              peerName=None,
                                              clusterName=CLUSTER_NAME,
                                              serviceName=HDFS_NAME)
hdfs_cloud_args.sourcePath = '/src/path'
hdfs_cloud_args.destinationPath = 's3a://bucket/target/path/'
hdfs_cloud_args.destinationAccount = CLOUD_ACCOUNT
hdfs_cloud_args.mapreduceServiceName = YARN_SERVICE

# Creating a schedule with daily frequency.
start = datetime.datetime.now()  # time at which the schedule first triggers
end = start + datetime.timedelta(days=365)  # no triggers after this time
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True,
                                            hdfs_cloud_args)
```
In this example, we create an ApiHdfsCloudReplicationArguments object, populate it, and create an HDFS-to-S3 backup schedule. In addition to important attributes such as the source path and destination path, we set destinationAccount to CLOUD_ACCOUNT and leave peerName as None in sourceService, since there is no peer for cloud replication schedules. We then use hdfs_cloud_args to create the HDFS-to-S3 replication schedule. Note that triggering and monitoring a cloud replication schedule works the same way as for the HDFS replication schedule described in the steps above.
Debugging failures during replication
If a replication job fails, you can download replication diagnostic data for the replication command to troubleshoot and diagnose any issues.
The diagnostic data includes all the logs generated, including the MapReduce logs. You can also upload the logs to a support case for further analysis. Collecting a replication diagnostic bundle is available for API v11+ and Cloudera Manager version 5.5+.
To collect a replication diagnostic bundle:
```python
import os
import tempfile

args = {}
resp = hdfs.collect_replication_diagnostic_data(schedule.id, args)

# Download the replication diagnostic bundle to a temp directory.
tmpdir = tempfile.mkdtemp(prefix="support-bundle-replication")
support_bundle_path = os.path.join(tmpdir, "support-bundle.zip")
cm.download_from_url(resp.resultDataUrl, support_bundle_path)
```
Conclusion
Automating BDR replications with the CM API is powerful, and especially useful if you have to manage a large number of replications. You can easily add these calls to an Oozie workflow to trigger and monitor replications. The BDR team's internal testing framework uses this same process, so it is well tested. Happy automating!
How do you get the following details when creating an HDFS replication schedule?

SOURCE_CLUSTER_NAME = 'Cluster-src-1'
SOURCE_HDFS_NAME = 'HDFS-src-1'
TARGET_CLUSTER_NAME = 'Cluster-tgt-1'
TARGET_HDFS_NAME = 'HDFS-tgt-1'
TARGET_YARN_SERVICE = 'YARN-1'

The CM deployment API returns the names of all services:

https://archive.cloudera.com/cm6/6.0.0/generic/jar/cm_api/apidocs/resource_ClouderaManagerResource.html#resource_ClouderaManagerResource_ClouderaManagerResourceV30_getDeployment_GET

You need to run this API on the source as well as on the target to get the names of the services.
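The deployment response lists each cluster with its services, so the names can be pulled out of the JSON directly. A sketch, assuming the standard ApiDeployment layout ("clusters", each with "name" and a "services" list whose entries carry "name" and "type"):

```python
def service_names(deployment, service_type):
    """Map cluster name -> list of service names of the given type
    (e.g. "HDFS", "YARN") from a /cm/deployment JSON response."""
    names = {}
    for cluster in deployment.get("clusters", []):
        matches = [s["name"] for s in cluster.get("services", [])
                   if s.get("type") == service_type]
        names[cluster["name"]] = matches
    return names
```

Running this against the deployment JSON from both the source and the target CM yields the cluster/service name pairs needed for the schedule arguments.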