Cloudera Enterprise Backup and Disaster Recovery (BDR) enables you to replicate data across data centers for disaster recovery scenarios. As a lower-cost alternative to geographic redundancy, or as a means of performing an on-premises-to-cloud migration, BDR can also replicate HDFS and Hive data to and from Amazon S3 or Microsoft Azure Data Lake Store.
Many customers require an automated solution for creating, running, and managing replication schedules, whether to minimize Recovery Point Objectives (RPOs) for late-arriving data or to automate recovery after a disaster. This blog post walks through an example of automating HDFS replication by creating, running, and managing BDR replication schedules with the Cloudera Manager (CM) API.
Prerequisites
We strongly recommend reading the blog post How-to: Automate Your Cluster with Cloudera Manager API first to learn the basics of the Cloudera Manager API.
Additionally, familiarize yourself with the Python client and the full API documentation.
Setting up an HDFS replication schedule
In this blog post, we demonstrate use of the API via a Python script that creates, runs, and manages an HDFS replication schedule. Cloudera BDR replications are pull-based; that is, replication schedules are created on the target environment.
Step 1: Creating a peer
Before creating a replication schedule, you need a peer: a source CM from which data will be pulled. See the documentation for more information.
To create a peer:
```python
#!/usr/bin/env python
from cm_api.api_client import ApiResource
from cm_api.endpoints.types import *

TARGET_CM_HOST = "tgt.cm.cloudera.com"
SOURCE_CM_URL = "http://src.cm.cloudera.com:7180/"

api_root = ApiResource(TARGET_CM_HOST, username="username", password="password")
cm = api_root.get_cloudera_manager()
cm.create_peer("peer1", SOURCE_CM_URL, 'username', 'password')
```
The above sample creates an API root handle, gets a Cloudera Manager instance from it, and then uses that instance to create a peer. To create a peer, you provide the source CM URL and the username/password of an admin user on the source CM.
Step 2: Creating an HDFS replication schedule
Now, you are ready to create an HDFS replication schedule.
To create an HDFS replication schedule:
```python
import datetime

PEER_NAME = 'peer1'
SOURCE_CLUSTER_NAME = 'Cluster-src-1'
SOURCE_HDFS_NAME = 'HDFS-src-1'
TARGET_CLUSTER_NAME = 'Cluster-tgt-1'
TARGET_HDFS_NAME = 'HDFS-tgt-1'
TARGET_YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(TARGET_CLUSTER_NAME).get_service(TARGET_HDFS_NAME)

hdfs_args = ApiHdfsReplicationArguments(None)
hdfs_args.sourceService = ApiServiceRef(None,
                                        peerName=PEER_NAME,
                                        clusterName=SOURCE_CLUSTER_NAME,
                                        serviceName=SOURCE_HDFS_NAME)
hdfs_args.sourcePath = '/src/path/'
hdfs_args.destinationPath = '/target/path'
hdfs_args.mapreduceServiceName = TARGET_YARN_SERVICE

# Creating a schedule with daily frequency.
start = datetime.datetime.now()  # time at which the schedule first triggers
end = start + datetime.timedelta(days=365)  # no triggers after this time
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_args)
```
We create an ApiHdfsReplicationArguments object and populate its important attributes, such as the source path, destination path, and the MapReduce service to use. For the source service, you need to provide the HDFS service name and cluster name on the source CM. See the API documentation for the complete list of ApiHdfsReplicationArguments attributes. We then use hdfs_args to create an HDFS replication schedule.
Step 3: Running an HDFS replication schedule
The replication schedule created in step 2 has a frequency of one day, so it runs at the initial start time every day. You can also run the schedule manually:
```python
cmd = hdfs.trigger_replication_schedule(schedule.id)
```
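The "DAY" frequency with an interval of 1 means each automatic run fires at the start time plus a whole number of days. As a small illustration of that semantics (this helper is hypothetical, not part of the cm_api client), you can compute when a schedule will fire next:

```python
import datetime

def next_trigger(start, interval_days, now):
    """Return the next trigger time at or after `now` for a schedule that
    first fires at `start` and repeats every `interval_days` days."""
    if now <= start:
        return start
    elapsed = now - start
    period = datetime.timedelta(days=interval_days)
    periods = elapsed.days // interval_days
    if elapsed % period:  # partway through an interval: round up
        periods += 1
    return start + periods * period
```

For example, a daily schedule that first fired on January 1 at midnight will, at noon on January 3, next fire at midnight on January 4.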
Step 4: Monitoring the schedule
Once you have a command (cmd), you can wait for it to finish and then fetch the results:
```python
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult
```
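The hdfsResult object carries per-run counters such as numFilesCopied, numFilesSkipped, and numFilesCopyFailed. As a sketch, assuming those attribute names, a small helper can turn a result into a one-line status suitable for logging or alerting:

```python
def summarize_hdfs_result(result):
    """Build a one-line summary from an HDFS replication result.
    `result` is any object exposing the counter attributes below,
    e.g. the hdfsResult of a schedule's most recent run."""
    copied = getattr(result, "numFilesCopied", 0) or 0
    skipped = getattr(result, "numFilesSkipped", 0) or 0
    failed = getattr(result, "numFilesCopyFailed", 0) or 0
    status = "OK" if failed == 0 else "FAILED"
    return "%s: %d copied, %d skipped, %d failed" % (status, copied, skipped, failed)
```

For example, `summarize_hdfs_result(result)` on a clean run might return a string like "OK: 10 copied, 2 skipped, 0 failed".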
Managing the schedules
Once your replication schedules are set up, you can also manage them with the Cloudera Manager API.
Get all replication schedules for a given service:
```python
schs = hdfs.get_replication_schedules()
```
Get a given replication schedule by schedule id for a given service:
```python
sch = hdfs.get_replication_schedule(schedule_id)
```
Delete a given replication schedule by schedule id for a given service:
```python
sch = hdfs.delete_replication_schedule(schedule_id)
```
Update a given replication schedule by schedule id for a given service:
```python
sch.hdfsArguments.removeMissingFiles = True
sch = hdfs.update_replication_schedule(sch.id, sch)
```
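When you manage many schedules, it helps to find the failing ones quickly. A minimal sketch, assuming each schedule exposes a `history` list (newest run first) whose entries have a boolean `success` attribute:

```python
def failed_schedules(schedules):
    """Return the schedules whose most recent run did not succeed.
    Each schedule is expected to expose `history`, a list of past runs
    ordered newest-first, each with a boolean `success` attribute."""
    failed = []
    for sch in schedules:
        history = getattr(sch, "history", None) or []
        if history and not history[0].success:
            failed.append(sch)
    return failed
```

You could feed it the output of `hdfs.get_replication_schedules()` and page an operator only when the returned list is non-empty.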
Setting up cloud replications
BDR also supports HDFS to/from S3 or ADLS replication using the Cloudera Manager API.
To use this functionality, you must first specify the AWS or Azure account. To add the account, do the following:
```python
ACCESS_KEY = "...."
SECRET_KEY = "...."
TYPE_NAME = 'AWS_ACCESS_KEY_AUTH'
account_configs = {'aws_access_key': ACCESS_KEY,
                   'aws_secret_key': SECRET_KEY}
api_root.create_external_account("cloudAccount1", "cloudAccount1", TYPE_NAME,
                                 account_configs=account_configs)
```
When creating a cloud (S3/ADLS) replication schedule, instead of a peer, you need to specify the cloud account that the replication schedule will use.
```python
CLUSTER_NAME = 'Cluster-tgt-1'
HDFS_NAME = 'HDFS-tgt-1'
CLOUD_ACCOUNT = 'cloudAccount1'
YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(CLUSTER_NAME).get_service(HDFS_NAME)

hdfs_cloud_args = ApiHdfsCloudReplicationArguments(None)
hdfs_cloud_args.sourceService = ApiServiceRef(None,
                                              peerName=None,
                                              clusterName=CLUSTER_NAME,
                                              serviceName=HDFS_NAME)
hdfs_cloud_args.sourcePath = '/src/path'
hdfs_cloud_args.destinationPath = 's3a://bucket/target/path/'
hdfs_cloud_args.destinationAccount = CLOUD_ACCOUNT
hdfs_cloud_args.mapreduceServiceName = YARN_SERVICE

# Creating a schedule with daily frequency.
start = datetime.datetime.now()  # time at which the schedule first triggers
end = start + datetime.timedelta(days=365)  # no triggers after this time
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True,
                                            hdfs_cloud_args)
```
In this example, we create an ApiHdfsCloudReplicationArguments object, populate it, and create an HDFS-to-S3 backup schedule. In addition to important attributes such as the source path and destination path, we set destinationAccount to CLOUD_ACCOUNT and leave peerName as None in sourceService, since there is no peer for cloud replication schedules. We then use hdfs_cloud_args to create the HDFS-to-S3 replication schedule. Note that triggering and monitoring a cloud replication schedule works the same way as for the HDFS replication schedule described in the steps above.
Debugging failures during replication
If a replication job fails, you can download replication diagnostic data for the replication command to troubleshoot and diagnose any issues.
The diagnostic data includes all the logs generated, including the MapReduce logs. You can also upload the logs to a support case for further analysis. Collecting a replication diagnostic bundle is available for API v11+ and Cloudera Manager version 5.5+.
To collect a replication diagnostic bundle:
```python
import os
import tempfile

args = {}
resp = hdfs.collect_replication_diagnostic_data(schedule.id, args)

# Download the replication diagnostic bundle to a temp directory.
tmpdir = tempfile.mkdtemp(prefix="support-bundle-replication")
support_bundle_path = os.path.join(tmpdir, "support-bundle.zip")
cm.download_from_url(resp.resultDataUrl, support_bundle_path)
```
Conclusion
Automating BDR replications with the CM API is powerful, and especially useful if you have to manage a large number of replications. You can easily add these calls to an Oozie workflow to trigger and monitor replications. The BDR team's internal testing framework uses this same process, so it is well tested. Happy automating!
How do you get the following details when creating an HDFS replication schedule?

SOURCE_CLUSTER_NAME = 'Cluster-src-1'
SOURCE_HDFS_NAME = 'HDFS-src-1'
TARGET_CLUSTER_NAME = 'Cluster-tgt-1'
TARGET_HDFS_NAME = 'HDFS-tgt-1'
TARGET_YARN_SERVICE = 'YARN-1'

The CM deployment API returns the names of all services:

https://archive.cloudera.com/cm6/6.0.0/generic/jar/cm_api/apidocs/resource_ClouderaManagerResource.html#resource_ClouderaManagerResource_ClouderaManagerResourceV30_getDeployment_GET

You need to run this API on the source as well as on the target to get the names of the services.
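The deployment response lists each cluster with its services, so the names can be pulled out of the JSON directly. A sketch, assuming the standard ApiDeployment layout ("clusters", each with "name" and a "services" list whose entries carry "name" and "type"):

```python
def service_names(deployment, service_type):
    """Map cluster name -> list of service names of the given type
    (e.g. "HDFS", "YARN") from a /cm/deployment JSON response."""
    names = {}
    for cluster in deployment.get("clusters", []):
        matches = [s["name"] for s in cluster.get("services", [])
                   if s.get("type") == service_type]
        names[cluster["name"]] = matches
    return names
```

Running this against the deployment JSON from both the source and the target CM yields the cluster/service name pairs needed for the schedule arguments.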