At Cloudera, we have long believed that automation is key to delivering secure, ready-to-use, and well-configured platforms. Hence, we were pleased to announce the public release of Ansible-based automation to deploy CDP Private Cloud Base. By automating cluster deployment this way, you reduce the risk of misconfiguration, promote consistent deployments across multiple clusters in your environment, and help to deliver business value more quickly.
This blog walks through how to deploy a secured Private Cloud Base cluster with a minimum of human interaction.
“The most powerful tool we have as developers is automation.” — Scott Hanselman
Once we’ve set up the configuration files and automation environment, Ansible will build and configure the cluster without intervention. In the following sections, we will cover:
- Setting up the automation environment (the “runner”).
- Configuring Credentials (or accepting a trial licence).
- Defining the cluster you want built.
- Setting up your inventory of hosts (dynamic inventories or static inventories).
- Running the playbook.
We have two options for setting up your execution environment (also known as the “runner”). We can run the quickstart environment, which is a Docker container we can run locally or within a pipeline, or we can install the dependencies on a Linux machine in our data center infrastructure. The Docker container includes all the required dependencies for local execution, and works on Linux, Windows or OSX.
If we’re running in Docker, we can simply download and run the quickstart.sh script, which will launch our Docker container for us:
wget https://raw.githubusercontent.com/cloudera-labs/cloudera-deploy/main/quickstart.sh && chmod +x quickstart.sh && ./quickstart.sh
Otherwise, if we’re running outside of Docker, we will clone the cloudera-deploy git repository and then run the centos7-init.sh script, which will install Ansible 2.10, the Ansible Galaxy collections, and their dependencies:
yum install git -y
git clone https://github.com/cloudera-labs/cloudera-deploy.git /opt/cloudera-deploy
cd /opt/cloudera-deploy && git checkout devel && chmod u+x centos7-init.sh && ./centos7-init.sh
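Before moving on, it can be worth confirming that the runner actually has the tools the later steps rely on. The following is a minimal sketch, not part of cloudera-deploy: the function name missing_tools is hypothetical, and the tool list in the example comment is an assumption based on the commands used in this post.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 when all commands are present, 1 otherwise.
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (assumed tool list for a non-Docker runner):
#   missing_tools git ansible-playbook ansible-galaxy || echo "runner is not ready"
```

If the function prints nothing, every listed tool was found on the PATH.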
You can run without any credentials, but ideally we’ll set up a profile file that can contain paths to cloud credentials (if deploying on public cloud) and to your CDP license file (if you want to use one).
Copy the template profile.yml file to ~/.config/cloudera-deploy/profiles/default:
mkdir -p ~/.config/cloudera-deploy/profiles
cp /opt/cloudera-deploy/profile.yml ~/.config/cloudera-deploy/profiles/default
In this file (~/.config/cloudera-deploy/profiles/default), you can then specify a public/private keypair if required and your CDP licence file, plus a default password for Cloudera Manager:
admin_password: "MySuperSecretPassword1!"
license_file: "~/.cdp/my_cloudera_license_2021.txt"
public_key_file: "~/.ssh/mykey.pub"
private_key_file: "~/.ssh/mykey.pem"
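A mistyped path in the profile tends to surface only once the playbook is already running, so a quick check up front can save time. The sketch below is an illustration only: missing_profile_files is a hypothetical helper, and it assumes the flat key: "value" layout shown above rather than parsing YAML in general.

```shell
# Print each "*_file" entry in a profile whose path does not exist on disk.
# Assumes flat `key: "value"` lines as in the profile above; not a YAML parser.
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
missing_profile_files() {
    grep -E '^[a-z_]+_file:' "$1" | tr -d "\"'" |
    while IFS=': ' read -r key path; do
        # Expand a leading "~/" the way the shell would.
        case "$path" in
            "~/"*) path="$HOME/${path#??}" ;;
        esac
        [ -e "$path" ] || echo "$key: $path"
    done
}

# Example:
#   missing_profile_files ~/.config/cloudera-deploy/profiles/default
```

No output means every referenced licence or key file was found.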
The following method describes deploying CDP Private Cloud onto physical or virtual machines. In some instances (perhaps development environments) it may be desirable to deploy CDP Private Cloud on EC2, Azure VMs, or GCE. Note, however, that there are significant cost, performance, and agility advantages to using CDP Public Cloud for any public-cloud workloads. This automation can create the requisite VMs to run your cluster on.
If you are running in GCE, you can set up your GCP credentials in your profile file. If you are using VMs in Azure or AWS, the default credentials will be collected automatically from your local user profile (the .aws and .azure directories). We suggest you set the default infra_type in your profile file to match your preferred default public cloud infrastructure credentials, and check that your default credentials point to the correct tenants.
# infra_type can be omitted, "aws", "azure" or "gcp". Defaults to aws
infra_type: gcp
gcloud_credential_file: '~/.config/gcloud/mycreds.json'
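To see which infrastructure type a given profile would select, a small check along these lines may help. It is a sketch under assumptions: profile_infra_type is a hypothetical helper, the accepted values are taken from the comment above, and it only understands a single top-level infra_type line.

```shell
# Print the infra_type a profile selects: the configured value, or "aws" when
# the key is absent or commented out. Fails on a value outside aws/azure/gcp.
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
profile_infra_type() {
    value=$(awk '/^infra_type:/ { print $2; exit }' "$1" | tr -d "\"'")
    case "${value:-aws}" in
        aws|azure|gcp) echo "${value:-aws}" ;;
        *) echo "unsupported infra_type: $value" >&2; return 1 ;;
    esac
}

# Example:
#   profile_infra_type ~/.config/cloudera-deploy/profiles/default
```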
For CDP Private Cloud clusters, the cluster definition directory is where we are going to define:
- Cloudera Manager and Cluster versions
- Which services should run on the cluster
- Any configuration settings we wish to change from the defaults
- Any supporting infrastructure we need: internal or external certificate authorities, Kerberos Key Distribution Centers, provided or provisioned RDBMS (Postgres, MariaDB, or Oracle), parcel repositories, etc
- Which security features we wish to enable – Kerberos, TLS, HDFS Transparent Data Encryption, LDAP integration, etc.
The overriding principle is that you should never need to amend the playbooks or the collections – everything that you wish to customise should be customisable through the definition.
Our cluster definition will consist of three parts:
- application.yml – this is just a placeholder file for any Ansible tasks you may wish to execute after deployment
- definition.yml – this holds our cluster definition content
- inventory_template.ini – a traditional static, or modern dynamic, ‘Ansible Inventory’ of hosts to deploy to.
There is a basic definition file provided in the cloudera-deploy repository; however this only includes the HDFS, YARN, and Zookeeper services.
Let’s start by creating a definition directory:
mkdir /opt/cloudera-deploy/definitions
cp -r /opt/cloudera-deploy/examples/sandbox /opt/cloudera-deploy/definitions/mydefinition
echo yes | cp /opt/cloudera-deploy/roles/cloudera_deploy/defaults/basic_cluster.yml /opt/cloudera-deploy/definitions/mydefinition/definition.yml
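After copying, a quick way to confirm the directory holds the three parts described above is a check like the following. This is a sketch only: missing_definition_parts is a hypothetical helper, and if you plan to use a static inventory you may have an inventory_static.ini in place of the template.

```shell
# Print which of the three expected definition files are absent from a directory.
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
missing_definition_parts() {
    for part in application.yml definition.yml inventory_template.ini; do
        [ -f "$1/$part" ] || echo "$part"
    done
}

# Example:
#   missing_definition_parts /opt/cloudera-deploy/definitions/mydefinition
```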
We’ll populate the following sections in the definition.yml file.
First, we’ll set the Cloudera Manager version. Ideally we’ll use the latest version: 7.3.1 at the time of writing if you are using the Cloudera license file referenced in your profile (explained earlier), although 7.1.4 is the default if you’re using a trial license.
Next we’ll define our cluster:
clusters:
  - name: Data Engineering Cluster
    services: [ATLAS, DAS, HBASE, HDFS, HIVE, HIVE_ON_TEZ, HUE, IMPALA, INFRA_SOLR, KAFKA, OOZIE, RANGER, QUEUEMANAGER, SOLR, SPARK_ON_YARN, TEZ, YARN, ZOOKEEPER]
    repositories:
      # For licensed clusters:
      - https://archive.cloudera.com/p/cdh7/126.96.36.199/parcels/
      # For trial clusters uncomment this line:
      # - https://archive.cloudera.com/cdh7/7.1.4/parcels/
    security:
      kerberos: true
    configs: …
    host_templates: …
You can customise the list of services from the list of available services and roles defined in the collection itself. You can include in this section services such as Apache Spark 3, Apache NiFi or Apache Flink although these will require configuration of separate CSDs.
We can specify additional configs, grouped into roles, or for service-wide configs we can use the dummy role “SERVICEWIDE”. Most configuration settings are set to sensible defaults, either by Cloudera Manager or the playbook itself, so you only need to set those which are specific to your environment.
configs:
  ATLAS:
    ATLAS_SERVER:
      atlas_authentication_method_file: true
      atlas_admin_password: password123
      atlas_admin_username: admin
  HDFS:
    DATANODE:
      dfs_data_dir_list: /dfs/dn
    NAMENODE:
      dfs_name_dir_list: /dfs/nn
    SECONDARYNAMENODE:
      fs_checkpoint_dir_list: /dfs/snn
  IMPALA:
    IMPALAD:
      enable_audit_event_log: true
      scratch_dirs: /tmp/impala
  YARN:
    RESOURCEMANAGER:
      yarn_scheduler_maximum_allocation_mb: 4096
      yarn_scheduler_maximum_allocation_vcores: 4
    NODEMANAGER:
      yarn_nodemanager_resource_memory_mb: 4096
      yarn_nodemanager_resource_cpu_vcores: 4
      yarn_nodemanager_local_dirs: /tmp/nm
      yarn_nodemanager_log_dirs: /var/log/nm
    GATEWAY:
      mapred_submit_replication: 3
      mapred_reduce_tasks: 6
  ZOOKEEPER:
    SERVICEWIDE:
      zookeeper_datadir_autocreate: true
In the host template section we specify which roles are assigned to each host template. In this simple cluster we only have two host templates: Master1 and Workers. For more complex clusters you may wish to have more host templates. In the next section we will explain how these host templates are applied to cluster nodes.
host_templates:
  Master1:
    ATLAS: [ATLAS_SERVER]
    DAS: [DAS_EVENT_PROCESSOR, DAS_WEBAPP]
    HBASE: [MASTER, HBASERESTSERVER, HBASETHRIFTSERVER]
    HDFS: [NAMENODE, SECONDARYNAMENODE, HTTPFS]
    HIVE: [HIVEMETASTORE, GATEWAY]
    HIVE_ON_TEZ: [HIVESERVER2]
    HUE: [HUE_SERVER, HUE_LOAD_BALANCER]
    IMPALA: [STATESTORE, CATALOGSERVER]
    INFRA_SOLR: [SOLR_SERVER]
    OOZIE: [OOZIE_SERVER]
    QUEUEMANAGER: [QUEUEMANAGER_STORE, QUEUEMANAGER_WEBAPP]
    RANGER: [RANGER_ADMIN, RANGER_TAGSYNC, RANGER_USERSYNC]
    SPARK_ON_YARN: [SPARK_YARN_HISTORY_SERVER]
    TEZ: [GATEWAY]
    YARN: [RESOURCEMANAGER, JOBHISTORY]
    ZOOKEEPER: [SERVER]
  Workers:
    HBASE: [REGIONSERVER]
    HDFS: [DATANODE]
    HIVE: [GATEWAY]
    HIVE_ON_TEZ: [GATEWAY]
    IMPALA: [IMPALAD]
    KAFKA: [KAFKA_BROKER]
    SOLR: [SOLR_SERVER]
    SPARK_ON_YARN: [GATEWAY]
    TEZ: [GATEWAY]
    YARN: [NODEMANAGER]
Finally we will add any Cloudera Manager settings required, including any CSDs that might need to be installed for non-CDP services.
mgmt:
  name: Cloudera Management Service
  services: [ALERTPUBLISHER, EVENTSERVER, HOSTMONITOR, REPORTSMANAGER, SERVICEMONITOR]
hosts:
  configs:
    host_default_proc_memswap_thresholds:
      warning: never
      critical: never
    host_memswap_thresholds:
      warning: never
      critical: never
    host_config_suppression_agent_system_user_group_validator: true
cloudera_manager_options:
  CUSTOM_BANNER_HTML: "Cloudera Blog Deployment Example"
#cloudera_manager_csds:
#  - https://archive.cloudera.com/p/specific_csd_location
definition.yml can be found here.
Setting up your inventory
This automation supports both dynamic and static inventories. Dynamic means that we provision virtual machines (in AWS) and then build a cluster on those hosts, however they are named. Static means that we define a configuration file listing pre-existing machines on which to build our cluster.
For a dynamic inventory we need to have configured the cloud credentials above and set the infra_type in either our profile file or in extra_vars. We also need to provide an inventory_template.ini file into which the playbook can substitute any cloud-provided hostnames. Our inventory template will look like this:
[cloudera_manager]
host-1.example.com

[cluster_master_nodes]
host-2.example.com host_template=Master1

[cluster_worker_nodes]
host-3.example.com
host-4.example.com
host-5.example.com

[cluster_worker_nodes:vars]
host_template=Workers

[cluster:children]
cluster_master_nodes
cluster_worker_nodes

[krb5_server]
host-6.example.com

[db_server]
host-6.example.com

[deployment:children]
cluster
cloudera_manager
db_server
krb5_server
In this file we have groups defined for cloudera_manager, cluster_master_nodes, cluster_worker_nodes, krb5_server, and the db_server. The inventory links to the cluster host templates through the host_template variable, which is assigned here to both the cluster_worker_nodes and the cluster_master_nodes. Note: each host can only have one host template. In this file, the number of unique hosts determines the number of hosts provisioned by the playbook. Note also that the example.com hostnames are just placeholders and will be replaced by the provisioned instance hostnames.
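Because the number of unique hosts drives how many machines are provisioned, it can be useful to count them before a run. The following is a minimal sketch: count_inventory_hosts is a hypothetical helper that assumes the simple INI layout shown above, where the first field of each line in a plain group is a hostname.

```shell
# Count the unique hostnames in an INI-style Ansible inventory.
# Skips group headers, comments, blank lines, and the bodies of
# [group:children] and [group:vars] sections, which list no hosts.
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
count_inventory_hosts() {
    awk '
        # A section header: remember whether it is a :children or :vars section.
        /^\[/ { skip = ($0 ~ /:(children|vars)\]/); next }
        # Ignore non-host sections, blank lines, and comments.
        skip || /^[ \t]*$/ || /^[ \t]*[#;]/ { next }
        # The first field of a host line is the hostname.
        { hosts[$1] = 1 }
        END { n = 0; for (h in hosts) n++; print n }
    ' "$1"
}

# Example:
#   count_inventory_hosts definitions/mydefinition/inventory_template.ini
```

A host that appears in several groups is counted once, which matches how the template above reuses one host for both the krb5_server and db_server groups.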
If we wish to use a static inventory, we can create exactly the same file, except replacing host-*.example.com with our provided hostnames. We may also wish to specify any SSH keys or Ansible variables for the inventory here, for example:
[deployment:vars]
ansible_ssh_private_key_file=~/.ssh/root_key
ansible_user=root
The static inventory file can either be named inventory_static.ini, or passed in as an argument to the playbook execution using the ‘-i’ Ansible runtime flag.
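One way to produce a static inventory from the template is to swap each placeholder for a real hostname. The sketch below illustrates the idea: render_static_inventory and the two-column mapping file are hypothetical, and the function assumes the mapping file is non-empty and that hostnames contain no spaces.

```shell
# Build a static inventory by swapping placeholder hostnames for real ones.
#   $1: inventory template
#   $2: mapping file with "placeholder real-hostname" on each line
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
render_static_inventory() {
    awk '
        # First file (the mapping): remember placeholder -> real hostname.
        NR == FNR { if (NF == 2) real[$1] = $2; next }
        # Second file (the template): replace a known placeholder in field 1.
        ($1 in real) { $1 = real[$1] }
        { print }
    ' "$2" "$1"
}

# Example:
#   render_static_inventory inventory_template.ini hosts.map > inventory_static.ini
```

Group headers, variables, and child group names pass through unchanged because they never match a placeholder in the mapping.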
Running the playbook
Once we have the definition and the inventory set up, running the playbook is fairly straightforward. We can run the playbook in stages using some specific tags, or just run the whole thing end to end. We’ve spent time making sure that we can start and restart the playbook without needing to clean anything up in between runs.
To run the playbook use the following command:
ansible-playbook /opt/cloudera-deploy/main.yml \
  -e "definition_path=definitions/mydefinition" <extra arguments>
Other options that you may wish to pass to this command:
| Option | Value | Description |
|---|---|---|
| -i | inventory_static.ini | Specify a static inventory to be used instead of a dynamic inventory |
| -e (--extra-vars) | key1=value1<space>key2=value2 | Specify additional variables to the runtime (e.g. admin_password) |
| --ask-pass | <no value required> | For use when running the playbook without public/private keys; Ansible will prompt for an SSH password |
| -t (--tags) | <Comma separated list of tags> | To run the playbook in increments |
| -v | 0 through to 3 (repeat the flag, e.g. -vvv) | Turn on verbose logging |
As an example:
ansible-playbook /opt/cloudera-deploy/main.yml \
  -e "definition_path=definitions/mydefinition" \
  -i /opt/cloudera-deploy/definitions/mydefinition/inventory_static.ini \
  --ask-pass
You can also set the ANSIBLE_LOG_PATH environment variable to ensure that logs are saved to disk and not lost when you close the terminal.
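As a sketch of how that might look in practice, the helper below gives each run its own timestamped log file. The function name set_ansible_log_path, the default directory, and the file-name pattern are all assumptions chosen for illustration; only the ANSIBLE_LOG_PATH variable itself comes from Ansible.

```shell
# Create a log directory and export a timestamped ANSIBLE_LOG_PATH inside it,
# so that each playbook run writes to its own log file.
#   $1: log directory (optional; the default below is an arbitrary choice)
# (Hypothetical helper for illustration; not shipped with cloudera-deploy.)
set_ansible_log_path() {
    log_dir="${1:-$HOME/cloudera-deploy-logs}"
    mkdir -p "$log_dir" || return 1
    ANSIBLE_LOG_PATH="$log_dir/deploy-$(date +%Y%m%d-%H%M%S).log"
    export ANSIBLE_LOG_PATH
}

# Example:
#   set_ansible_log_path && ansible-playbook /opt/cloudera-deploy/main.yml \
#       -e "definition_path=definitions/mydefinition"
```

Keeping one file per run makes it easier to compare a failed run against a later successful one.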
The playbook handles the installation of the supporting infrastructure, Cloudera Manager, the CDP Private Cloud Base cluster, and a KeyTrustee cluster (if required by your submitted configuration). Cluster deployments are normally constrained by the network bandwidth available for parcel distribution and by the speed of your hardware, but it’s realistic to deploy a small to medium sized cluster in less than two hours.
In this blog we walked through the mechanics of how to automate the deployment of CDP Private Cloud Base onto physical or virtual machines, including in the public cloud. With a simple definition, split into three configuration files for ease of use, we’ve been able to control all aspects of the cluster deployment, including integration with the enterprise infrastructure.
Automation at this scale greatly improves the time to value of CDP Private Cloud Base. Through the use of automation we can deploy multiple clusters more quickly and with much greater consistency. If needed, environments can be rebuilt for specific purposes, or templated for even more rapid deployment. And with more repeatable deployments, administrators and developers can spend more time onboarding tenants and developing new pipelines and insights, rather than deploying clusters.