Cloudera provides a pathway for sharing metadata from an Altus Director managed cluster with Cloudera Altus Data Engineering or Altus Data Warehouse clusters. This blog post outlines how to use Altus Director to set up the required infrastructure and how to configure the CDH components to enable this functionality.
SDX for Cloudera Altus persists both Apache Hive metadata and Apache Sentry data access policies in SDX namespaces, independently from clusters. In this way, SDX for Cloudera Altus provides the missing link for sharing metadata and security policies between workloads managed by Altus Director and workloads managed by Altus services. Separating metadata from the compute resources enables transient workloads, and users never have to worry about losing data context, even when a cluster is terminated. Whenever new clusters are created and attached to the same SDX namespace, the existing table metadata and access policies apply to the new cluster right away.
The following example demonstrates how to share metadata between a cluster created by Altus Director and an Altus Data Warehouse cluster.
Before getting started:
- Ensure that an Altus Director instance is running.
- Work with an Altus admin to create a user and allow that user to create and manage the following resources:
  - SDX namespaces
  - Altus environments
  - Data Warehouse clusters
Altus Director: Creating a Cluster with an External Database
Configured SDX namespaces, which are helpful for migrating from Altus Director to other Altus services, require Apache Hive and Apache Sentry to be set up to use an external database server. Altus Director can create a CDH deployment and attach an external database to the cluster. This database can be either MySQL or PostgreSQL. Altus Director can also create an RDS database in AWS on the user's behalf for use as the external database. An external database in Altus Director operates the same as an Altus SDX namespace.
The Altus Director team provides sample configurations to quickly get started. Here is a sample SDX configuration. This configuration includes Hive and Impala, but not Spark. If the use case in question requires other services, there are several reference configurations in this same git repository.
In order to take advantage of RDS, users need to modify the above configuration to create an RDS server instead of using an existing database. This configuration provides an example of specifying RDS database servers.
The following snippet, which should be included in the Altus Director configuration, shows how to combine Sentry with an RDS database:
    # define a database name for reuse later
    rds {
        name: "jheyming-mysql1"
    }

    databaseServers {
        # the name of the RDS database that will be created in AWS by Director
        jheyming-mysql1 {
            type: mysql
            user: root
            password: <redacted>
            instanceClass: db.m3.medium
            dbSubnetGroupName: my-subnet-group
            vpcSecurityGroupIds: "sg-12345678"
            allocatedStorage: 10
            engineVersion: 5.5.53
            tags {
                owner: ${?USER}
            }
        }
    }

    # cluster configuration
    cluster {
        databaseTemplates: {
            HIVE {
                name: hivetemplate
                databaseServerName: ${rds.name}
                databaseNamePrefix: hive
                usernamePrefix: hiveu
            }
            # ... repeat for HUE, OOZIE, SENTRY
        }

        # Sentry admin groups
        # important for later when trying to administer databases in the Hue query editor
        configs {
            SENTRY {
                sentry_service_admin_group: "hive,impala,hue,solr,svc_admin"
                sentry_service_allow_connect: "hive,impala,hue,hdfs,solr,svc_admin"
            }
        }
    }

    # cloudera manager can use the database too
    cloudera-manager {
        databaseTemplates {
            CLOUDERA_MANAGER {
                name: cmtemplate
                databaseServerName: ${rds.name}
                databaseNamePrefix: cm
                usernamePrefix: cmu
            }
            # repeat for ACTIVITYMONITOR, REPORTSMANAGER, NAVIGATOR, NAVIGATORMETASERVER…
        }
    }
Once the configuration is ready, use the Altus Director bootstrap-remote CLI to execute it:
    $ cloudera-director bootstrap-remote \
        /Users/jheyming/sdx-with-rds.conf \
        --lp.remote.hostAndPort=localhost:7189 \
        --lp.remote.username=admin \
        --lp.remote.password=admin
The user who runs this command must have the ability to create EC2 instances as well as create RDS databases in AWS. This user must also have access to the VPC subnets and security groups in this configuration. Learn more here.
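Before bootstrapping, it can help to confirm that the AWS credentials in use can at least see the resources named in the configuration. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the same account and region. The `run` and `preflight` functions and the `DRY_RUN` variable are hypothetical helpers, not part of Altus Director. These calls are read-only, so they confirm visibility of the subnet group and security group but do not prove permission to create EC2 instances or RDS databases.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; not part of Altus Director.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: RDS DB subnet group name, $2: VPC security group ID
# (both taken from the Director configuration file)
preflight() {
  run aws sts get-caller-identity &&
  run aws rds describe-db-subnet-groups --db-subnet-group-name "$1" &&
  run aws ec2 describe-security-groups --group-ids "$2"
}
```

With the values from the snippet above, the call would be `preflight my-subnet-group sg-12345678`. Setting `DRY_RUN=1` first prints the three AWS CLI commands without contacting AWS.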
The progress of the bootstrap command can be viewed in the Altus Director UI:
Cloudera Manager: Working with Hue
Now that Cloudera Manager is up and running, Sentry permissions must be configured via Impala SQL using Hue. To do that, add a user as a Sentry admin and create a parallel Hue user (in this case, we’ll create a user jheyming to use as the administrator). Navigate to Cloudera Manager via Altus Director:
Log into the Cloudera Manager server and navigate to the Hue service. There, find out the IP address for the Hue server.
Making note of the IP address, go to that address at port 8888. Cloudera Manager provides quick links to navigate to the host. Before logging into Hue, create an admin user in order to create some tables.
One way to set up this user for Hue is to log in to each host listed in Cloudera Manager and run the useradd command. Hue with Sentry can authorize these users to have access to the Hue interface. This user also needs to be a Sentry admin. Look back at the Director configuration, which defined the Sentry admin groups. One of those admin groups is impala. Use this same group and assign this user to it.
To create the user on every host quickly, take the IP address of each host from Cloudera Manager and run useradd over SSH in a loop. Here is an example of doing this in bulk:
    for host in 10.38.0.250 10.38.1.128 10.38.3.107 10.38.3.141 \
                10.38.4.231 10.38.5.181 10.38.7.181 10.38.7.206
    do
        ssh -t -o UserKnownHostsFile=/dev/null \
            -o StrictHostKeyChecking=no \
            -i /Users/jheyming/.ssh/my.pem \
            centos@$host \
            "sudo useradd -g 479 jheyming --password='encrypted'"
    done
    # 'encrypted' stands in for the already-hashed password that useradd expects.
    # @see https://serverfault.com/questions/367559/how-to-add-a-user-without-knowing-the-encrypted-form-of-the-password
    # 479 was found to be the group id for impala by inspecting /etc/group.
    # I chose to use the 'impala' group because this group was allocated
    # administrative rights for the Sentry service in our Altus Director
    # conf file. If you don't want to use the built-in impala group, you
    # can recreate your cluster with your own custom admin group.
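The group id 479 in the loop above came from inspecting /etc/group, and it may differ on another cluster. The following is a small sketch for looking it up rather than hard-coding it. `gid_for_group` is a hypothetical helper name; it only parses the standard `name:password:GID:members` layout of group-file lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read group-file lines (name:password:GID:members)
# on stdin and print the GID of the group named in $1.
gid_for_group() {
  awk -F: -v name="$1" '$1 == name { print $3 }'
}

# Demonstration on literal lines; on a cluster host, feed it the real file:
#   gid_for_group impala < /etc/group
printf 'hive:x:480:\nimpala:x:479:impala\n' | gid_for_group impala
```

On a host where the group already exists, `getent group impala | cut -d: -f3` gives the same answer in one line.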
Then, on the Hue server, log in as the admin user with admin as the password, and create a user with the same name (for example, jheyming) using the User Management page.
Log in as that user and run the following queries:
    create role admin;
    grant role admin to group `impala`;
    grant all on server server1 to role admin;
    create database jheyming_director;
Group impala was added as an admin Sentry group in the Altus Director config snippet above.
Granting the admin role to the Impala Sentry admin group allows the new user to run the create database command above.
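To confirm that the grants took effect, the role and privilege listings can be inspected with Impala's `SHOW ROLES`, `SHOW ROLE GRANT GROUP`, and `SHOW GRANT ROLE` statements. The sketch below simply prints those statements for a given role and group so they can be pasted into the Hue query editor. `sentry_check_sql` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Impala statements that list the Sentry roles,
# the roles granted to group $2, and the privileges held by role $1.
sentry_check_sql() {
  printf 'show roles;\n'
  printf 'show role grant group `%s`;\n' "$2"
  printf 'show grant role %s;\n' "$1"
}

# Role and group names used in the queries above.
sentry_check_sql admin impala
```

Where impala-shell access to the cluster is available, the same text could be passed to it with the `-q` option. Any authentication options a secured cluster needs are outside this sketch.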
SDX with Altus
On the Altus console, create an environment that has appropriate access to the RDS databases created for Hive and Sentry. The simplest way is to use the same region, security group, and IAM role as the cluster that was created using Altus Director. A power user can use the “Environment Wizard” to create an environment. In the absence of a user with the “PowerUser” role, an administrator can create the necessary environment.
Ensure the environment has the secure checkbox checked so that it can take advantage of the same Sentry metadata as the cluster that Altus Director created.
Next, create an SDX namespace in the Altus console to associate with the previously created RDS database.
Navigate to SDX in the Altus console and create a “Configured Namespace.”
The form prompts the user for a few things:
- Two JDBC URLs: one for Hive and one for Sentry
- The credentials to access the databases
These URLs can be found in Altus Director. Navigate to the “Cluster Details” page and click “Learn More” in the Database Servers section to reveal the JDBC URLs for Hive and Sentry.
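For orientation, a MySQL JDBC URL has the shape `jdbc:mysql://host:port/database` (the PostgreSQL equivalent begins with `jdbc:postgresql://`). The sketch below only illustrates that shape. `mysql_jdbc_url` is a hypothetical helper, and the host and database name are placeholders. The real values should be copied from the Altus Director page described above, since Director generates the database names from the configured prefixes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a MySQL JDBC URL.
# $1: database host, $2: database name, $3: port (optional, defaults to 3306)
mysql_jdbc_url() {
  printf 'jdbc:mysql://%s:%s/%s\n' "$1" "${3:-3306}" "$2"
}

# Placeholder host and database name, for illustration only.
mysql_jdbc_url db.example.com hive_example
```

The Hive and Sentry URLs entered in the form would normally share the same host and port and differ only in the database name.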
After creating the SDX namespace, make sure to note the admin group at the bottom of the namespace details page.
This group is automatically created by Altus. Any Altus user that is added to this group will have Sentry admin privileges in the SDX namespace. Altus uses this group to automatically perform the manual steps specified above that granted Sentry admin permissions to a user.
Now, we can navigate to Altus Data Warehouse and create a cluster using the Altus environment created earlier:
After creating the cluster, navigate to the Query Editor and run the following Impala SQL statements to grant admin privileges to the SDX admin group:
    grant role admin to group `adminjheymingd_7d9715d8_55dc4fc7`;
    create database jheyming_sdx_altus;
To prove it is all working, invalidate the cache in Hue. Showing the databases should now reveal both of the new databases.
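The same check can be scripted. The sketch below assumes a listing with one database per line and the name in the first tab-separated column, which is what `show databases` produces when impala-shell runs in delimited mode (`-B`). `missing_databases` is a hypothetical helper that prints any expected database absent from the listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read database names (first tab-separated column, one per
# line) on stdin and print each expected name from the arguments that is absent.
missing_databases() {
  local listing db
  listing="$(cut -f1)"
  for db in "$@"; do
    printf '%s\n' "$listing" | grep -qxF -- "$db" || printf '%s\n' "$db"
  done
}

# Demonstration on a literal listing; prints nothing because both are present.
printf 'default\tDefault Hive database\njheyming_director\t\njheyming_sdx_altus\t\n' |
  missing_databases jheyming_director jheyming_sdx_altus
```

Where impala-shell access is available, the listing could come from `impala-shell -i <impalad-host> -B -q 'invalidate metadata; show databases;'` piped into the function. The host is a placeholder, and a secured cluster will need its own authentication options.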
Success!!
Here we see that the database created on the Altus Director cluster is visible in the Altus console because we used SDX. This proves that an Altus Director cluster with external databases can use the same databases as SDX namespaces in Altus. Users can now safely terminate and re-create transient clusters in Altus and still see the same metadata on each new cluster.