Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), SAMBA, and SSSD to deploy a secure EDH cluster for workloads in the public cloud.
Authenticating users is the first line of security we recommend for Apache Hadoop. As with most, if not all, relational databases, each user is given a username and a password to validate their identity, and must authenticate before accessing any data those systems manage. The goal is the same in Apache Hadoop. Because the Hadoop stack has no authentication component of its own, a Kerberos Key Distribution Center (KDC) is used to identify users.
Two Kerberos KDC implementations are supported on a CDH cluster: an MIT KDC installation, integration with the Kerberos KDC built into Microsoft Active Directory (AD), or both. We generally recommend the latter to our enterprise customers, and this blog focuses on direct integration of CDH with the Active Directory KDC. This integration is favored because other tools in the deployment will also communicate with Active Directory.
Active Directory is best known for its Domain Services (AD DS), an identity management service that authenticates users and groups. However, AD includes other powerful services, such as Certificate Services (AD CS) and DNS (AD DNS).
On May 6, 2016, my colleague Ben Spivey wrote a blog on securing a cluster on Amazon AWS that covered the AD DS and AD CS services in depth. For more details on those services, Ben’s blog is a good place to start. This blog spends more time on the AD DNS service.
Active Directory Domain Name System
Deploying a CDH cluster requires both forward and reverse name resolution for internal IP addresses. When deploying a cluster on-premises, this is usually done by your system administrator. When you deploy a cluster on Amazon AWS, this is automatically configured when you launch an EC2 instance.
A forward DNS lookup resolves a Fully Qualified Domain Name (FQDN) to an IP address, and a reverse DNS lookup does the opposite, resolving an IP address to an FQDN. Currently, Microsoft Azure does not provide reverse DNS lookup for internal private IP addresses. We will return to this limitation later.
There are many options for DNS when deploying on Azure. For example, you can install the supported BIND package for your Linux OS or use an existing Active Directory Domain Name System. This blog covers AD DNS in more detail.
If a reverse DNS zone does not already exist, ask your AD administrator to configure one in DNS Manager, as seen below.
The important section in the figure above is the red box under “Reverse Lookup Zones”. It shows the zone configured to host all the DNS objects for a particular subnet.
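To make the reverse direction concrete: a reverse lookup queries a PTR record whose name is the IPv4 octets in reverse order under `in-addr.arpa`. The sketch below only builds that name. The helper name `ptr_name` and the address `10.0.0.4` are illustrative assumptions, not values from this deployment.

```shell
# Hypothetical helper: print the PTR record name that a reverse lookup
# queries for a given IPv4 address.
ptr_name() {
  local IFS=.
  # Split the dotted quad into $1..$4 on the dots.
  set -- $1
  printf '%s.%s.%s.%s.in-addr.arpa\n' "$4" "$3" "$2" "$1"
}

ptr_name 10.0.0.4   # prints 4.0.0.10.in-addr.arpa
```

A reverse lookup zone in DNS Manager holds records with names of this form for one subnet.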
This is a view of the “Forward Lookup Zones” for the CLOUDERA.MORANTUS.COM domain.
And here is a view of my OU tree, showing zero entries.
Azure Virtual Machine
I provisioned a VM in Azure with all the default DNS settings, and we will join it to our AD DS and DNS services.
As you can see, the hostname -f command displays a very long FQDN for my VM, and hostname -i gives us the IP address associated with the VM. Next, I did a forward DNS lookup using the host command with the FQDN, which resolved to the IP address. Then I did a reverse DNS lookup using the host command with the IP address. As shown in the red box above, it did not locate a reverse entry for that IP address. A reverse lookup is a requirement for a CDH deployment. We’ll revisit this later.
To configure our RHEL 6.7 VM to communicate with Active Directory, we need to set up a tool called Samba. Samba is a Linux-based utility that enables the integration of Linux systems with AD.
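The checks described above can be wrapped in a small function. This is a minimal sketch, assuming the `host` utility is installed (it ships with the BIND utilities); the function name `check_dns` is hypothetical and is not part of the original walkthrough.

```shell
# Hypothetical helper: check forward and reverse resolution for one host.
# With no arguments it uses this machine's own FQDN and IP address.
check_dns() {
  local fqdn ip rc=0
  fqdn=${1:-$(hostname -f)}
  ip=${2:-$(hostname -i)}

  # Forward lookup: FQDN -> IP address.
  if host "$fqdn" >/dev/null 2>&1; then
    echo "forward OK: $fqdn"
  else
    echo "forward MISSING: $fqdn"; rc=1
  fi

  # Reverse lookup: IP address -> FQDN.
  if host "$ip" >/dev/null 2>&1; then
    echo "reverse OK: $ip"
  else
    echo "reverse MISSING: $ip"; rc=1
  fi
  return "$rc"
}

# Usage on the VM:
#   check_dns
```

On a VM with the default Azure DNS settings, the reverse check is the one expected to fail.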
Join the VM to AD with Samba
- Ensure the DNS servers property for your Virtual Network in the Azure portal is pointed to your AD server.
- Install packages needed to integrate with AD
sudo yum install -y samba-common krb5-workstation openldap-clients
- Configure the VM to point to the AD DNS server
The nameserver is the IP address of the AD server. This can also be accomplished by running “service network restart” on the VM.
- Configure Samba to join the AD domain and verify the entry in AD. This must be executed as a privileged user. In this case, “jmorantus” is an admin account in Active Directory.
As you can see above, we succeeded in joining our VM to the AD domain, and an AD object was created in the “servers” OU.
- Configure the Kerberos krb5.conf file and generate a keytab file, which is used to update DNS in AD
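As a rough sketch of what the resolver configuration might contain: the domain `cloudera.morantus.com` matches the AD domain used in this blog, while the helper name and the address `10.0.0.4` are placeholders for your own AD server. Note that a hand-edited /etc/resolv.conf may be overwritten by the DHCP client, which is one reason to set the DNS server on the Virtual Network as well.

```shell
# Hypothetical helper: print resolv.conf content that points at the AD DNS server.
render_resolv_conf() {
  local domain=$1 nameserver=$2
  printf 'search %s\nnameserver %s\n' "$domain" "$nameserver"
}

# Usage (placeholder IP address; substitute your AD server):
#   render_resolv_conf cloudera.morantus.com 10.0.0.4 | sudo tee /etc/resolv.conf
```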
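For readers following along without the screenshots, the join step might look roughly like this. It is a sketch under assumptions: the NetBIOS workgroup name `CLOUDERA` is a guess for this domain, the helper name is hypothetical, and the smb.conf shown is a minimal one rather than the exact file used here.

```shell
# Hypothetical helper: print a minimal smb.conf [global] section for an AD join.
render_smb_conf() {
  local workgroup=$1 realm=$2
  cat <<EOF
[global]
   workgroup = $workgroup
   realm = $realm
   security = ads
   kerberos method = secrets and keytab
EOF
}

# Usage (needs root and the AD admin password; not run automatically):
#   render_smb_conf CLOUDERA CLOUDERA.MORANTUS.COM | sudo tee /etc/samba/smb.conf
#   sudo net ads join createcomputer=servers -U jmorantus   # create the object in the "servers" OU
#   sudo net ads testjoin                                   # verify the join
```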
- Create or update the forward and reverse DNS entries
View of Forward DNS entry added to AD DNS service
View of reverse DNS entry added to AD DNS service.
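The last two steps can be sketched as follows. This is one plausible way to do it, not necessarily the exact commands behind the screenshots: it assumes the machine account's keytab is used to authenticate a GSS-TSIG dynamic update via `nsupdate -g`. The helper names, the TTL, and the host name and address in the usage notes are illustrative.

```shell
# Hypothetical helper: print a minimal krb5.conf for the AD realm.
render_krb5_conf() {
  local realm=$1 kdc=$2 lower
  lower=$(printf '%s' "$realm" | tr 'A-Z' 'a-z')
  cat <<EOF
[libdefaults]
  default_realm = $realm

[realms]
  $realm = {
    kdc = $kdc
    admin_server = $kdc
  }

[domain_realm]
  .$lower = $realm
  $lower = $realm
EOF
}

# Hypothetical helper: print the nsupdate commands that add an A record and
# the matching PTR record. Each record lives in a different zone, so each
# update is sent separately.
dns_update_script() {
  local fqdn=$1 ip=$2 ttl=${3:-3600}
  local IFS=.
  set -- $ip   # split the address into its four octets
  local ptr="$4.$3.$2.$1.in-addr.arpa"
  printf 'update add %s %s A %s\nsend\n' "$fqdn" "$ttl" "$ip"
  printf 'update add %s %s PTR %s.\nsend\n' "$ptr" "$ttl" "$fqdn"
}

# Usage (needs root; the AD server name below is a placeholder):
#   render_krb5_conf CLOUDERA.MORANTUS.COM ad.cloudera.morantus.com | sudo tee /etc/krb5.conf
#   sudo net ads keytab create -U jmorantus            # write host keys to the keytab
#   sudo kinit -k "$(hostname -s | tr a-z A-Z)\$"      # authenticate as the machine account
#   dns_update_script "$(hostname -f)" "$(hostname -i)" | sudo nsupdate -g
```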
Note: Active Directory ages out DNS entries that it considers “inactive”. An additional process should be implemented to keep these entries “alive” in AD.
The System Security Services Daemon (SSSD) caches user and group information locally on a Linux system. This integration is also necessary to configure authorization with Apache Sentry for data access.
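A minimal SSSD configuration for an AD-joined host might look like the sketch below. It assumes SSSD's `ad` identity provider and reuses the domain name from this blog. The helper name and the home directory and shell defaults are illustrative choices, not the exact settings from the original setup.

```shell
# Hypothetical helper: print a minimal sssd.conf that uses the AD provider.
render_sssd_conf() {
  local domain=$1
  cat <<EOF
[sssd]
config_file_version = 2
services = nss, pam
domains = $domain

[domain/$domain]
id_provider = ad
access_provider = ad
cache_credentials = true
fallback_homedir = /home/%u
default_shell = /bin/bash
EOF
}

# Usage (needs root):
#   render_sssd_conf cloudera.morantus.com | sudo tee /etc/sssd/sssd.conf
#   sudo chmod 600 /etc/sssd/sssd.conf                          # SSSD rejects looser permissions
#   sudo authconfig --enablesssd --enablesssdauth --update      # add sss to nsswitch and PAM
#   sudo service sssd start
```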
Now that SSSD is fully configured, we’ll verify we can read user information from AD.
Here you can see that with SSSD stopped, the VM does not know of the user “scm-cloudera”. With SSSD running, the user information is pulled from AD. If you are looking for a commercial option, Cloudera also recommends Centrify.
You should now be able to configure a VM on Azure, join it to an AD domain, and create DNS entries in the AD DNS server. These steps will work for any other cloud provider and for on-premises deployments. In Part 2 of this series, we’ll cover creating a Kerberized cluster with Cloudera Director on Azure.
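The verification above can be repeated with a one-line check. This is a sketch: `user_known` is a hypothetical helper, and it relies only on the standard `id` command reporting whether the system can resolve a user.

```shell
# Hypothetical helper: succeed only if the system can resolve the given user.
user_known() {
  id "$1" >/dev/null 2>&1
}

# Usage on the VM:
#   sudo service sssd stop
#   user_known scm-cloudera || echo "scm-cloudera is unknown"
#   sudo service sssd start
#   id scm-cloudera
```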
James Morantus is a Senior Solution Consultant at Cloudera