Post written by Cloudera Software Engineer Aaron T. Myers.
Apache Hadoop has had methods of doing user authorization for some time. The Hadoop Distributed File System (HDFS) has a permissions model similar to Unix to control file and directory access, and MapReduce has access control lists (ACLs) per job queue to control which users may submit jobs. These authorization schemes allow Hadoop users and administrators to specify exactly who may access Hadoop’s resources. However, until recently, these mechanisms relied on a fundamentally insecure method of identifying the user who is interacting with Hadoop. That is, Hadoop had no way of performing reliable authentication. This limitation meant that any authorization system built on top of Hadoop, while helpful to prevent accidental unwanted access, could do nothing to prevent malicious users from accessing other users’ data.
Prior to the availability of Hadoop’s security features, the only way an organization could meet the requirement for data access protection was to run multiple distinct Hadoop clusters, and to segregate the groups who have network access to these clusters. This has obvious cost effectiveness implications, but, more importantly, limits the flexibility an organization has with respect to data storage options. One of the inherent powers of Hadoop is the ability to store and correlate all of an organization’s data. This is impossible if one must a priori relegate data to multiple distinct clusters based on security requirements. Furthermore, because of some organizations’ internal security policies, certain types of data could not be stored in Hadoop at all.
While this was acceptable for many of the first organizations to leverage Hadoop, the increase in Hadoop’s popularity and penetration into traditional enterprises necessitated the addition of better authentication mechanisms.
Among many of the new features introduced as part of CDH3 Beta 3, Hadoop now has the ability to provide strong authentication guarantees. The core Hadoop security work was done almost completely by Yahoo! and subsequently contributed to Apache Hadoop. Rather than create an ad hoc Hadoop-specific authentication scheme, Hadoop’s authentication system leverages Kerberos. Kerberos is an industry-standard authentication system developed by MIT which has been in existence since 1989. There are multiple open source implementations of Kerberos, including one produced and maintained by MIT itself. Kerberos is also the authentication system underpinning many proprietary identity management systems commonly found in enterprise environments, including Microsoft’s Active Directory. Hadoop’s support of Kerberos enables organizations to seamlessly integrate the new authentication features of Hadoop with their existing authentication and single sign-on systems.
All of the components of CDH3 Beta 3 now have support for interacting with secure Hadoop clusters, and many have incorporated additional security features which were previously impossible or impractical to implement with the security limitations inherent in Hadoop itself. Because of the complexity of integrating with multiple third-party authentication systems, configuring Hadoop and its associated components to use these systems is non-trivial.
Cloudera is pleased to announce the general availability of Cloudera’s “CDH3 Security Guide”. In this comprehensive guide, you’ll find instructions for enabling the security features of Hadoop itself, as well as for configuring all of the other components of CDH to be able to interact with a Hadoop cluster with security enabled. You’ll also find a troubleshooting guide for debugging common errors encountered when configuring a secure Hadoop environment, as well as details for configuring Hadoop’s authentication mechanism to use Active Directory. Please email firstname.lastname@example.org if you have any questions or encounter any issues.