Apache Hadoop is equipped with a robust and scalable security infrastructure. It is being used at some of the biggest cluster installations in the world, where hundreds of terabytes of sensitive and critical data are processed every day.
Owen O’Malley provided a nice overview of Apache Hadoop security in his blog post Motivations for Apache Hadoop Security. Devaraj Das also covered some of the core pieces of Apache Hadoop’s security architecture in his blog post The Role of Delegation Tokens in Apache Hadoop Security.
The intent of this blog post is to cover some of the features of the Apache Hadoop security infrastructure that will help cluster administrators fine-tune the security settings of their clusters.
Quality of Protection
The security infrastructure for Hadoop RPC uses the Java SASL APIs. Quality of Protection (QOP) settings can be used to enable encryption for the Hadoop RPC protocols.
Java SASL provides the following QOP settings:
- “auth” – This is the default setting and stands for authentication only. This implies that the client and server mutually authenticate during connection setup.
- “auth-int” – This stands for authentication and integrity. This setting guarantees integrity of data exchanged between client and server as well as authentication.
- “auth-conf” – This stands for authentication, integrity and confidentiality. This setting guarantees that data exchanged between client and server is encrypted and is not readable by a “man in the middle”.
Hadoop lets cluster administrators control the quality of protection via the configuration parameter “hadoop.rpc.protection” in core-site.xml. It is an optional parameter, and if it is not present, the default QOP setting of “auth” (authentication only) is used. The valid values for this parameter are:
- “authentication” : Corresponds to “auth”
- “integrity” : Corresponds to “auth-int”
- “privacy” : Corresponds to “auth-conf”
The default is kept at “authentication” (authentication only) because integrity checks and encryption have a cost in terms of performance.
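For example, a cluster administrator who wants full encryption on the RPC channel could add the following property to core-site.xml. This is a minimal sketch; “privacy” can be replaced with “authentication” or “integrity” as described above.
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>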
Hostname in the Principals
In a secure Hadoop installation, each of the Apache Hadoop daemon processes (Datanode, Namenode, Tasktracker, Jobtracker) has a Kerberos principal. For example, a datanode principal could look like datanode/datanode-hostname@realm. It is a common practice to use the hostname in the middle because it gives uniqueness to the principal names for each datanode or tasktracker. There are two main reasons why it is important to use unique principal names.
- If the Kerberos credentials (keytab) for one datanode are compromised, it won’t lead to all datanodes being compromised.
- If multiple datanodes with the same principal connect to the namenode simultaneously, and the Kerberos authenticators being sent happen to have the same timestamp, then the authentication is rejected as a replay request.
However, hostname in the principal means that the datanode principal must be separately configured for each datanode in the cluster, which could mean several hundred machines. Hadoop provides a cool feature to simplify the configuration. In hdfs-site.xml (or mapred-site.xml for task trackers), the principals can also be specified using the _HOST string for the hostname in the middle. The principal in the datanode example mentioned above can also be specified as datanode/_HOST@realm in the configuration file. Please note that the actual principal is still datanode/datanode-hostname@realm, and _HOST is just a placeholder for datanode-hostname. Hadoop interprets and replaces _HOST appropriately wherever needed. Thus, each datanode has the same value for dfs.datanode.kerberos.principal in the configuration even though the principals are different.
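As a sketch, the corresponding hdfs-site.xml entry could look like the following, where YOUR.REALM is a placeholder for the actual Kerberos realm of the cluster:
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>datanode/_HOST@YOUR.REALM</value>
</property>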
Kerberos Principals and UNIX User Names
Hadoop uses group memberships of users at various places, such as to determine group ownership for files or for access control. A user is mapped to the groups it belongs to using an implementation of the GroupMappingServiceProvider interface. The implementation is pluggable and can be configured in core-site.xml.
Hadoop by default uses ShellBasedUnixGroupsMapping, which is an implementation of GroupMappingServiceProvider. It fetches the group membership for a user name by executing a UNIX shell command.
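As a sketch, spelling out the default implementation explicitly in core-site.xml would look like the following. The key hadoop.security.group.mapping is the standard configuration key for this setting; it is not discussed above, so verify it against the documentation for your Hadoop version.
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
</property>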
In secure clusters, since the user names are actually Kerberos principals, ShellBasedUnixGroupsMapping will work only if the Kerberos principals map to valid UNIX user names.
Hadoop provides a feature that lets administrators specify mapping rules to map a Kerberos principal to a local UNIX user name.
The rules are specified in core-site.xml with configuration key “hadoop.security.auth_to_local”. For example:
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@YOUR.REALM)s/@.*//
    RULE:[2:$1@$0](hdfs@.*YOUR.REALM)s/.*/hdfs/
    DEFAULT
  </value>
</property>
The rest of this section explains how these rules are interpreted and specified.
The default rule is simply “DEFAULT”, which maps all principals in your default realm to their first component. For example, both “username@APACHE.ORG” and “username/admin@APACHE.ORG” become “username” if your default realm is APACHE.ORG.
The translation rules have three sections: base, filter, and substitution.
The base specifies the number of components in the principal name (excluding the realm) and a pattern for building the name from the sections of the principal name. The pattern uses $0 to mean the realm, $1 to mean the first component and $2 to mean the second component.
For example:
[1:$1@$0] translates “username@APACHE.ORG” to “username@APACHE.ORG”
[2:$1] translates “username/admin@APACHE.ORG” to “username”
[2:$1%$2] translates “username/admin@APACHE.ORG” to “username%admin”
The filter is a regex in parentheses that must match the generated string for the rule to apply.
For example:
“(.*%admin)” will take any string that ends in “%admin”
“(.*@SOME.DOMAIN)” will take any string that ends in “@SOME.DOMAIN”
Finally, the substitution is a sed rule to translate a regex into a fixed string.
For example:
“s/@ACME.COM//” removes the first instance of “@ACME.COM”.
“s/@[A-Z]*.COM//” removes the first instance of “@” followed by a name followed by “.COM”.
“s/X/Y/g” replaces all of the “X” in the name with “Y”
So, if your default realm was APACHE.ORG, but you also wanted to take all principals from SOME.DOMAIN that had a single component “joe@SOME.DOMAIN”, you would use:
RULE:[1:$1@$0](.*@SOME.DOMAIN)s/@.*//
DEFAULT
To translate the names with a second component, you would make the rules:
RULE:[1:$1@$0](.*@SOME.DOMAIN)s/@.*//
RULE:[2:$1@$0](.*@SOME.DOMAIN)s/@.*//
DEFAULT
If you want to treat all principals from APACHE.ORG with /admin as “admin”, your rules would look like:
RULE:[2:$1%$2@$0](.*%admin@APACHE.ORG)s/.*/admin/
DEFAULT
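To sanity-check a rule set, a small Java program along the following lines can print how a given principal maps to a UNIX user name. This is a minimal sketch that assumes the org.apache.hadoop.security.HadoopKerberosName class shipped with secure Hadoop releases; the exact class location and behavior may vary across versions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.HadoopKerberosName;

public class AuthToLocalCheck {
  public static void main(String[] args) throws Exception {
    // Picks up hadoop.security.auth_to_local from the core-site.xml on the classpath
    HadoopKerberosName.setConfiguration(new Configuration());
    for (String principal : args) {
      // Print the short (UNIX) name produced by the configured rules
      System.out.println(principal + " -> " + new HadoopKerberosName(principal).getShortName());
    }
  }
}
For example, running it with “hdfs/namenode-hostname@YOUR.REALM” as an argument should print “hdfs” if the second example rule shown earlier is in effect.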
Thanks!
Apache Hadoop security was a collaborative effort of a team of engineers. I credit the content of this article to their outstanding work and extend my special thanks to Owen for the detailed explanation of auth_to_local rules and to Devaraj for his valuable suggestions.
— Jitendra Pandey