One of the more confusing topics in Hadoop is how authorization and authentication work in the system. The first and most important thing to recognize is the subtle, yet extremely important, distinction between authentication and authorization, so let’s define these terms first:
Authentication is the process of determining whether someone is who they claim to be.
Authorization is the function of specifying access rights to resources.
In simpler terms, authentication is a way of proving who I am, and authorization is a way of determining what I can do.
With its default configuration, Hadoop doesn’t perform any authentication of users. This is an important realization to make, because it can have serious implications in a corporate data center. Let’s look at an example.
Let’s say Joe User has access to a Hadoop cluster. The cluster does not have any Hadoop security features enabled, which means that there are no attempts made to verify the identities of users who interact with the cluster. The cluster’s superuser is hdfs, and Joe doesn’t have the password for the hdfs user on any of the cluster servers. However, Joe happens to have a client machine which has a set of configurations that will allow Joe to access the Hadoop cluster, and Joe is very disgruntled. He runs these commands:
sudo useradd hdfs
sudo -u hdfs hadoop fs -rmr /
The cluster goes off and does some work, and comes back and says “Ok, hdfs, I deleted everything!”.
So what happened here? Well, in an insecure cluster, the NameNode and the JobTracker don’t require any authentication. If you make a request, and say you’re hdfs or mapred, the NN/JT will both say “ok, I believe that,” and allow you to do whatever the hdfs or mapred users have the ability to do.
Hadoop has the ability to require authentication, in the form of Kerberos principals. Kerberos is an authentication protocol which uses “tickets” to allow nodes to prove their identity. If you need a more in-depth introduction to Kerberos, I strongly recommend checking out the Wikipedia page.
Hadoop can use the Kerberos protocol to ensure that when someone makes a request, they really are who they say they are. This mechanism is used throughout the cluster. In a secure Hadoop configuration, all of the Hadoop daemons use Kerberos to perform mutual authentication, which means that when two daemons talk to each other, they each make sure that the other daemon is who it says it is. Additionally, this allows the NameNode and JobTracker to ensure that any HDFS or MR requests are being executed with the appropriate authorization level.
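As a sketch of what enabling this looks like, Kerberos mode is switched on in core-site.xml. The property names below are Hadoop’s real security settings, but treat the snippet as illustrative: a working setup also requires keytabs, principals, and per-daemon configuration appropriate to your realm.

```xml
<!-- core-site.xml: switch from the default "simple" mode to Kerberos.
     With these set, daemons and clients must present valid Kerberos
     credentials instead of being taken at their word. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```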
Authorization is a much different beast than authentication. Authorization tells us what any given user can or cannot do within a Hadoop cluster, after the user has been successfully authenticated. In HDFS this is primarily governed by file permissions.
HDFS file permissions are very similar to BSD file permissions. If you’ve ever run `ls -l` in a directory, you’ve probably seen a record like this:
drwxr-xr-x 2 natty hadoop 4096 2012-03-01 11:18 foo
-rw-r--r-- 1 natty hadoop   87 2012-02-13 12:48 bar
On the far left is a string of letters. The first letter indicates whether the entry is a directory, and it is followed by three sets of three letters each. Those sets denote owner, group, and other-user permissions, and “rwx” stand for read, write, and execute permissions, respectively. The “natty hadoop” portion says that the files are owned by natty and belong to the group hadoop. As an aside, a stated intention is for HDFS semantics to be “Unix-like when possible”; the result is that certain HDFS operations follow BSD semantics, and others are closer to Unix semantics.
The real question here is: what is a user or group in Hadoop? The answer is: they’re strings of characters. Nothing more. Hadoop will very happily let you run a command like
hadoop fs -chown fake_user:fake_group /test-dir
The downside to doing this is that if that user and group really don’t exist, no one will be able to access that file except the superusers, who, by default, include hdfs, mapred, and other members of the hadoop supergroup.
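To make this concrete, here is a short transcript-style sketch of managing HDFS permissions with the familiar chown/chmod tools. The paths are hypothetical, and the commands assume a running cluster and sufficient privileges:

```shell
# Give /test-dir back to a real user and group, then lock it down so
# the group can read and list it but everyone else is shut out.
hadoop fs -chown natty:hadoop /test-dir   # owner and group are just strings to HDFS
hadoop fs -chmod 750 /test-dir            # rwxr-x---
hadoop fs -ls /                           # verify the new mode and ownership
```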
In the context of MapReduce, the users and groups are used to determine who is allowed to submit or modify jobs. In MapReduce, jobs are submitted via queues controlled by the scheduler. Administrators can define who is allowed to submit jobs to particular queues via MapReduce ACLs. These ACLs can also be defined on a job-by-job basis. Similar to the HDFS permissions, if the specified users or groups don’t exist, the queues will be unusable, except by superusers, who are always authorized to submit or modify jobs.
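As a sketch of what a queue ACL looks like in MR1, the entries below would go in mapred-queue-acls.xml (with ACL checking itself enabled via mapred.acls.enabled in mapred-site.xml). The property names are real, but the user and group names are illustrative; the value format is a comma-separated user list, a space, then a comma-separated group list:

```xml
<!-- mapred-queue-acls.xml: who may submit to the "default" queue. -->
<property>
  <name>mapred.queue.default.acl-submit-job</name>
  <value>natty,joe hadoop</value>
</property>
<!-- A leading space means "no users", so only hadoop group members
     may administer (kill, modify) jobs in this queue. -->
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value> hadoop</value>
</property>
```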
The next question to ask is: how do the NameNode and JobTracker figure out which groups a user belongs to?
When a user runs a hadoop command, the NameNode or JobTracker gets some information about the user running that command. Most importantly, it knows the username of the user. The daemons then use that username to determine what groups the user belongs to. This is done through the use of a pluggable interface, which has the ability to take a username and map it to a set of groups that the user belongs to. In a default installation, the user-group mapping implementation forks off a subprocess that runs `id -Gn [username]`. That provides a list of groups like this:
natty@vorpal:~/cloudera $ id -Gn natty
natty adm lpadmin netdev admin sambashare hadoop hdfs mapred
The Hadoop daemons then use this list of groups, along with the username, to determine whether the user has appropriate permissions to access the requested file. There are also other implementations packaged with Hadoop, including one that allows the system to be configured to get user-group mappings from an LDAP or Active Directory system. This is useful if the groups necessary for setting up permissions are resident in an LDAP system, but not in Unix on the cluster hosts.
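As a sketch, the mapping implementation is selected with the hadoop.security.group.mapping property in core-site.xml. The class name below is Hadoop’s real LDAP implementation, but the server URL is a placeholder, and a working setup typically needs further bind-user and search-filter properties:

```xml
<!-- core-site.xml: resolve a user's groups against LDAP instead of
     forking `id -Gn` on the NameNode/JobTracker host. -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com</value>
</property>
```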
Something to be aware of is that the set of groups that the NameNode and JobTracker are aware of may be different from the set of groups that a user belongs to on a client machine. All authorization is done at the NameNode/JobTracker level, so the users and groups on the DataNodes and TaskTrackers don’t affect authorization, although they may be necessary if Kerberos authentication is enabled. Additionally, it is very important that the NameNode and the JobTracker both be aware of the same groups for any given user, or there may be undefined results when executing jobs. If there’s ever any doubt about what groups a user belongs to, `hadoop dfsgroups` and `hadoop mrgroups` may be used to find out, according to the NameNode and JobTracker, respectively.
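For example, the two commands can be run side by side to check that both daemons agree (the username is illustrative, and the commands assume a running cluster):

```shell
# Ask each daemon how it resolves natty's groups; the two lists
# should match, or job execution may behave unpredictably.
hadoop dfsgroups natty   # groups as resolved by the NameNode
hadoop mrgroups natty    # groups as resolved by the JobTracker
```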
Putting it all together
A proper, safe security protocol for Hadoop may require a combination of authorization and authentication. Admins should look at their security requirements and determine which solutions are right for them, and how much risk they can take on regarding their handling of data. Additionally, if you are going to enable Hadoop’s Kerberos features, I strongly recommend looking into Cloudera Manager, which helps make the Kerberos configuration and setup significantly easier than doing it all by hand.