Securing an Apache Hadoop Cluster Through a Gateway

(Added 6/4/2013) Please note the instructions below are deprecated. Please refer to the CDH4 Security Guide for up-to-date procedures.

A few weeks ago we ran an Apache Hadoop hackathon. ApacheCon participants were invited to use our 10-node Hadoop cluster to explore Hadoop and play with some datasets that we had loaded in advance. One challenge we had to face was: how do we do this in a secure way? Hadoop does not offer much in the way of security. It provides a rudimentary file permission system on its distributed filesystem, HDFS, but does not verify that the username you are using is appropriate. (Whatever username you use to start your local Hadoop client process is used as your HDFS username; this account does not necessarily need to exist on the machines which host the HDFS NameNode or DataNodes.)

Even more problematically, anyone who can connect to the JobTracker can submit arbitrary code to run with the authority of the account used to start the Hadoop TaskTrackers on each node.

While there is not a perfect solution to multitenancy in a Hadoop environment, by using a proxying gateway, you can at least control which users have access to your cluster. The rest of this post describes how to set up such a gateway configuration.

The basic idea behind a proxying gateway is that direct connections to the JobTracker, NameNode, and slave machines (hosting the DataNodes and TaskTrackers) are disallowed except from within your internal network. These machines can all have private IP addresses. We configured our Hadoop machines to use addresses in the 10.x.x.x space. A separate machine,
called the gateway, has a public IP address. Users can connect to the gateway machine through ssh and set up a SOCKS tunnel which “passes through” connections to the internal nodes. The primary benefit of this system is that only users who have Linux accounts on the gateway machine can access the Hadoop cluster.

Client Configuration

For a user to connect to our Hadoop cluster, they’d have to modify their hadoop-site.xml file to include our master node’s DNS address in the fs.default.name and mapred.job.tracker fields. We gave them a prototype hadoop-site.xml file to download that looked like the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

  <!-- put your hadoophack username in here twice as comma,separated,values -->
  <property>
    <name>hadoop.job.ugi</name>
    <value>YOUR_USER_NAME,YOUR_USER_NAME</value>
  </property>

  <!-- If you changed your tunnel port, change it here.
       If you don't know what this means, then leave it alone. -->
  <property>
    <name>hadoophack.tunnel.port</name>
    <value>2600</value>
  </property>

  <!-- change these only if you know what you're doing -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>

  <!-- Don't change anything below here -->
  <property>
    <name>mapred.submit.replication</name>
    <value>5</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://server1.cloudera.com:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>server1.cloudera.com:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:${hadoophack.tunnel.port}</value>
  </property>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
</configuration>


The username that is applied to files you create and use in HDFS is taken, by default, from whatever username you use to start your Hadoop jobs. Since a user might not have the same account on our cluster as on their home machine, we wanted them to override this behavior. The hadoop.job.ugi parameter is a comma-delimited list: first the username, then whatever UNIX-style groups should be associated with the user. We asked our users to change this to the username we assigned to their account. Since you must be in at least one group (which can have the same name as your username), your username needs to be entered in this field twice.
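For example, a hypothetical user alice who also belongs to a group named hackathon (both names are placeholders, not accounts from our cluster) would set the property like this:

<property>
  <name>hadoop.job.ugi</name>
  <value>alice,alice,hackathon</value>
</property>

The first value is the username; every value after it is treated as a group.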

The fs.default.name and mapred.job.tracker fields were filled in with our head node (server1)’s DNS name. Since Hadoop MapReduce determines the mapred.system.dir property at job creation time, regardless of values set at the JobTracker or TaskTrackers, we set that for them as well, along with the default block size and the default number of reduce tasks.

Finally, this file also instructs Hadoop to make all connections through a SOCKS proxy (by way of the hadoop.socks.server and hadoop.rpc.socket.factory.class.default properties). This means that given the presence of a tunnel to our gateway node, Hadoop’s connections should be made through this tunnel.

Server Configuration

Since properties are initially set at the client when a Hadoop job is started, and then inherited throughout the rest of the system, a configuration change was needed inside the cluster as well. The hadoop-site.xml file that we propagated to the master node as well as to all of the slaves contained the setting:

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.StandardSocketFactory</value>
  <final>true</final>
</property>


The nodes inside the cluster should connect directly to one another — these should not route connections through the SOCKS proxy. So we set the socket factory to the standard one, and made sure that the parameter was marked as final so that jobs couldn’t override it.

One “gotcha” of Hadoop is that the HDFS instance has a canonical name associated with it, based on the DNS name of the machine, not its IP address. If you provide an IP address for fs.default.name, it will reverse-DNS this back to a DNS name, and subsequent connections will perform a forward-DNS lookup on the canonical DNS name (and will not forward the DNS lookups through the SOCKS proxy). For users outside our firewall, this means trouble. So even though server1.cloudera.com has a private IP address, we created a public DNS “A” record for it on our public DNS server, pointing to the private IP address. The upshot is that if you’ve got a connection to the gateway, you’ll be able to transparently use the master server’s JobTracker and NameNode instances as though they were on your network. If not, you’re out of luck. Much more secure!
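For illustration, this is just an ordinary “A” record whose target happens to be a private address. In a BIND-style zone file for cloudera.com it would look something like the entry below (the IP address shown is a placeholder, not our actual assignment):

; public zone entry resolving the master's name to its private address
server1    IN    A    10.1.130.11

Anyone can resolve the name, but only hosts with a route into the 10.x.x.x network (i.e., through the gateway’s SOCKS tunnel) can actually reach the address it points to.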

We also found that if a user mistakenly tried to run bin/start-dfs.sh (or start-all, etc) on their own machine, they would start a DataNode which would connect to our NameNode. This caused a lot of seemingly-inexplicable HDFS access and stability issues. The solution? Control which machines can register as DataNodes. We added a hosts file to HDFS which provides a whitelist of machines which can act as DataNodes. In hadoop-site.xml, we added the setting:

<property>
  <name>dfs.hosts</name>
  <value>/usr/local/hadoop/conf/hosts</value>
</property>


The dfs.hosts setting points to a file containing the DNS names of our desired DataNodes, one name per line. After we restarted HDFS with that file in place, the errant user’s DataNode went away and the issues were resolved.
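The hosts file itself is nothing more than a plain list of hostnames; a sketch, with placeholder names standing in for our actual slaves:

server2.cloudera.com
server3.cloudera.com
server4.cloudera.com

Depending on your Hadoop version, running hadoop dfsadmin -refreshNodes may also make the NameNode re-read this file without a full restart.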

Client Usage Instructions

We then created user accounts on the gateway machine for all our hackathon participants. For them to use the Hadoop cluster, they had to open an SSH session to the gateway machine with a dynamic port-forwarding tunnel. This isn’t as hard as it sounds; it’s accomplished by:

you@localhost:~$ ssh -D 2600 clouderaUsername@gateway.cloudera.com 


The -D 2600 instructs SSH to open a SOCKS proxy on local port 2600. The SOCKS-based SocketFactory in Hadoop will then create connections forwarded over this SOCKS proxy. After this connection is established, you can minimize the ssh session and forget about it. Then just run Hadoop jobs in another terminal the normal way:

you@localhost:~$ $HADOOP_HOME/bin/hadoop fs -ls /
you@localhost:~$ $HADOOP_HOME/bin/hadoop jar myJarFile.jar myMainClass etc...
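If users would rather not remember the -D flag every time, the same tunnel can be declared once in ~/.ssh/config. A minimal sketch, where the host alias is a placeholder:

Host hadoop-gateway
    HostName gateway.cloudera.com
    User clouderaUsername
    DynamicForward 2600

With that entry in place, ssh hadoop-gateway brings up the same SOCKS proxy on local port 2600.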


FoxyProxy

A disadvantage of putting your nodes behind a firewall is that users can no longer access the web-based status monitoring tools that come with Hadoop. The JobTracker status page on port 50030 and NameNode status page on port 50070 are inaccessible. Configuring a web browser to route all traffic through the SOCKS proxy wouldn’t be a good idea, because then they’d be unable to access any other web sites they might need. By using FoxyProxy, a set of regular expression-based rules can be set up which determine what URLs get forwarded through a proxy, and what URLs are accessed without one.

Users were instructed to download FoxyProxy, a free extension for Firefox. They were told to add rules forwarding http://server*.cloudera.com:*/* and http://10.1.130.*:*/* through a SOCKS proxy at localhost:2600. Now they were able to see the status pages, browse the DFS through their web browser, and access the error logs associated with the Hadoop daemons in case something went wrong with their jobs. This is used in conjunction with the SSH-based tunnel which they opened earlier.
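For a quick command-line check that the tunnel and a status page are reachable, recent versions of curl can route a request through the same SOCKS proxy (this assumes the tunnel on local port 2600 is open, and the hostname matches the examples above):

you@localhost:~$ curl --socks5-hostname localhost:2600 http://server1.cloudera.com:50030/

The --socks5-hostname variant resolves the hostname through the proxy as well, so it works even if the cluster names are not resolvable from your machine.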

Advanced Server Configuration with IPTables

If we had simply wanted to leave our servers with private IPs only, we could have left it at that. But we wanted these servers to also be able to connect out to the rest of the Internet. Doing this requires giving the machines public IP addresses again (or configuring them behind a NAT firewall). How could we ensure that the machines could open outbound connections, but drop all inbound connections? We employed some simple IPTables rules to drop all incoming connections on the ethernet device with the public IP address. The output of /sbin/iptables -L on any of the private-network machines is as follows:

Chain INPUT (policy DROP)
target     prot opt source                                    destination
ACCEPT     all  --  static-internal.reverse.cloudera.com/28   anywhere
DROP       all  --  anywhere                                  anywhere
ACCEPT     all  --  anywhere                                  anywhere
ACCEPT     all  --  anywhere                                  anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source                                    destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source                                    destination
ACCEPT     all  --  anywhere                                  anywhere
ACCEPT     all  --  anywhere                                  anywhere


This tells the machines to accept inbound connections from the /28 where we host our machines, but to drop all other new inbound connections. Outbound connections to any destination are allowed.
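For reference, a rule set roughly equivalent to the listing above can be built with commands along these lines (a sketch only; the subnet is a placeholder for your own public /28, and you would persist the rules with your distribution's usual mechanism):

# default-deny for inbound traffic
iptables -P INPUT DROP
# allow loopback traffic
iptables -A INPUT -i lo -j ACCEPT
# allow inbound connections from our own /28 (placeholder subnet)
iptables -A INPUT -s 192.0.2.0/28 -j ACCEPT
# allow replies to connections the machines initiate themselves
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# forwarding and outbound traffic remain unrestricted
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT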

Future Steps…

We started doing this process a little later than we’d have liked, and learned along the way what did and did not work so well. Configuring FoxyProxy turned out to be a cumbersome exercise for users. (There are a lot of button clicks and windows involved in setting up those relatively straightforward rules.) We’ve also found out that FoxyProxy can export its rules in a “PAC” file, which other users can then import. Newer users of our cluster just download and install this file in their FoxyProxy config. Ultimately, what we’d like best is to install an HTTP Forwarding Proxy on the gateway node, which would allow our users to see the server status and logs without having to install software and configure a dynamic proxy. Currently this is still on the to-do list though.

Hadoop is pretty straightforward to use, but securing it can be tricky. There were a bunch of steps we had to follow above, but we achieved the end goal of imposing few demands on our legitimate users. Overall, this provided us with much better assurances about our cluster. The two major areas where this does not provide security are that users can impersonate one another with respect to the distributed filesystem, and they also all run processes in the same user’s process space — so they could theoretically interfere with other Hadoop user processes, or the Hadoop daemons themselves. But clamping down on who has access to the Hadoop cluster in the first place goes a long way toward keeping a system healthy.

Stay tuned soon, for when we announce the contest winners!


12 Responses
  • Steven Wong / July 14, 2009 / 6:15 PM

    Question: “an HTTP Forwarding Proxy on the gateway node” — are you referring to a reverse proxy?

  • Aaron / August 04, 2009 / 10:31 AM

    yes.

  • Andy / September 10, 2009 / 5:42 PM

    Great post! Very helpful and much appreciated.

    You mention you use hadoop.job.ugi to set the username/group for submitted jobs. Should that configuration also work for basic hdfs operations as well ( e.g., -copyFromLocal )? My quick attempt ( with hadoop 0.20 from a windows vista client ) did not seem to honor hadoop.job.ugi for ‘hadoop dfs’ commands.

  • Andy / September 10, 2009 / 6:01 PM

    Please ignore my previous comment. Wrong configuration parameter was used ( dfs.web.ugi ).

  • Hawksury / August 25, 2010 / 10:37 AM

    Great post, mate. I have a few questions and would much appreciate it if you could help/answer.

    The data I want to process is on a local machine (3 TB in size). I want to use EC2 machines to process this data by submitting Hive queries from a local machine that is outside of Amazon EC2.

    Could you please help or point me to the correct post where the process is explained.

    Many Thanks,
    Hawksury

  • Aaron / August 25, 2010 / 4:47 PM

    Hawksury,

    After you’ve got your cluster configured and have your local system configured to talk to it, you can copy your data up to HDFS, and then operate on it with Hive as usual.

    You may want to watch our training video on how to use Hive: http://www.cloudera.com/videos/introduction_to_hive

    - Aaron

  • Jeff / June 16, 2011 / 11:07 PM

    I think you made a mistake in the post.

    property hadoop.rpc.socket.factory.class.default should been set to org.apache.hadoop.net.SocksSocketFactory rather than org.apache.hadoop.net.StandardSocketFactory

  • vbarat / July 17, 2012 / 4:23 AM

    Great post! That’s definitively what we were looking for.

  • vbarat / July 17, 2012 / 4:36 AM

    Actually, by changing hadoop.job.ugi, you can acquire the rights of any user you want.

    So even if this solution is good for protecting your cluster from unwanted users, it doesn’t protect your granted users from each other (one of them can very easily steal the data of any other user).

    Any solution for that?

  • Todd Lipcon / July 23, 2012 / 5:29 PM

    Hi V Barat,

    Since CDH3, you can configure true security on CDH. The “hadoop.job.ugi” parameter no longer exists. If a cluster has been secured, it uses Kerberos to provide strong authentication, and then POSIX permissions for authorization.

    Please see the CDH3 or CDH4 Security Guide documentation for details.

    -Todd

  • Jeff / April 25, 2013 / 1:39 AM

    If I use Kerberos authentication, how can I use a SOCKS proxy to connect to Hadoop? Is there any tutorial for that? Thanks
