Avoiding Common Hadoop Administration Issues
It’s easy to get started with Hadoop administration because Linux system administration is a pretty well-known beast, and because systems administrators are used to administering all kinds of existing complex applications. However, there are many common missteps we’re seeing that make us believe there’s a need for some guidance in Hadoop administration. Most of these mistakes come from a lack of understanding about how Hadoop works. Here are just a few of the common issues we find:
Lack of configuration management
It makes sense to start with a small cluster and then to scale out over time as you find initial success and your needs grow. Without a centralized configuration management framework, you end up with a number of issues that can cascade just as your usage picks up. For example, manually ssh-ing and scp-ing files around by hand is a great way to effectively manage a small handful of machines, but as soon as your cluster gets to 5 or more nodes (let alone tens or hundreds, where Hadoop really shines), it becomes very cumbersome to manage and confusing to keep track of which files go (and have gone) where. Also, as the cluster evolves and becomes more heterogeneous, you have different versions of config files to manage, with each version changing over time. This adds a version control requirement to your configuration management. Such a solution might include a parallel shell and other established routines for starting and stopping cluster processes, copying files around the cluster, and making sure cluster configurations are kept in sync.
It hasn’t quite worked so far to normalize this within a Hadoop distribution. Even though it’s a necessary component of a successful Hadoop deployment, there are many different ways to do it. It’s important not to overlook. Cloudera can help you to determine how Hadoop fits into your existing configuration management framework, we can make recommendations on how to accommodate Hadoop, and down the road we’ll be addressing this area with better software.
Poor allocation of resources
One of the most common questions we get is, how many map slots and reduce slots should we allocate on a given machine? If the administrator puts the wrong values here, there are a number of unfortunate potential consequences. You might see excessive swapping, long-running tasks, out-of-memory errors, or task failures. The number of “slots” on a machine seems like a simple enough configuration to tweak, but the optimal value is subject to many different factors, such as CPU power, disk capacity, network speed, application design, and the other processes that are sharing a node.
For customers with CPU-intensive MapReduce application, we encourage our customers to guess high values and then tweak down as we notice how resources max out. But with an I/O intensive-application using HBase, we tell customers to start with a smaller allocation of slots and then tweak up as you can. There’s no hard and fast rule and it’s important to work with your existing jobs. Using your specific application, you should be able to iterate, tweak, compare, and repeat, asking yourself at every step of the way: does this help or hurt? With a new cluster, you can use manufactured benchmarks (TestDFSIO, teragen / sort) but there’s no point in tweaking if you have nothing to which you can compare. If you don’t have an understanding of how these factors affect each other, in the context of your use cases, you’re going to spend too much time tweaking variables at random, you’re not going to be able to compare methodically, you might not see the performance you expect, and it might sour your whole experience of Hadoop.
Lack of a dedicated network
While Hadoop doesn’t require a dedicated network to install and to run, it definitely requires one in order to run as designed and perform well. Data locality is central to the design of HDFS and MapReduce. What’s more, the shuffle-sort operation between the map and reduce phases of a job causes a good chunk of network traffic which can adversely affect and be affected by constraints on a shared network. Without a dedicated network, Hadoop has no way of determining how to allocate data blocks or where to schedule tasks to maximize network locality and ensure fault tolerance. The picture becomes more complicated as you add components such as HBase, which gets unhappy when it can’t access HDFS data due to a slow network. It’s easy to take the network architecture for granted, and equally easy to assume network architecture doesn’t matter. Rather than scatter a Hadoop cluster across a data center without thought to rack awareness or data locality considerations, it’s better to get it right the first time with our assistance.
When we discuss “a dedicated network”, we’re referring to dedicated network hardware. Your Hadoop cluster deserves a dedicated switch (or set of switches) corresponding to a dedicated rack (or set of racks). This ensures optimal performance when nodes communicate with one another, and optimal power allocation such that a power circuit failure is equivalent to a switch failure.
It’s hard for any administrator to know when to virtualize and abstract, and when to think about the bare metal. When you apply expertise from Cloudera, you can be sure you’re getting optimal understanding of both.
Lack of monitoring and metrics
Aside from the web UI, Hadoop doesn’t provide much in the way of built-in monitoring, so it’s tempting to think this isn’t that important. However, it’s critical to roll out some solution for capturing Hadoop’s metrics as well as monitoring and alerting the general OS and network health of the cluster. Tools like Ganglia and Nagios give you the ability to see the health of the entire cluster in a single dashboard, and be alerted when a health value exceeds a threshold, respectively. Without these monitoring tools in place, it becomes very difficult to determine what’s going on when something goes awry in a cluster of any considerable size. One very simple real world example from a recent consulting engagement: the Hadoop start scripts drop a pid file in a directory and report “OK” when they start. In this case, the processes encountered a simple permissions problem during startup, and exited gracefully, leaving the inexperienced administrator to wonder (some time later) when the process died and what killed it, even though it never really started. We’ve also seen occurrences where the local disk of one or a few nodes in a cluster fills up, leading to what appears as a flaky cluster, with sporadic task failures. A monitoring/alert process tells you about these problems before they became costly. And it’s important to monitor for both performance and health, not either/or.
Ignorance of what log files contain what information
You can read all the documentation you want, but knowing which log file to examine and what to look for when something goes wrong is a skill that only comes with a lot of experience with Hadoop. Customers often wonder where their stdout/stderr streams went when they deploy their first jobs on a large cluster. They’ll also ask us what caused a task to run longer than usual, or what caused it to fail when it hasn’t before. Most of this information is logged, but it’s hard to know where to look. Also, when you have enough experience, you start to recognize common patterns in log files for cross-process interactions like a busy data node causing HBase to crash or a bad network switch causing a map task to appear to time out. Our philosophy is to help people get up to speed as quickly as possible. We blog (http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/), train (http://www.cloudera.com/hadoop-training/) and of course provide individualized services (http://www.cloudera.com/hadoop-services/).
Drastic measures to address simple problems
When one node experiences a problem with one Hadoop process, it’s tempting to take drastic measures, such as killing additional processes on other nodes or taking entire nodes offline until you’ve killed the cluster. If you experience a problem with a data node or a group of data nodes, it’s tempting to assume that you’ve lost data and try to restore your filesystem from a namenode snapshot. If you’re unaware or untrusting that Hadoop’s replication ensures that you probaby lost no data, you’re tempted to take a measure that assures that you will lose data: restoring from an older snapshot. If you are experiencing task failures or disk errors due to log file proliferation, it’s tempting to assume you’re working with unstable software, wipe the cluster clean and start over. Our customers have made all of these mistakes, and we’ve worked with them to set things straight. When complicated distributed systems go wrong, it’s easy to react too drastically. With proper troubleshooting and a good run book from us, you can get back and going with less pain.
Inadvertent introduction of single points of failure
Hadoop is designed to be a highly redundant distributed system. But as you deploy it, sometimes, it’s difficult to keep it that way. For example, a common misunderstanding of Hadoop is that because it’s the only source of filesystem metadata, the namenode represents a single point of failure. This is only the case if you don’t configure the namenode to store its metadata in multiple places, preferably over an NFS store on the network. Other single-points of failure introduced by erroneous setups include: Having $HADOOP_HOME be an NFS mount (causing a distributed denial of service attack on the NFS server), running an HBase Zookeeper on the same node as an HBase region server, or non-existence or improperly configured domain name service (DNS) in the cluster.
Over reliance on defaults
If you buy a machine with a lot of disk and RAID support, you might not be aware that RAID is turned on by default, though it is not recommended for use with Hadoop. Likewise, if you take the defaults of a Hadoop configuration, you might take a default partitioning scheme that doesn’t apply to your disk allocation, or a dfs.name.dir setting that makes no sense whatsoever. We saw one cluster with an entire disk dedicated for log files and temp files, but because the defaults didn’t get changed in the configuration files, the disk was going unutilized while other disks were filling up.
Unlike a lot of applications, you want to be very wary of Hadoop defaults, and give a lot of thought to key configuration settings that are available to you. Because Hadoop has a hierarchical configuration system, where the user provides files that overlay instead of editing existing ones, it’s easy to overlook or forget about defaults that affect you. Hadoop has a lot of knobs, and in some cases the defaults are OK, but our customers commonly hit issues when they accept a default they shouldn’t have. We’ve also blogged about which configuration details you should pay attention to (http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
Cloudera can help now and in the future. From day one we’ve been putting a lot of work into CDH to make it easier to administer (http://blog.cloudera.com/blog/2009/04/clouderas-distribution-for-hadoop-making-hadoop-easier-for-a-sysadmin/). If you haven’t taken a look at Cloudera Enterprise, we’re building tools on top make Hadoop more accessible and simplify administration and maintenance of the cluster as a whole. Having a partner to help you manage the logistics of deployment, administration and maintenance will save you time, money and headaches.