A common question on the Apache Hadoop mailing lists is what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.
When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.
Availability is the proportion of time a system is functioning , which is commonly referred to as “uptime” (vs downtime, when the system is not functioning).
Note that availability is a stricter requirement than fault tolerance – the ability for a system to perform as designed and degrade gracefully in the presence of failures. A system that requires an hour to restart (eg for a configuration change or software upgrade) but has no single point of failure is fault tolerant but not highly available (HA). Adding redundancy in all SPOFs is a common way to improve fault tolerance, which helps , but is just a part of, improving Hadoop availability. Note also that fault tolerance is distinct from durability, even though the NameNode is a SPOF no single failure results in data loss as copies of NameNode persistent state (the image and edit log) are replicated both within and across hosts.
Availability is also often conflated with reliability. Reliability in distributed systems is a more general issue than availability . A truly reliable distributed system must be highly available, fault tolerant, secure, scalable, and perform predictably, etc. I’ll limit this post to Hadoop availability.
Reasons for downtime
An important part of improving availability and articulating requirements is understanding the causes of downtime. There are many types of failures in distributed systems, ways to classify them, and analyses of how failures result in downtime. Rather than go into depth here, I’ll briefly summarize some general categories of issues that may cause downtime:
1. Maintenance – Hardware and software may need to be upgraded, configuration changes may require a system restart, and operational tasks for dependent systems. Hadoop can handle most maintenance to slave hosts without downtime; however maintenance to a master host normally requires a restart of the entire system.
2. Hardware failures – Hosts and their connections may fail. Without redundant devices, or redundant components within devices, a single component failure may cause the entire device to fail. Hadoop can tolerate hardware failures (even silent failures like corruption) to slave hosts without downtime; however some hardware failures on the master host (or a failure in the connection between the master and the majority of the slaves) can cause system downtime .
3. Software failures – Software bugs may cause a component in the system to stop functioning or require a restart. For example, a bug in upgrade code could result in downtime due to data corruption. A dependent software component may become unavailable (eg the Java garbage collector enters a stop-the-world phase). Hadoop can tolerate some software bugs without downtime; however components are generally designed to fail-fast – to stop and notify other components of failure rather than attempt to continue a possibly-flawed process. Therefore a software bug in a master service will likely cause downtime.
4. Operator errors – People make mistakes. From disconnecting the wrong cable, to mis-configured hosts, to typos in configuration files, operator errors can cause downtime. Hadoop attempts to limit operator error by simplifying administration, validating its configuration, and providing useful messages in logs and UI components; however operator mistakes may still cause downtime.
In order for a system to be highly available, its design needs to anticipate these various failures. Removing single points of failure, enabling rolling upgrades, faster restarts, making the system robust and user friendly, etc are all necessary to improve availability. Given that improving availability requires a multi-prong approach, let’s take a look at the relevant use cases for limiting downtime.
1. Host maintenance – If an operator needs to upgrade or replace the primary host hardware or upgrade its operating system, they should be able to manually fail over to a hot standby, perform the upgrade and optionally fail back to the primary. The fail-over should be transparent to clients accessing the system (eg active jobs continue to run). Host maintenance to slave hosts can be handled without downtime today by de-commissioning the host.
2. Configuration changes – Ideally configuration changes to masters should not require a system restart — the configuration can be updated in-place or fail-over to a hot standby with an updated configuration is supported. In cases when they do, the operator should be able to restart the system with minimal impact to running workloads.
3. Software upgrades – An operator should be able to upgrade Hadoop’s software in-place (a “rolling upgrade”) on slave nodes and via fail-over on the master hosts so there is little or no downtime. If a restart is required it should be accelerated by quickly re-constructing the system’s state.
4. Host failures – If a non-redundant hardware component fails, the operating system crashes, a disk partition runs out of space, etc. the system should detect the failure, and, depending on the service and failure, (a) recover, (b) de-commission itself, or (c) fail over to a hot standby. Hadoop currently tolerates slave host failures without downtime, however master host failures often cause downtime. In practice, for a number of reasons, master hardware failures do not cause as much downtime as you might expect:
- In large clusters it is statistically improbable that a hardware failure impacts a machine running master services, and operations teams are often good at keeping a small number of well-known hosts healthy.
- Because there are few master hosts redundant hardware components can be used to limit the probability of a host failure without dramatically increasing the price of the overall system.
Highly Available Hadoop
A number of efforts are under way to improve Hadoop availability, and implement missing functionality required by the above use cases. Tasks related to HDFS availability are tracked here, tasks related to MapReduce availability are tracked here.
1. Improvements to Hadoop’s failure handling code. Hadoop’s native fault injection framework and other related frameworks continue to make Hadoop more robust in the face of failures. Recent advances in failure testing applied have been successfully applied to Hadoop  to identify software bugs (eg HDFS-1231, HDFS-1225, and HDFS-1228).
4. Work has started to allow Hadoop client and server software of different versions to co-exist, with the goal of enabling in-place Hadoop software upgrades.
5. There have been efforts to make existing releases of HDFS more highly-available, as well as several research prototypes (eg UpRight-HDFS, NameNode Cluster, and HDFS-dnn) that examine HDFS availability. HDFS developers are currently working on a hot standby for the NameNode to improve on the existing NameNode fail-over. Like the Google File System which has “shadow masters”, this allows HDFS to fail the NameNode process from one host to another by actively replicating all NameNode state required to quickly restart the process. Integrating the BackupNode (edits are streamed from the primary NameNode to one or more BackupNodes) or using BookKeeper (a replicated service to reliably log streams of records that can be used for the edits log) with the AvatarNode (which replicates block reports across a primary and backup host) results in a standby NameNode that can be activated if the NameNode fails. Automatic hot fail-over can be achieved by integrating both clients and servers with ZooKeeper. A similar approach has been successfully used by Google to make GFS highly available :
Initially, GFS had no provision for automatic master failover. It was a manual process. Although it didn’t happen a lot, whenever it did, the cell might be down for an hour. Even our initial master-failover implementation required on the order of minutes. Over the past year, however, we’ve taken that down to something on the order of tens of seconds.
This Active/Passive design enables both high availability and evolution towards being a better storage layer for systems like HBase, which in turn could be used to store metadata for a new version of HDFS (similar to Google’s GFS2). Like HDFS federation, this provides an evolutionary path to high scalability without the complexity of modifying HDFS to use multi-master replication.
6. The MapReduce master (JobTracker) state is stored in HDFS and is therefore limited by its availability. The JobTracker can be-restarted, however works needs to be to integrate it with a service like ZooKeeper to handle fail-over to a separate host.
Hopefully this post has helped frame the various tasks behind Hadoop’s march towards high availability in a useful context. The development community understands this is one of the most high priority issues for users, and is looking forward to providing a highly available Hadoop in up-coming releases. Similarly, Cloudera is committed to improving availability in CDH4 – it’s our primary focus for the release.
Thanks to Dhruba Borthakur, Doug Cutting, Sanjay Radia, and Konstantin Shvachko for reading drafts of this post.
 A common way to define availability is the ratio of the expected value of the uptime of the system to the aggregate of the expected values of uptime and downtime. Common metrics used are:
- Mean time between failures (MTBF) – the expected time between failures of a system during operation.
- Mean time to recovery (MTTR) – the average time the system will take to recover from any failure.
Using these metrics availability can be defined as MTBF / (MTBF + MTTR).
 I say “helps” because one the most common reasons for downtime (misconfiguration, operator error, and software bugs) are all exacerbated by system complexity, and making systems more fault tolerant often increases their complexity.
 “Reliability and availability are different: Availability is doing the right thing within the specified response time. Reliability is not doing the wrong thing.” from WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT? by Jim Gray
 People often configure master hosts with redundant hardware components (nics, disks, IO controllers, and power units) so that an individual component failure does not cause the system to fail.
 Towards Automatically Checking Thousands of Failures with Micro-specifications. H Gunawi, T Do, et al. UC Berkeley TR EECS-2010-98.
 GFS: Evolution on Fast-Forward. Marshall Kirk McKusick, Sean Quinlan in the ACM Queue.