(guest blog post by Matei Zaharia)
When Apache Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users. The benefits of sharing are tremendous: with all the data in one place, users can run queries that they may never have been able to execute otherwise, and costs go down because a shared cluster achieves higher utilization than a separate Hadoop cluster built for each group. However, sharing requires support from the Hadoop job scheduler to provide guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users.
This July, the scheduler in Hadoop became a pluggable component and opened the door for innovation in this space. The result was two schedulers for multi-user workloads: the Fair Scheduler, developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
- Jobs are placed into named “pools” based on a configurable attribute such as user name, Unix group, or specifically tagging a job as being in a particular pool through its jobconf.
- Each pool can have a “guaranteed capacity” that is specified through a config file, which gives a minimum number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets at least this many slots, but if it has no jobs, the slots can be used by other pools.
- Excess capacity that is not going toward a pool’s minimum is allocated between jobs using fair sharing. Fair sharing ensures that over time, each job receives roughly the same amount of resources. This means that shorter jobs will finish quickly, while longer jobs are guaranteed not to get starved.
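The three concepts above can be sketched as a two-step allocation for one slot type (map or reduce). This is a simplified illustration, not the Fair Scheduler's actual code: the names `Job`, `hand_out`, and `allocate_slots` are hypothetical, each job's `demand` is assumed to be known up front, and "fair sharing" is approximated by always giving the next slot to the job that currently holds the fewest.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pool: str
    demand: int  # tasks the job could run right now (assumed known)


def hand_out(candidates, alloc, budget):
    """Give up to `budget` slots, one at a time, to the job holding the fewest.

    Jobs never receive more slots than their demand. Returns slots given.
    """
    given = 0
    while given < budget:
        hungry = [j for j in candidates if alloc[j.name] < j.demand]
        if not hungry:
            break  # nobody can use another slot
        neediest = min(hungry, key=lambda j: alloc[j.name])
        alloc[neediest.name] += 1
        given += 1
    return given


def allocate_slots(jobs, pool_minimums, total_slots):
    """Return {job name: slots} for a cluster with `total_slots` slots."""
    alloc = {job.name: 0 for job in jobs}
    remaining = total_slots

    # Step 1: each pool with pending jobs gets up to its guaranteed minimum.
    # A pool with no jobs uses nothing, so its slots stay available to others.
    for pool, minimum in pool_minimums.items():
        pool_jobs = [j for j in jobs if j.pool == pool]
        remaining -= hand_out(pool_jobs, alloc, min(minimum, remaining))

    # Step 2: excess capacity is shared between all jobs, smallest holder first.
    hand_out(jobs, alloc, remaining)
    return alloc


if __name__ == "__main__":
    jobs = [
        Job("prod_report", pool="production", demand=6),
        Job("adhoc_a", pool="analysts", demand=2),
        Job("adhoc_b", pool="analysts", demand=8),
    ]
    print(allocate_slots(jobs, {"production": 4}, total_slots=10))
    # prints {'prod_report': 4, 'adhoc_a': 2, 'adhoc_b': 4}
```

In this example the production pool receives its guaranteed four slots first, the short ad-hoc job gets everything it asked for, and the long ad-hoc job takes the rest rather than being starved.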
The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weigh the shares of different jobs. There is currently no support for preemption of long tasks, but this is being added in HADOOP-4665, which will allow you to set how long each pool will wait before preempting other jobs’ tasks to reach its guaranteed capacity.
The Fair Scheduler has been in production use at Facebook since August. You can find it in the Hadoop trunk code under src/contrib/fairscheduler, and there are also versions of the scheduler for Hadoop 0.17 and Hadoop 0.18 on its JIRA page. All of these versions come with a README file, placed under src/contrib/fairscheduler, that explains how to set up the scheduler.
The Capacity Scheduler from Yahoo offers similar functionality to the Fair Scheduler but follows a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect: you can place a limit on the percentage of running tasks per user, so that users share a cluster equally. In other words, the Capacity Scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues’ tasks if it is below its configured capacity. Documentation for the scheduler can be built as described in its README file under src/contrib/capacity-scheduler in the Hadoop trunk SVN.
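The per-queue behavior described above (priority order, then FIFO, with a per-user cap) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the Capacity Scheduler's implementation: `QueuedJob` and `pick_next_job` are hypothetical names, a larger `priority` number is assumed to mean more urgent, and the per-user limit is modeled as a simple hard cap on running tasks.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    user: str
    priority: int       # larger = more urgent (assumed convention)
    submit_order: int   # smaller = submitted earlier
    pending_tasks: int


def pick_next_job(jobs, running_by_user, queue_capacity, user_limit_percent):
    """Return the job that gets the queue's next free slot, or None.

    running_by_user maps a user name to that user's running task count.
    """
    # Hard cap per user, as a percentage of the queue's slots (simplification).
    max_per_user = queue_capacity * user_limit_percent // 100

    # Higher priority first; ties are broken in submission (FIFO) order.
    ordered = sorted(jobs, key=lambda j: (-j.priority, j.submit_order))
    for job in ordered:
        if job.pending_tasks == 0:
            continue  # nothing left to launch for this job
        if running_by_user.get(job.user, 0) >= max_per_user:
            continue  # this user is already at the per-user limit
        return job
    return None


if __name__ == "__main__":
    queue = [
        QueuedJob("etl", user="alice", priority=1, submit_order=0, pending_tasks=5),
        QueuedJob("report", user="bob", priority=1, submit_order=1, pending_tasks=5),
    ]
    # alice already holds 5 of the queue's 10 slots and the limit is 50%,
    # so the next slot goes to bob's job even though alice submitted first.
    chosen = pick_next_job(queue, {"alice": 5}, queue_capacity=10, user_limit_percent=50)
    print(chosen.name)  # prints report
```

Without the per-user check, this reduces to plain FIFO with priorities, which is the contrast with fair sharing between all jobs.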
Now that the Fair Scheduler and Capacity Scheduler are available, there has been increased focus on other aspects of multi-user Hadoop clusters, such as isolating users and improving performance for the short interactive jobs seen in these environments. This has led to some exciting scheduling-related patches you can expect to see in future Hadoop releases:
- HADOOP-4487, which adds a number of security features to isolate users.
- HADOOP-3136, which lets the scheduler launch multiple tasks per heartbeat, improving “ramp-up time”.
- HADOOP-4664, 4513 and 4372, which parallelize job initialization to launch small jobs faster.
- HADOOP-2014, which chooses input blocks from overloaded racks when launching non-local maps.
- HADOOP-3759 and 657, which take into account tasks’ memory and disk space requirements to prevent oversubscribing nodes.
- HADOOP-4667, which improves locality for small jobs in the Fair Scheduler by letting it look at multiple jobs to select a local task.
With the recent progress on scheduling, Hadoop is quickly growing to support the kind of multi-user data warehouse seen at Facebook: short interactive jobs, large batch jobs, and guaranteed-capacity production jobs sharing a cluster and delivering results quickly while maintaining high throughput. With a job scheduler that protects production jobs, users can try interesting R&D experiments on the shared data set and gain valuable insights without worrying about affecting mission-critical jobs.