Improvements in the Hadoop YARN Fair Scheduler
Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” what its new features are, and how to get them running on your cluster.
YARN/MR2 vs. MR1
YARN uses updated terminology to reflect that it no longer manages resources just for MapReduce. From YARN’s perspective, a MapReduce job is an application, and YARN schedules containers for map and reduce tasks to live in. What the MR1 Fair Scheduler called pools are now called queues, for consistency with the Capacity Scheduler. An excellent and deeper explanation is available here.
How Does it Work?
How a Hadoop scheduler functions can often be confusing, so we’ll start with a short overview of what the Fair Scheduler does and how it works.
A Hadoop scheduler is responsible for deciding which tasks get to run where and when to run them. The Fair Scheduler, originally developed at Facebook, seeks to promote fairness between schedulable entities by awarding free space to those that are the most underserved. (Cloudera recommends the Fair Scheduler for its wide set of features and ease of use, and Cloudera Manager sets it as the default. More than 95% of Cloudera’s customers use it.)
In Hadoop, the scheduler is a pluggable piece of code that lives inside the ResourceManager (the JobTracker, in MR1), the central execution-managing service. The ResourceManager constantly receives updates from the NodeManagers that sit on each node in the cluster, which say, “What’s up, here are all the tasks I was running that just completed, do you have any work for me?” The ResourceManager passes these updates to the scheduler, and the scheduler then decides what new tasks, if any, to assign to that node.
How does the scheduler decide? For the Fair Scheduler, it’s simple: every application belongs to a “queue”, and we give a container to the queue that has the fewest resources allocated to it right now. Within that queue, we offer it to the application that has the fewest resources allocated to it right now. The Fair Scheduler supports a number of features that modify this a little, like weights on queues, minimum shares, maximum shares, and FIFO policy within queues, but the basic idea remains the same.
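The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the basic idea, not the scheduler's actual code: the `App` and `Queue` classes, the queue names, and the memory figures are all hypothetical, and weights, minimum shares, and maximum shares are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    name: str
    allocated_mb: int = 0  # memory currently held by this application


@dataclass
class Queue:
    name: str
    apps: list = field(default_factory=list)

    @property
    def allocated_mb(self) -> int:
        # A queue's usage is the sum of its applications' usage.
        return sum(app.allocated_mb for app in self.apps)


def pick_app(queues):
    """Return the (queue, app) pair that should get the next free container."""
    candidates = [q for q in queues if q.apps]  # skip queues with no apps
    queue = min(candidates, key=lambda q: q.allocated_mb)
    app = min(queue.apps, key=lambda a: a.allocated_mb)
    return queue, app


# Hypothetical cluster state: two queues, two applications each.
queues = [
    Queue("etl", [App("job-1", 4096), App("job-2", 1024)]),
    Queue("adhoc", [App("job-3", 2048), App("job-4", 512)]),
]
queue, app = pick_app(queues)
print(queue.name, app.name)  # adhoc job-4
```

Here the "adhoc" queue holds 2560MB against 5120MB for "etl", so it is offered the container, and within it "job-4" holds the least.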
In MR1, the Fair Scheduler was purely a MapReduce scheduler. If you wanted to run multiple parallel computation frameworks on the same cluster, you would have to statically partition resources — or cross your fingers and hope that the resources given to a MapReduce job wouldn’t also be given to something else by that framework’s scheduler, causing OSes to thrash. With YARN, the same scheduler can manage resources for different applications on the same cluster, which should allow for more multi-tenancy and a richer, more diverse Hadoop ecosystem.
Scheduling Resources, Not Slots
A big change in the YARN Fair Scheduler is how it defines a “resource”. In MR1, the basic unit of scheduling was the “slot”, an abstraction of a space for a task on a machine in the cluster. Because YARN expects to schedule jobs with heterogeneous task resource requests, it instead allows containers to request variable amounts of memory and schedules based on those. Cluster resources no longer need to be partitioned into map and reduce slots, meaning that a large job can use all the resources in the cluster in its map phase and then do so again in its reduce phase. This allows for better utilization of the cluster, better treatment of tasks with high resource requests, and more portability of jobs between clusters — a developer no longer needs to worry about a slot meaning different things on different clusters; rather, they can request concrete resources to satisfy their jobs’ needs. Additionally, work is being done (YARN-326) that will allow the Fair Scheduler to schedule based on CPU requirements and availability as well.
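A toy calculation makes the utilization argument concrete. All numbers below are assumptions chosen for illustration (an 8GB node, a static split into five map slots and three reduce slots, and 1GB tasks); none of them come from a real cluster configuration.

```python
NODE_MB = 8192    # assumed memory on one node
MAP_SLOTS = 5     # assumed static MR1 partition of that node
REDUCE_SLOTS = 3
TASK_MB = 1024    # assumed memory request per task


def mr1_concurrent_tasks(phase):
    """Slot model: a task can only run in a slot of its own type."""
    return MAP_SLOTS if phase == "map" else REDUCE_SLOTS


def yarn_concurrent_tasks(task_mb):
    """Memory model: any task can use any free memory on the node."""
    return NODE_MB // task_mb


for phase in ("map", "reduce"):
    print(phase, mr1_concurrent_tasks(phase), yarn_concurrent_tasks(TASK_MB))
# map 5 8
# reduce 3 8
```

Under these assumptions, the slot model leaves three slots idle during the map phase and five during the reduce phase, while the memory model lets a single large job use the whole node in both phases.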
An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of reserved containers. Imagine two jobs are running that each have enough tasks to saturate more than the entire cluster. One job wants each of its mappers to get 1GB, and another job wants its mappers to get 2GB. Suppose the first job starts and fills up the entire cluster. Whenever one of its tasks finishes, it will leave 1GB of space open. Even though the second job deserves the space, a naive policy will give it to the first one because it’s the only job with tasks that fit. This could cause the second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered to an application, if the application cannot immediately use it, it reserves it, and no other application can be allocated a container on that node until the reservation is fulfilled. Each node may have only one reserved container. The total reserved memory amount is reported in the ResourceManager UI. A high number means that it may take longer for new jobs to get space.
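The reservation rule can be sketched with the two-job example above. This is a simplified model under stated assumptions: the `Node` class and `offer` function are hypothetical names, a single node stands in for the cluster, and job "A" requests 1GB containers while job "B" requests 2GB containers.

```python
class Node:
    def __init__(self, free_mb):
        self.free_mb = free_mb
        self.reserved_for = None  # at most one reservation per node


def offer(node, app_name, request_mb):
    """Offer a node's free space to an app; return True if a container starts."""
    if node.reserved_for not in (None, app_name):
        return False  # node is being held for another application
    if node.free_mb >= request_mb:
        node.free_mb -= request_mb
        node.reserved_for = None  # any reservation is now fulfilled
        return True
    node.reserved_for = app_name  # does not fit yet, so reserve the node
    return False


node = Node(free_mb=1024)            # one of A's 1GB tasks just finished
log = []
log.append(offer(node, "B", 2048))   # B deserves the space but does not fit
log.append(offer(node, "A", 1024))   # A would fit, but the node is reserved
node.free_mb += 1024                 # another of A's 1GB tasks finishes
log.append(offer(node, "B", 2048))   # B now fits and the reservation clears
print(log, node.reserved_for, node.free_mb)  # [False, False, True] None 0
```

Without the `reserved_for` check, the second offer would succeed and job A would keep reclaiming every 1GB gap, which is exactly the starvation scenario described above.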
Perhaps the most significant addition to the Fair Scheduler in YARN is support for hierarchical queues. Queues may now be nested inside other queues, with each queue splitting the resources allotted to it among its subqueues. Queues are most often used to delineate organizational boundaries, and nesting now allows a topology that better reflects organizational hierarchies. We can say that the Engineering and Marketing departments both get half the cluster, and then each may split its half among sub-organizations or functions. Another common use of queues is to divide jobs by their characteristics: one queue might house long-running jobs with high resource requirements, while another carves out a fast lane for time-sensitive queries. Now, we can put fast/slow lanes under every team, allowing us to concurrently account for inter-organizational fairness and intra-organizational performance requirements.
From a technical standpoint, queue hierarchies define the procedure for assigning tasks to resources when they become available. All queues descend from a queue named “root”. When resources become available they are assigned to child queues of the root queue according to the typical fair scheduling policy. Then, these children distribute the resources assigned to them to their children with the same policy. When calculating current allocations for fairness, all the applications in all subqueues of a queue are considered. Applications may only be scheduled on leaf queues.
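The top-down assignment described above can be sketched as a walk from the root queue to a leaf. This is a minimal illustration, assuming a plain "least usage wins" comparison at every level; the `Queue` class, the `pick_leaf` helper, and the usage figures are hypothetical, and weights and minimum or maximum shares are ignored.

```python
class Queue:
    def __init__(self, name, children=(), app_usage_mb=()):
        self.name = name
        self.children = list(children)
        self.app_usage_mb = list(app_usage_mb)  # only leaf queues hold apps

    def usage_mb(self):
        # A queue's usage counts every application in every subqueue.
        return sum(self.app_usage_mb) + sum(c.usage_mb() for c in self.children)


def pick_leaf(queue):
    """Walk down from a queue, always choosing the least-served child."""
    path = [queue.name]
    while queue.children:
        queue = min(queue.children, key=lambda c: c.usage_mb())
        path.append(queue.name)
    return ".".join(path)


# Hypothetical usage, in MB, for a two-department hierarchy.
root = Queue("root", [
    Queue("Marketing", [
        Queue("WebsiteLogsETL", app_usage_mb=[2048]),
        Queue("CustomerDataAnalysis", [
            Queue("FastLane", app_usage_mb=[1024]),
            Queue("Regular", app_usage_mb=[1024]),
        ]),
    ]),
    Queue("Engineering", [
        Queue("ImportantProduct", app_usage_mb=[2048]),
        Queue("UnimportantProduct", app_usage_mb=[512]),
    ]),
])
print(pick_leaf(root))  # root.Engineering.UnimportantProduct
```

Marketing holds 4096MB in total and Engineering 2560MB, so the free space goes to Engineering first, and then to whichever of its children is using the least.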
Configuring hierarchical queues is simple: Queues can be specified as children of other queues by placing them as sub-elements of their parents in the Fair Scheduler allocations file.
The following is an example allocations file (fair-scheduler.xml) that splits resources first between the high-level divisions, and second, between their teams:
<allocations>
  <queue name="Marketing">
    <minShare>8192</minShare>
    <queue name="WebsiteLogsETL">
      <weight>1.0</weight>
    </queue>
    <queue name="CustomerDataAnalysis">
      <weight>2.0</weight>
      <queue name="FastLane">
        <weight>3.0</weight>
        <maxShare>4096</maxShare>
      </queue>
      <queue name="Regular">
      </queue>
    </queue>
  </queue>
  <queue name="Engineering">
    <queue name="ImportantProduct">
      <weight>2.0</weight>
    </queue>
    <queue name="UnimportantProduct">
      <weight>1.0</weight>
    </queue>
  </queue>
</allocations>
We’ve given Marketing a minimum share of 8192MB, meaning that tasks from Marketing jobs will get first priority until Marketing is using at least 8192MB. We’ve given the CustomerDataAnalysis team a fast lane that will get 3MB for every 1MB that its regular queue gets but can never hold more than 4096MB. A queue’s name is now prefixed with the names of its parent queues, so the full name of the CustomerDataAnalysis team’s fast lane is root.Marketing.CustomerDataAnalysis.FastLane. For easier use, we can omit the root queue prefix when referring to a queue, so we could submit an application to it as Marketing.CustomerDataAnalysis.FastLane.
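The queue-naming rule can be checked with a short Python sketch that walks an allocations file and builds each queue's full name. It uses a trimmed copy of the example file, the `full_names` helper is a hypothetical name, and it assumes that a queue with no weight element has a weight of 1.0, which is what the 3-to-1 ratio described above implies for the Regular queue.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example allocations file.
ALLOCATIONS = """
<allocations>
  <queue name="Marketing">
    <queue name="CustomerDataAnalysis">
      <queue name="FastLane"><weight>3.0</weight></queue>
      <queue name="Regular"/>
    </queue>
  </queue>
</allocations>
"""


def full_names(element, prefix="root"):
    """Yield (full name, weight) for every queue nested under element."""
    for q in element.findall("queue"):  # direct children only
        name = f"{prefix}.{q.get('name')}"
        # Assumption: a missing <weight> element means a weight of 1.0.
        weight = float(q.findtext("weight", default="1.0"))
        yield name, weight
        yield from full_names(q, name)


queues = dict(full_names(ET.fromstring(ALLOCATIONS)))
fast = "root.Marketing.CustomerDataAnalysis.FastLane"
regular = "root.Marketing.CustomerDataAnalysis.Regular"

short_name = fast.split(".", 1)[1]  # drop the leading "root." prefix
fast_fraction = queues[fast] / (queues[fast] + queues[regular])
print(short_name, fast_fraction)  # Marketing.CustomerDataAnalysis.FastLane 0.75
```

With weights of 3.0 and 1.0, the fast lane is entitled to three quarters of whatever CustomerDataAnalysis receives, subject to the fast lane's maximum share.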
What’s Gone in the YARN Fair Scheduler?
The YARN Fair Scheduler no longer supports moving applications between queues. Applications also may no longer be given priorities within a leaf queue. Both of these features will likely be added back in the future.
The improvements described here are just a start; a number of additional features are planned for the Fair Scheduler. With multi-resource scheduling, fairness will be determined with respect to CPU usage as well as memory usage, and each queue in the hierarchy will be able to have a custom scheduling policy. Gang scheduling will allow YARN applications to request “gangs” of containers that will all be made available at the same time. Guaranteed shares will allow queues to have resources sectioned off for only their use – unlike minimum shares, other queues will not be able to occupy them, even if the queue is not using them.
Look forward to another post describing these features when they arrive!
Sandy Ryza is a Software Engineer on the Platform team and a contributor to Hadoop/MapReduce/YARN.