Fair Scheduler to Capacity Scheduler conversion tool

Introduction

In Apache Hadoop YARN 3.x (YARN for short), switching to Capacity Scheduler has considerable benefits and only a few drawbacks. To bring these benefits to users who are currently using Fair Scheduler, we created a tool together with the upstream YARN community to ease the migration process.

Why switch to Capacity Scheduler

What can we gain by switching to Capacity Scheduler? A couple of examples:

  • Scheduling throughput improvements
    • Look at several nodes at one time
    • Fine-grained locks
    • Multiple allocation threads
    • 5-10x throughput gains
  • Node partitioning and labeling
  • Affinity and anti-affinity: run application X only on nodes that are running application Y (affinity), or make sure that applications X and Y never run on the same node (anti-affinity).
  • Scheduler and application activities: diagnostic messages about important scheduling decisions, which can be recorded and exposed via a RESTful API.

Also, with the release of CDP, Cloudera’s vision is to support Capacity Scheduler as the default scheduler for YARN and to phase out Fair Scheduler. Supporting two schedulers simultaneously poses problems: not only does it require more support and engineering effort, but it also needs extra testing and more complicated test cases and test suites because of the feature gaps.

After a long and careful analysis, we decided to choose Capacity Scheduler as the default scheduler. We put together a document that compares the features of Capacity Scheduler and Fair Scheduler under YARN-9698.

If you are new to this topic, you can get familiar with Capacity Scheduler by reading the following article: YARN – The Capacity Scheduler

What will be covered in this blog post

In this blog post, we will:

  • Introduce the Fair Scheduler -> Capacity Scheduler conversion tool
  • Describe how it works internally
  • Explain the command line switches
  • Present examples of how to use the tool
  • Explain the limitations of the tool, because a 100% automatic conversion from Fair Scheduler to Capacity Scheduler is not yet possible
  • Talk about future plans

Please note that although we tested the tool with various Fair Scheduler and YARN site configurations, it’s a new addition to Apache Hadoop. Manual inspection and review of the generated output files are highly recommended.

Introducing the fs2cs converter tool

The converter itself is a CLI application that is part of the yarn command. To invoke the tool, you need to use the yarn fs2cs command with various command-line arguments. 

The tool generates two files as output: a capacity-scheduler.xml and a yarn-site.xml. Note that the site configuration is just a delta: it contains only the new settings for the Capacity Scheduler, meaning that you have to manually copy-paste these values into your existing site configuration. Leaving the existing Fair Scheduler properties in place is unlikely to cause any harm or malfunction, but we recommend removing them to avoid confusion.

The generated properties can also go to the standard output instead of the aforementioned files.

The tool is officially part of the CDH to CDP upgrade, which is documented here.

Using the converter from the command line

The basic usage is:

yarn fs2cs -y /path/to/yarn-site.xml [-f /path/to/fair-scheduler.xml] {-o /output/path/ | -p} [-t] [-s] [-d]

Switches listed between square brackets [] are optional. Curly braces {} denote a mandatory choice: you have to pick exactly one of the enclosed switches.

You can also use the long versions of the switches:

yarn fs2cs --yarnsiteconfig /path/to/yarn-site.xml [--fsconfig /path/to/fair-scheduler.xml] {--output-directory /output/path/ | --print} [--no-terminal-rule-check] [--skip-verification] [--dry-run]

Example:

yarn fs2cs --yarnsiteconfig /home/hadoop/yarn-site.xml --fsconfig /home/hadoop/fair-scheduler.xml --output-directory /tmp

Important: always use an absolute path for -f / --fsconfig.

For a list of all command-line switches with explanations, you can use yarn fs2cs --help. The CLI options are listed in this document.

Step-by-step example of using fs2cs 

Let’s see a short demonstration of the tool. 

Existing config

Suppose we have the following simple fair-scheduler.xml:

<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <maxChildResources>memory-mb=8192, vcores=1</maxChildResources>
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
    </queue>
    <queuePlacementPolicy>
        <rule name="specified" create="true"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
        <rule name="default"/>
    </queuePlacementPolicy>
</allocations>


We also have the following entries in yarn-site.xml (listing only those which are related to Fair Scheduler):

yarn.scheduler.fair.allow-undeclared-pools = true
yarn.scheduler.fair.user-as-default-queue = true
yarn.scheduler.fair.preemption = false
yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.8
yarn.scheduler.fair.sizebasedweight = false
yarn.scheduler.fair.assignmultiple = true
yarn.scheduler.fair.dynamicmaxassign = true
yarn.scheduler.fair.maxassign = -1
yarn.scheduler.fair.continuous-scheduling-enabled = false
yarn.scheduler.fair.locality-delay-node-ms = 2000

Run the fs2cs converter

Let’s run the converter for these files:

~$ yarn fs2cs -y /home/examples/yarn-site.xml -f /home/examples/fair-scheduler.xml -o /tmp

2020-05-05 14:22:41,384 INFO  [main] converter.FSConfigToCSConfigConverter (FSConfigToCSConfigConverter.java:prepareOutputFiles(138)) - Output directory for yarn-site.xml and capacity-scheduler.xml is: /tmp

2020-05-05 14:22:41,388 INFO  [main] converter.FSConfigToCSConfigConverter (FSConfigToCSConfigConverter.java:loadConversionRules(177)) - Conversion rules file is not defined, using default conversion config!

[...] output trimmed for brevity

2020-05-05 14:22:42,572 ERROR [main] converter.FSConfigToCSConfigConverterMain (MarkerIgnoringBase.java:error(159)) - Error while starting FS configuration conversion!

[...] output trimmed for brevity

Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Rules after rule 2 in queue placement policy can never be reached
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.updateRuleSet(QueuePlacementPolicy.java:115)

[...]

This is a very typical error. If we look at the placement rules in fair-scheduler.xml, we can see that a default rule comes after nestedUserQueue. We need to use the --no-terminal-rule-check switch to make the tool ignore the terminal rule check inside Fair Scheduler. Why? See the section below.

By default, Fair Scheduler performs a strict check of whether a placement rule is terminal or not. This means that a <reject> rule followed by a <specified> rule is not allowed, since the latter is unreachable. However, before YARN-8967 (Change FairScheduler to use PlacementRule interface), Fair Scheduler was more lenient and allowed certain sequences of rules that are no longer valid. Internally, the tool instantiates a Fair Scheduler to read and parse the allocation file. In order to have Fair Scheduler accept such configurations, the -t or --no-terminal-rule-check argument must be supplied to suppress the exception thrown by Fair Scheduler. In CDH 5.x, these kinds of placement configurations are common, so it’s recommended to always use -t.

Run the tool again with --no-terminal-rule-check

~$ yarn fs2cs -y /home/examples/yarn-site.xml -f /home/examples/fair-scheduler.xml -o /tmp --no-terminal-rule-check


2020-05-05 14:41:39,189 INFO  [main] capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(384)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:1024, vCores:1>>, maximumAllocation=<<memory:8192, vCores:4>>, asynchronousScheduling=false, asyncScheduleInterval=5ms,multiNodePlacementEnabled=false

2020-05-05 14:41:39,190 INFO  [main] converter.ConvertedConfigValidator (ConvertedConfigValidator.java:validateConvertedConfig(72)) - Capacity scheduler was successfully started

This time, the conversion succeeded!

Notes about the output log of the conversion tool

There are a couple of things worth mentioning from the log:

  • Fair Scheduler does not throw an exception this time; it just prints a warning that a rule is unreachable.
  • Two warnings are displayed during the conversion:
2020-05-05 14:41:38,908 WARN  [main] converter.FSConfigToCSConfigRuleHandler (ConversionOptions.java:handleWarning(48)) - Setting <userMaxAppsDefault> is not supported, ignoring conversion

2020-05-05 14:41:38,945 WARN  [main] converter.FSConfigToCSConfigRuleHandler (ConversionOptions.java:handleWarning(48)) - Setting <maxChildResources> is not supported, ignoring conversion

As mentioned earlier, there are some feature gaps between the two schedulers; by default, a warning is printed whenever an unsupported setting is detected. This can be useful for operators to see which features they have to fine-tune after the upgrade.

  • It can be clearly seen that a Capacity Scheduler instance was started to verify that the converted configuration is valid.

Look at the converted configurations

If we look at /tmp/yarn-site.xml, we can see that it is really short:

yarn.scheduler.capacity.resource-calculator =
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled = true
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

Well, there is not much here. That is because a lot of scheduling-related settings are disabled: there is no preemption, no continuous scheduling, and no rack or node locality threshold set.

Let’s take a look at the new capacity-scheduler.xml (like before, the listing is reformatted and the unnecessary XML tags are removed):

yarn.scheduler.capacity.root.users.maximum-capacity = 100
yarn.scheduler.capacity.root.default.capacity = 50.000
yarn.scheduler.capacity.root.default.ordering-policy = fair
yarn.scheduler.capacity.root.users.capacity = 50.000
yarn.scheduler.capacity.root.default.maximum-capacity = 100
yarn.scheduler.capacity.root.queues = default,users
yarn.scheduler.capacity.root.maximum-capacity = 100
yarn.scheduler.capacity.maximum-am-resource-percent = 0.5

Notice the property yarn.scheduler.capacity.maximum-am-resource-percent which is set to 0.5. This is missing from fair-scheduler.xml, so why is it here? The tool has to set it, because the default setting in Capacity Scheduler is 10%, but in Fair Scheduler, it is 50%.

Let’s modify the following properties in yarn-site.xml:

yarn.scheduler.fair.preemption = true
yarn.scheduler.fair.sizebasedweight = true
yarn.scheduler.fair.continuous-scheduling-enabled = true

After running the conversion again, these settings are now reflected in the new yarn-site.xml:

yarn.scheduler.capacity.resource-calculator = 
org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.schedule-asynchronously.scheduling-interval-ms = 5
yarn.scheduler.capacity.schedule-asynchronously.enable = true
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval = 10000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill = 15000
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled = true
yarn.resourcemanager.scheduler.class = 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.resourcemanager.scheduler.monitor.enable = true

The size-based weight setting has also affected capacity-scheduler.xml:

yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight = true
yarn.scheduler.capacity.root.users.ordering-policy.fair.enable-size-based-weight = true
yarn.scheduler.capacity.root.users.capacity = 50.000
yarn.scheduler.capacity.root.queues = default,users
yarn.scheduler.capacity.root.users.maximum-capacity = 100
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight = true
[...] rest is omitted because it’s the same as before

Conversion of weights

How weights work in Fair Scheduler

A crucial question is how to convert weights. The weight determines the “fair share” of a queue in the long run. Fair share is the amount of resources a queue is entitled to, which limits how many resources the applications submitted to that queue can use.

For example, if “root.a” and “root.b” have weights of 3 and 1, respectively, then “root.a” will get 75% of the total cluster resources and “root.b” will get 25%.

But what if we submit applications to “root.b” only? As long as “root.a” is empty, applications in “root.b” can freely occupy the whole cluster (let’s ignore <maxResources> for now).

How can we emulate weights in Capacity Scheduler?

It turns out that Capacity Scheduler’s “capacity” is very close to the concept of weight, except that it is expressed as a percentage, not as an integer. But by default, capacity is capped, meaning that “root.b” with a capacity of 25.000 will always use only 25% of the cluster. This is where the concept of elasticity comes in. Elasticity means that free resources in the cluster can be allocated to a queue beyond its configured capacity, up to its maximum capacity, which is also expressed as a percentage. Therefore, we have to enable full elasticity by setting the maximum capacity of every queue to 100.

A simple example of Fair Scheduler weights to Capacity Scheduler configs

In summary, we can achieve Fair Scheduler-like behavior with the following properties:

Weights in Fair Scheduler:

root.a = 3
root.b = 1

The respective settings for Capacity Scheduler:

yarn.scheduler.capacity.root.a.capacity = 75.000
yarn.scheduler.capacity.root.a.maximum-capacity = 100.000
yarn.scheduler.capacity.root.b.capacity = 25.000
yarn.scheduler.capacity.root.b.maximum-capacity = 100.000

Another example with hierarchical queues

Let’s imagine the following simple queue hierarchy with weights in Fair Scheduler:

root = 1
root.users = 20
root.default = 10
root.users.alice = 3
root.users.bob = 1

This results in the following capacity values after the conversion:

yarn.scheduler.capacity.root.capacity = 100.000
yarn.scheduler.capacity.root.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.capacity = 66.667
yarn.scheduler.capacity.root.users.maximum-capacity = 100.000
yarn.scheduler.capacity.root.default.capacity = 33.333
yarn.scheduler.capacity.root.default.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.alice.capacity = 75.000
yarn.scheduler.capacity.root.users.alice.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.bob.capacity = 25.000
yarn.scheduler.capacity.root.users.bob.maximum-capacity = 100.000
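
The pattern is easy to see: each queue’s capacity is its weight divided by the sum of its siblings’ weights, multiplied by 100. Here is a minimal Python sketch (purely illustrative, not code from the tool) that reproduces the numbers above:

def weights_to_capacities(children):
    # children maps queue name -> weight for the queues under one parent
    total = sum(children.values())
    return {name: round(100.0 * weight / total, 3)
            for name, weight in children.items()}

# Siblings are normalized per parent, exactly as in the example:
print(weights_to_capacities({"users": 20, "default": 10}))
# {'users': 66.667, 'default': 33.333}
print(weights_to_capacities({"alice": 3, "bob": 1}))
# {'alice': 75.0, 'bob': 25.0}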

How the fs2cs tool works internally

After performing some basic validation steps (for example, checking that the output directory and the input files exist), the tool loads yarn-site.xml and converts scheduling-related properties such as preemption, continuous scheduling, and rack/node locality settings.

The tool uses a Fair Scheduler instance to load and parse the allocation file. It also detects unsupported properties and displays a warning for each, stating that the particular setting will not be converted. Unsupported settings and known limitations are explained later in this post.

Once the conversion is done and the output files are generated, the last step is validation. By default, fs2cs tries to start Capacity Scheduler internally using the converted configuration.

This step ensures that the Resource Manager is able to start up properly using the new configuration.
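
To summarize the flow, here is a minimal Python sketch of the steps described above. All helper names are made up for illustration; the real implementation is the FSConfigToCSConfigConverter class in Hadoop.

import os
import sys

def write_hadoop_xml(path, props):
    # Emit properties in Hadoop's <configuration> XML format.
    with open(path, "w") as f:
        f.write("<configuration>\n")
        for name, value in props.items():
            f.write("  <property><name>%s</name><value>%s</value></property>\n"
                    % (name, value))
        f.write("</configuration>\n")

def fs2cs(yarn_site, fair_scheduler, out_dir):
    # Step 1: basic validation -- the input files and the output
    # directory must exist.
    for path in (yarn_site, fair_scheduler):
        if not os.path.isfile(path):
            sys.exit("Input file not found: " + path)
    if not os.path.isdir(out_dir):
        sys.exit("Output directory not found: " + out_dir)

    # Step 2: convert global scheduling settings from yarn-site.xml
    # (preemption, continuous scheduling, locality thresholds, ...).
    site_delta = {
        "yarn.resourcemanager.scheduler.class":
            "org.apache.hadoop.yarn.server.resourcemanager"
            ".scheduler.capacity.CapacityScheduler",
    }

    # Step 3: parse the allocation file (the real tool does this with an
    # actual Fair Scheduler instance), convert the queue hierarchy and
    # warn about unsupported settings such as <maxChildResources>.
    cs_props = {"yarn.scheduler.capacity.root.queues": "default,users"}

    # Step 4: write the two output files; the real tool then starts a
    # Capacity Scheduler with the converted configuration to validate it.
    write_hadoop_xml(os.path.join(out_dir, "yarn-site.xml"), site_delta)
    write_hadoop_xml(os.path.join(out_dir, "capacity-scheduler.xml"), cs_props)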

Known limitations

At the moment, there are some feature gaps between Fair Scheduler and Capacity Scheduler. This means a full conversion is possible only if your Fair Scheduler configuration does not use settings that are currently not implemented in Capacity Scheduler.

Reservation system

Conversion of the reservation system settings is completely skipped and this will probably not change in the foreseeable future. The reason is that it is not a frequently used feature and it works completely differently in the two schedulers. 

Placement rules

A placement rule in Fair Scheduler defines which queue a submitted YARN application should be placed into. Placement rules follow a “fall through” logic: if the first rule does not apply (the queue returned by the rule does not exist), then the next one is attempted and so on. If the last rule fails to return a valid queue, then the application submission is rejected.
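
To make the fall-through behavior concrete, here is a toy Python model (the rule shapes are heavily simplified; this is not actual Fair Scheduler code):

def specified_rule(requested_queue, existing_queues):
    # "specified": place the app into the queue named at submission time,
    # if that queue exists.
    return requested_queue if requested_queue in existing_queues else None

def default_rule(requested_queue, existing_queues):
    # "default": always matches.
    return "root.default"

def place_application(rules, requested_queue, existing_queues):
    for rule in rules:
        target = rule(requested_queue, existing_queues)
        if target is not None:   # the first rule that applies wins
            return target
    raise ValueError("application rejected: no placement rule matched")

queues = {"root.default", "root.users"}
print(place_application([specified_rule, default_rule], "root.users", queues))
# -> root.users
print(place_application([specified_rule, default_rule], "root.nosuch", queues))
# -> root.default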

Capacity Scheduler employs a conceptually similar approach called mapping rules. The implementation is different though: converting placement rules to mapping rules cannot be done properly at the moment. There are multiple reasons for this:

  1. If a mapping rule matches, it immediately returns a queue (either a specific queue or root.default) and evaluation does not proceed to the next rule.
  2. Mapping rules use placeholders like %primary_group, %secondary_group and %user. This is very similar to what we have in Fair Scheduler. However, there is no equivalent of %specified.
  3. Placement rules can have a create flag. If create=true, then the queue is created dynamically. Capacity Scheduler does not have an auto-queue-creation feature on a per-rule basis. It is capable of creating queues on demand if the parent is a so-called managed parent (the property auto-create-child-queue is enabled), but managed parent queues cannot have static leaf queues, i.e., children under them cannot be defined in capacity-scheduler.xml (see the configuration example after this list).
  4. Nested rules for primary and secondary groups further complicate matters because the create flag is interpreted on both the outer and the inner rules.
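
To illustrate point 3: a managed parent queue in Capacity Scheduler is configured roughly like this (property names as of Hadoop 3.x, please verify against your version):

yarn.scheduler.capacity.root.users.auto-create-child-queue.enabled = true
yarn.scheduler.capacity.root.users.leaf-queue-template.capacity = 10

With this configuration, a queue such as root.users.alice can be created on demand, but no static child queues of root.users may be declared in capacity-scheduler.xml.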

These differences make it difficult, sometimes impossible to convert placement rules to mapping rules. In such circumstances, cluster operators have to be creative and deviate from their original placement algorithm.

Unsupported properties

The following properties will not be converted by the tool:

  • <maxRunningApps> per user – maximum running applications per user
  • <userMaxAppsDefault> – default maximum applications per user
  • <minResources> – minimum resources for a queue
  • <maxResources> – maximum resources for a queue
  • <maxChildResources>  – maximum resources for a dynamically created queue
  • DRF ordering policy on the queue level: in Capacity Scheduler, DRF has to be global. In Fair Scheduler, it’s possible to use the normal “fair” policy under a DRF parent.

Future improvements

There is still active development going on to provide a better user experience. The most important tasks are:

  1. Handle vector of percentages as a resource in Capacity Scheduler (YARN-9936)
    Users will be able to define not just a single capacity, but multiple values for different resources.
  2. Handle maxRunningApps per user and userMaxAppsDefault (YARN-9930)
    Capacity Scheduler has a “maximum applications per user” setting, but it’s not directly configurable and is cumbersome to use, because it’s a combination of three settings (see the formula after this list). We also have to be careful not to break existing behaviour: the existing logic in Capacity Scheduler rejects application submission when the maximum is exceeded, whereas in Fair Scheduler, the application is always accepted and scheduled later.
  3. Handle minResources, maxResources and maxChildResources
    These depend highly on YARN-9936. In Fair Scheduler, users can express these settings in a variety of ways (single percentage, two separate percentages or absolute resources). In order to support similar settings in Capacity Scheduler, we need YARN-9936.
  4. Making mapping rules behave similarly to the implementation that exists in Fair Scheduler
    The way mapping rules are evaluated was explained in the “Placement rules” section. We probably need a new, pluggable approach; this way, we won’t introduce regressions into the existing codebase, which is already quite complex.
  5. Improvements regarding DRF and other scheduling policies (YARN-9892)
    Currently, we have a single global resource calculator defined by the property yarn.scheduler.capacity.resource-calculator. This is more fine-grained in Fair Scheduler.
  6. Generic fine-tuning regarding the whole conversion process
    There are properties in Capacity Scheduler such as “user-limit-factor” or “minimum-user-limit-percent”. We don’t utilize these settings right now, but it might turn out that they prove useful in certain configurations.
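
Regarding item 2: to our knowledge, the effective per-user application limit in Capacity Scheduler is currently derived from three properties, roughly as:

max applications per user = maximum-applications * (minimum-user-limit-percent / 100) * user-limit-factor

This is exactly the kind of indirection that YARN-9930 aims to remove.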

Conclusion

The fs2cs tool has become an integral part of the CDH to CDP upgrade path, helping customers transform their Fair Scheduler-based configuration to Capacity Scheduler. We learned why switching to Capacity Scheduler has clear benefits.

We have seen that not everything is perfect at the moment. There are certain features in Fair Scheduler that are either missing or just partly supported in Capacity Scheduler. When such a setting is encountered during the conversion, the tool displays a warning message. 

Some aspects of the conversion are quite challenging, especially the placement rules. Even though the two schedulers are conceptually similar, they follow slightly different queue placement philosophies, and this requires extra effort on our part to make Capacity Scheduler mapping rules work the same way.

Nonetheless, we are committed to implementing all necessary changes to increase customer satisfaction and improve user experience.

Acknowledgments

Thanks to Wilfred Spiegelenburg and Prabhu Josephraj for creating the table about the differences of the two schedulers and providing useful insights and ideas.

Szilard Németh contributed to the tool itself by handling and processing command line arguments and he also reviewed this blogpost. 

I’d also like to thank those who provided useful feedback: Ádám Antal, Rudolf Réti, Benjámin Teke, András Győri, Wangda Tan.

Peter Bacsko
Sr Software Engineer
Rudolf Reti
