Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics

Categories: Hadoop MapReduce YARN

In this multipart series, fully explore the tangled ball of thread that is YARN.

YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:

  • Provide a basic understanding of the components that make up YARN
  • Illustrate how a MapReduce job fits into the YARN model of computation. (Note: although Apache Spark integrates with YARN as well, this series will focus on MapReduce specifically. For information about Spark on YARN, see this post.)
  • Present an overview of how the YARN scheduler works and provide building-block examples for scheduler configuration

The series comprises the following parts:

  • Part 1: Cluster and YARN basics
  • Part 2: Global configuration basics
  • Part 3: Scheduler concepts
  • Part 4: FairScheduler queue basics
  • Part 5: Using FairScheduler queue properties

In this initial post, we’ll cover the fundamentals of YARN, which runs processes on a cluster similarly to the way an operating system runs processes on a standalone computer. Subsequent parts will be released every few weeks.

Cluster Basics (Master/Worker)

A host is the Hadoop term for a computer (also called a node in YARN terminology). A cluster is two or more hosts connected by a high-speed local network. From the standpoint of Hadoop, a cluster can contain several thousand hosts.

In Hadoop, there are two types of hosts in the cluster.

Figure 1: Master host and Worker hosts

Conceptually, a master host is the communication point for a client program. A master host sends the work to the rest of the cluster, which consists of worker hosts. (In Hadoop, a cluster can technically be a single host. Such a setup is typically used for debugging or simple testing, and is not recommended for a typical Hadoop workload.)

YARN Cluster Basics (Master/ResourceManager, Worker/NodeManager)

In a YARN cluster, there are two types of hosts:

  • The ResourceManager is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
  • A NodeManager is a worker daemon that launches and tracks processes spawned on worker hosts.

Figure 2: Master host with ResourceManager and Worker hosts with NodeManager

YARN Configuration File

The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties used to configure YARN are covered later in this series.
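
For orientation, here is a minimal sketch of what yarn-site.xml can look like. The property name is a standard YARN property, but the value is illustrative only; in a managed deployment, your distribution or cluster manager typically maintains this file for you.

    <?xml version="1.0"?>
    <!-- yarn-site.xml: read by both the ResourceManager and the NodeManagers -->
    <configuration>
      <!-- The host that runs the ResourceManager daemon (example value) -->
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master-host.example.com</value>
      </property>
    </configuration>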

YARN Requires a Global View

YARN currently defines two resources: vcores and memory. Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager, which keeps a running total of the cluster’s available resources. By keeping track of the total, the ResourceManager knows how to allocate resources as they are requested. (Vcore has a special meaning in YARN: you can think of it simply as a “usage share of a CPU core.” If you expect your tasks to be less CPU-intensive, sometimes called I/O-intensive, you can set the ratio of vcores to physical cores higher than 1 to maximize your use of hardware resources.)

Figure 3: ResourceManager global view of the cluster
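
The resources each NodeManager advertises come from its copy of yarn-site.xml. Here is a minimal sketch using the standard NodeManager resource properties; the values shown are examples only and should be sized to the host’s actual hardware.

    <!-- In yarn-site.xml on a worker host (example values) -->
    <property>
      <!-- Memory, in MB, that this NodeManager offers to YARN containers -->
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>16384</value>
    </property>
    <property>
      <!-- Vcores this NodeManager offers; may exceed physical cores for I/O-intensive workloads -->
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>16</value>
    </property>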

Containers

Containers are an important YARN concept. You can think of a container as a request to hold resources on the YARN cluster. Currently, a container hold request consists of vcores and memory, as shown in Figure 4 (left).

Figure 4: Container as a hold (left), and container as a running process (right)

Once a hold has been granted on a host, the NodeManager launches a process called a task. The right side of Figure 4 shows the task running as a process inside a container. (Part 3 will cover, in more detail, how YARN schedules a container on a particular host.)
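
In client code, a hold request is expressed as a resource specification plus a priority. The sketch below uses the public YARN client API (AMRMClient.ContainerRequest); the 2048 MB / 1 vcore sizing and the class name are illustrative only. A request like this does nothing by itself; the next section shows the ApplicationMaster that actually submits such requests to the ResourceManager.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ContainerHoldExample {
      public static void main(String[] args) {
        // A container hold request: 2048 MB of memory and 1 vcore.
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);

        // null for the node and rack lists means "no locality preference."
        ContainerRequest holdRequest =
            new ContainerRequest(capability, null, null, priority);

        System.out.println("Container hold request: " + holdRequest);
      }
    }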

YARN Cluster Basics (Running Process/ApplicationMaster)

For the next section, two new YARN terms need to be defined:

  • An application is a YARN client program that is made up of one or more tasks (see Figure 5).
  • For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.

Running an application on a YARN cluster involves the following steps:

  1. The application starts and talks to the ResourceManager for the cluster:

    Figure 5: Application starting up before tasks are assigned to the cluster

  2. The ResourceManager allocates a single container on behalf of the application:

    Figure 6: Application + allocated container on a cluster

  3. The ApplicationMaster starts running within that container:

    Figure 7: Application + ApplicationMaster running in the container on the cluster

  4. The ApplicationMaster requests subsequent containers from the ResourceManager, which are allocated to run tasks for the application. Those tasks do most of their status communication with the ApplicationMaster started in Step 3 (a simplified code sketch of this request loop follows this list):

    Figure 8: Application + ApplicationMaster + task running in multiple containers running on the cluster

  5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.
  6. The application client exits. (The ApplicationMaster launched in a container is more specifically called a managed AM. Unmanaged ApplicationMasters run outside of YARN’s control. Llama is an example of an unmanaged AM.)
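
To make these steps concrete, below is a heavily simplified sketch of a managed ApplicationMaster, modeled loosely on the “Writing YARN Applications” example in the Apache Hadoop documentation. The class name, the container sizing, the fixed number of requests, and the omission of actual task launching (which would use an NMClient) are all simplifications for illustration.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SimpleApplicationMaster {
      public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Step 3: this code is already running inside the AM's own container.
        // Register with the ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Step 4: request containers in which the application's tasks will run.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        int requested = 2;
        for (int i = 0; i < requested; i++) {
          rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // Poll the ResourceManager until all requested containers are allocated.
        int allocated = 0;
        while (allocated < requested) {
          List<Container> containers = rmClient.allocate(0.1f).getAllocatedContainers();
          allocated += containers.size();
          // A real ApplicationMaster would now use an NMClient to launch a task
          // (a process) inside each allocated container and track its progress.
          Thread.sleep(1000);
        }

        // Step 5: a real AM waits for every task to finish, then deregisters so
        // that its own container can be released.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
      }
    }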

MapReduce Basics

In the MapReduce paradigm, an application consists of Map tasks and Reduce tasks. Map tasks and Reduce tasks align very cleanly with YARN tasks.


Figure 9: Application + Map tasks + Reduce tasks

Putting it Together: MapReduce and YARN

Figure 10 illustrates how the map tasks and the reduce tasks map cleanly to the YARN concept of tasks running in a cluster.

Figure 10: Merged MapReduce/YARN Application Running on a Cluster

In a MapReduce application, there are multiple map tasks, each running in a container on a worker host somewhere in the cluster. Similarly, there are multiple reduce tasks, also each running in a container on a worker host.

Simultaneously on the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster’s resources and ensure that the tasks, as well as the corresponding application, finish cleanly.
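
For readers who want to see the application-client side of this picture in code, here is a minimal, hypothetical MapReduce driver (the names MyDriver, MyMapper, and MyReducer are placeholders for a word-count-style job). When the job is submitted, the MapReduce framework supplies the ApplicationMaster, and the map and reduce tasks run in YARN containers as described above.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyDriver {

      // Map task: emits (word, 1) for every word in its input split.
      public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce task: sums the counts emitted by the map tasks for each word.
      public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "yarn blog example");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The client blocks here while the MapReduce ApplicationMaster and its
        // map and reduce tasks run in containers on the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }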

Conclusion

Summarizing the important concepts presented in this section:

  1. A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
  2. In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
  3. The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total.
  4. A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the size of the hold for the container. Once the container is allocated, those resources are usable by the container.
  5. An application in YARN comprises three parts:
    1. The application client, which is how a program is run on the cluster.
    2. An ApplicationMaster, which provides YARN with the ability to perform allocation on behalf of the application.
    3. One or more tasks that do the actual work (each running as a process) in the containers allocated by YARN.
  6. A MapReduce application consists of map tasks and reduce tasks.
  7. A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.

Next Time…

Part 2 will cover calculating YARN properties for cluster configuration.

Ray Chiang is a Software Engineer at Cloudera.

Dennis Dawson is a Senior Technical Writer at Cloudera.



9 responses on “Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics”

  1. trinity

    Hi,
    Great overview, thanks. Can you elaborate on what exactly a container is? What’s the relationship between a container and a YARN task? Is a task a subprocess of a container, or what? How is the memory use of a container tracked and enforced? Thanks again.

  2. Ray Chiang

    Hi Trinity.

    Your questions really start delving into YARN internals. I’ll see if I can provide some kind of not-overly-technical summary here.

    1) When a container is “launched” on a NodeManager, there is a bunch of setup (localization, environment, etc.) that gets done before either the container or the YARN task can launch.

    2) Once setup is completed, the NodeManager launches a process that launches the actual YARN task. The parent process of the YARN task is used by YARN as a handle for managing the task.

    3) Using the handle in 2), YARN can track when a process goes over its container allocation for memory and will kill it.

    Parts 2) and 3) are part of what could be considered the “running” part of a container. This isn’t an important detail for cluster configuration (the intended purpose of this blog), which is why we didn’t go into too much detail.

    Some more technical detail is exposed through the example code in the “Writing YARN Applications” documentation on the Apache Hadoop site (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html).

    I hope that answers your questions.

  3. Yann Barraud

    Nice post !!

    Could we have an example of Spark on YARN like the “Putting it Together: MapReduce and YARN” paragraph?

  4. PRAVEEN KANUMARLAPUDI

    Thank you for the very detailed explanation.
    Can you correct my understanding of vcores? As I understand from this post, vcores are the amount of CPU allocated to a container. What advantage does this have over MRv1? Is it that vcores are useful when multiple containers running on the same worker can also share CPU among them, minimizing the waiting time for jobs waiting for resources, or to run HIGH PRIORITY jobs faster with more vcores…

  5. Ray Chiang

    I purposely avoided comparison to MRv1 for two reasons:
    1) Allocation in MRv1 was based on fixed allocation of map slots and reduce slots. For a single job, if you had many pending map tasks and few reduce tasks, then you would often have idle reduce slots and idle nodes waiting for jobs. Similarly for the reverse situation.
    2) There is no concept of running “non-MR” jobs on a MRv1 cluster (i.e. no YARN). Containers allow any YARN application to run on the cluster simultaneously with MR jobs (i.e. multi-tenancy).

    vcores and memory reservations allow the YARN scheduler to do a “best fit” for the resource requests on the cluster vs. what resources each node has available; that’s it. In general, you have three situations for a particular application running on YARN:

    A) If most tasks of an application use significantly less CPU and/or memory than the container request (e.g. “vcore=1, memory=2048M”), you are under-utilizing your cluster. In the vcore case, you will likely want to increase yarn.nodemanager.resource.cpu-vcores per host.
    B) If all of an application’s container requests are very close to what the task uses, then your cluster will run jobs fine.
    C) If most tasks of an application use significantly more CPU and/or memory than the container request, you are oversubscribing your cluster (and likely keeping hosts at high load). In the vcore case, you will likely want to decrease yarn.nodemanager.resource.cpu-vcores per host or increase the # of vcores used per container.

    Also, please read Part 2 and see if that clarifies the vcore concept for you.

  6. VicKatte

    Hello,

    Excellent blog – I have been trying to untangle the YARN threadball myself but still have several areas where things are not yet clear in my mind. I wonder if you could help answer them for me.

    1) How does YARN learn the information to be able to enforce data locality? This data is held in the Namenode, and during a job submission there is no communication between the client submitting the job and the Namenode, so how is it able to calculate input splits, for example? I understand input splits can also be calculated by the application master. Does this mean that the application master communicates with the Namenode to learn about the location of the data that it will need in order to be able to make resource requests?

    2) Do running containers heartbeat their parent application masters? Is this communication supported natively by YARN, or is it something that each application framework can provide specifically? In the MapReduce framework, the YarnChild is shown to provide status reports to the MR application master – is this something that is built into the MR application framework on YARN?

    The book by Arun C Murthy (Apache Hadoop YARN) discusses this on page 149, but it is not exactly clear if this is supported natively, how it is done, or how it can be done.

    Basically, does a running container communicate with any other components (NodeManager, ResourceManager, ApplicationMaster, etc.) of the YARN system? If so, how? Which protocols are used for such communications?

    Thanks

    1. Ray Chiang

      First, let me qualify these answers a bit. This blog post is intended to provide just the right amount of information for configuring YARN. Your questions dig fairly deep into YARN internals for YARN developers. For these types of questions, there are currently four places to get an answer:

      A) The source code for Hadoop. I highly recommend downloading the Hadoop codebase and using IntelliJ as documented in the blog post https://blog.cloudera.com/blog/2014/06/how-to-create-an-intellij-idea-project-for-apache-hadoop/
      B) The developer lists for YARN and MapReduce listed at http://hadoop.apache.org/mailing_lists.html. I highly recommend using a search engine on the mailing list to find answers.
      C) The Apache Hadoop documentation http://hadoop.apache.org/docs/current/ and wiki http://wiki.apache.org/hadoop/FrontPage have a lot of information.
      D) Books. I strongly recommend getting a complete understanding of “Hadoop: The Definitive Guide” before diving too much into “Apache Hadoop YARN”.

      That being said, here’s what I know at the moment.

      Also, you’re asking a lot of loaded questions here, which would take a great deal of in-depth explanation. I can provide what bits I know, which may help you get further along.

      1) It’s difficult to answer this question without fully explaining how a MapReduce job does its setup, runs the tasks, and how that links to YARN. Let’s break that down into the four most relevant pieces:

      a) MapReduce Input Splits

      When you run a “pi” job, you’ll see output like:

      15/12/14 21:39:00 INFO input.FileInputFormat: Total input paths to process : 2
      15/12/14 21:39:01 INFO mapreduce.JobSubmitter: number of splits:2

      So, some of the InputSplit work is done in FileInputFormat.java and JobSubmitter.java. Looking at the code a bit deeper, you can see the following:

      public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {

      And looking at the parent interface InputFormat.java:

      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

      So, all subclasses of InputFormat need to implement the getSplits() method, which is called by JobSubmitter. So, at this point the job has the information about the splits.

      b) MapReduce Job/MapTask

      Now that a job has its information about the splits, it must pass those to the individual MapTasks. Looking at JobImpl.java:

      private void createMapTasks(JobImpl job, long inputLength,
          TaskSplitMetaInfo[] splits) {

      So, at this level, each MapTask has information about its splits.

      c) MapReduce TaskAttempts

      As each MapTask runs, it is still the responsibility of each MapTask to keep MapTaskAttempts going until it succeeds (or hits some failure threshold). Looking at the constructor for TaskAttemptImpl:

      public TaskAttemptImpl(TaskId taskId, int i,
          EventHandler eventHandler,
          TaskAttemptListener taskAttemptListener, Path jobFile, int partition,
          JobConf conf, String[] dataLocalHosts,
          Token<JobTokenIdentifier> jobToken,
          Credentials credentials, Clock clock,
          AppContext appContext) {

      At this point, the MapTask split info has been converted into an array of dataLocalHosts. I’ll leave it to you to find exactly where.

      d) Scheduling Containers via YARN

      YARN does the key container allocation in AbstractYarnScheduler.java in the method:

      Allocation allocate(ApplicationAttemptId appAttemptId,
          List<ResourceRequest> ask, List<ContainerId> release,
          List<String> blacklistAdditions, List<String> blacklistRemovals,
          List<ContainerResourceChangeRequest> increaseRequests,
          List<ContainerResourceChangeRequest> decreaseRequests);

      From here, you’ll notice the ResourceRequest class. Looking at that class, you can see:

      public static ResourceRequest newInstance(Priority priority, String hostName,
      Resource capability, int numContainers, boolean relaxLocality,
      String labelExpression) {

      So, each ResourceRequest contains a hostName, which is presumably the same as that coming from the TaskAttemptImpl.

      2) Almost all communication between components uses RPC calls via Google’s Protocol Buffers. This is mentioned in https://wiki.apache.org/hadoop/ProtocolBuffers. I’m not going into detail here, but I highly recommend getting a good understanding of Protocol Buffers and/or looking at the various *Heartbeat*.java files that exist in the source code and exploring from there.

  7. Abhi

    As per my understanding, for a single application client, a single application master is spawned. If my understanding is wrong, then can you provide me some more details about the application master?