What I Learned During My Summer Internship at Cloudera, Part 2

The guest post below is from Wei Yan, a 2013 summer intern at Cloudera. In this post, he helpfully describes his personal projects from this summer. Thanks for your contributions, Wei!

As a Ph.D. student at Vanderbilt University, I work on the Apache Hadoop MapReduce framework, with a focus on optimizing data-intensive computing tasks. Although I’m very familiar with MapReduce itself, my curiosity about the use cases for MapReduce and where it generally fits in the Big Data area drew me to Cloudera for the summer of 2013.

At Cloudera, I mainly worked on two projects: a Hadoop YARN Scheduler Load Simulator and a testing framework called Hadoop MiniKDC. In the remainder of this post, I’ll describe my experiences with each and what I learned from them.

Predicting YARN Scheduler Performance

My first project was to build a simulator that evaluates Hadoop YARN scheduling algorithms. YARN is a platform for hosting applications in Hadoop 2, and provides resource management for various data processing frameworks, such as Hadoop MapReduce.

The YARN scheduler is a major component in Hadoop that matches available resources with resource requests. It has multiple implementations, such as the FIFO, Capacity, and Fair schedulers, and each implementation bases its scheduling decisions on different factors, such as fairness and capacity guarantees. Selecting an appropriate implementation and configuring it correctly therefore brings many benefits (better system throughput and resource utilization, for example). Conversely, a poor selection may hurt system performance. It is thus very important to evaluate a scheduler implementation and its configuration thoroughly before deploying it in a production cluster.
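To make the difference between policies concrete, here is a toy sketch of how a FIFO policy and a fair policy might divide the same containers between two applications. It is only an illustration under simplifying assumptions (identical containers, no queues, no capacity limits) and is not how the real YARN schedulers are implemented; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Toy comparison of FIFO and fair allocation; not YARN's actual scheduler code. */
class SchedulerPolicySketch {

    /**
     * Hands out identical containers one at a time.
     * demands[i] is how many containers application i wants, in submission order.
     */
    static int[] allocate(int[] demands, int containers, boolean fair) {
        int[] granted = new int[demands.length];
        for (int c = 0; c < containers; c++) {
            int pick = -1;
            for (int i = 0; i < demands.length; i++) {
                if (granted[i] >= demands[i]) {
                    continue; // this application is already satisfied
                }
                if (pick == -1) {
                    pick = i; // first unsatisfied application in submission order
                    if (!fair) {
                        break; // FIFO: always serve the earliest application first
                    }
                } else if (granted[i] < granted[pick]) {
                    pick = i; // fair: prefer the application holding the fewest containers
                }
            }
            if (pick == -1) {
                break; // every demand has been met
            }
            granted[pick]++;
        }
        return granted;
    }

    public static void main(String[] args) {
        int[] demands = {4, 4};
        System.out.println("FIFO: " + Arrays.toString(allocate(demands, 4, false)));
        System.out.println("Fair: " + Arrays.toString(allocate(demands, 4, true)));
    }
}
```

With two applications that each want four containers and only four available, the FIFO variant gives everything to the first application, while the fair variant splits them evenly. Even in this tiny case, the choice of policy changes which jobs make progress.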

Unfortunately, evaluating a scheduler implementation is currently nontrivial. Evaluating in a real cluster is costly and time-consuming, and it is also very hard for researchers to find a sufficiently large cluster. Hence, a simulator that can predict how well a scheduler implementation works for a specific workload would be quite useful.

The objective of this project was to build a YARN Scheduler Load Simulator to simulate large-scale YARN clusters and application loads in a single machine (YARN-1021). This would be invaluable in furthering YARN by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation.

The first challenge here was to determine the right level of abstraction, that is, to decide which Hadoop components I would execute and which I would simulate. It is unrealistic to launch real Node Managers (NMs) and Application Masters (AMs) in the simulator: each NM and AM requires considerable resources (CPU and memory), so launching several of them would quickly use up all the available resources on a single machine. Thus, I needed to define an abstraction level that could support large-scale clusters and application loads on a single machine without affecting the scheduler algorithm’s semantics and behavior.

The simulator exercises the real YARN resource manager and removes the network factor by simulating NMs and AMs via handling and dispatching NM/AM heartbeat events from within the same JVM. By simulating NMs and AMs, the simulator can limit its required CPU and memory. Using the real resource manager where the scheduler algorithm resides, the simulator can largely maintain the original scheduler running semantics.
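The idea of replacing real NMs with lightweight heartbeat events can be sketched as follows. This is a minimal single-threaded illustration and not the simulator's actual code: `HeartbeatHandler` is a hypothetical stand-in for the real resource manager, and the event queue, staggered start times, and all other names are assumptions made for the example.

```java
import java.util.PriorityQueue;

/** Sketch: many simulated nodes heartbeat to one in-process handler. All names are hypothetical. */
class HeartbeatSimSketch {

    /** Stand-in for the component under test (the real resource manager in the post). */
    interface HeartbeatHandler {
        void onNodeHeartbeat(int nodeId, long simTimeMs);
    }

    /** A pending heartbeat, ordered by simulated time. */
    static final class Event implements Comparable<Event> {
        final long timeMs;
        final int nodeId;

        Event(long timeMs, int nodeId) {
            this.timeMs = timeMs;
            this.nodeId = nodeId;
        }

        @Override
        public int compareTo(Event other) {
            return Long.compare(timeMs, other.timeMs);
        }
    }

    /** Runs until simulated time endMs (exclusive); returns the number of heartbeats delivered. */
    static long run(int nodes, long intervalMs, long endMs, HeartbeatHandler handler) {
        PriorityQueue<Event> queue = new PriorityQueue<>();
        for (int id = 0; id < nodes; id++) {
            // Stagger first heartbeats so nodes do not all fire at the same instant.
            queue.add(new Event(id % intervalMs, id));
        }
        long delivered = 0;
        while (!queue.isEmpty() && queue.peek().timeMs < endMs) {
            Event e = queue.poll();
            handler.onNodeHeartbeat(e.nodeId, e.timeMs); // plain method call, no network
            delivered++;
            queue.add(new Event(e.timeMs + intervalMs, e.nodeId)); // schedule the next heartbeat
        }
        return delivered;
    }

    public static void main(String[] args) {
        long[] last = new long[1];
        long count = run(1000, 1000, 10_000, (nodeId, t) -> last[0] = t);
        System.out.println(count + " heartbeats, last at simulated t=" + last[0] + " ms");
    }
}
```

Each simulated node here is just an entry in a queue rather than a process, which is why a thousand of them fit comfortably in one JVM. The handler on the receiving side is still called once per heartbeat, as it would be by a real node.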

Defining the metrics used for scheduler evaluation was another major challenge because tracking these metrics should not bring much overhead to the scheduler’s own execution. Furthermore, these metrics should provide enough information for developers and researchers to evaluate a scheduler algorithm.

Based on the above, the simulator provides three types of metrics:

  • Resource usage for the whole cluster and each queue, which can be utilized to configure cluster and queue capacity
  • The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand and validate the scheduler’s behavior (individual job turnaround time, throughput, fairness, capacity guarantee)
  • Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, and so on), which Hadoop developers can use to find performance bottlenecks and scalability limits in the code

Furthermore, I built a web app inside the simulator that enables users to track these metrics in real time while the simulator is running.
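As an illustration of the third metric type, the time cost of each scheduler operation could be gathered with a thin timing wrapper like the one below. This is a sketch under assumptions, not the simulator's implementation: the class, the method names, and the choice of `LongAdder` counters (picked here to keep the recording overhead small) are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Sketch: accumulate call counts and elapsed time per scheduler operation. Names are hypothetical. */
class OperationTimerSketch {
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> nanos = new ConcurrentHashMap<>();

    /** Runs the operation, records how long it took, and returns its result. */
    <T> T time(String operation, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            // Record the elapsed time before the count so a reader never sees a count without a total.
            nanos.computeIfAbsent(operation, k -> new LongAdder()).add(elapsed);
            calls.computeIfAbsent(operation, k -> new LongAdder()).increment();
        }
    }

    /** Number of recorded calls for the operation, or 0 if it was never recorded. */
    long count(String operation) {
        LongAdder adder = calls.get(operation);
        return adder == null ? 0 : adder.sum();
    }

    /** Mean cost in nanoseconds, or 0 if the operation was never recorded. */
    double meanNanos(String operation) {
        long n = count(operation);
        return n == 0 ? 0.0 : (double) nanos.get(operation).sum() / n;
    }

    public static void main(String[] args) {
        OperationTimerSketch timer = new OperationTimerSketch();
        for (int i = 0; i < 1000; i++) {
            // "allocate" stands in for a real scheduler call.
            timer.time("allocate", () -> Math.sqrt(42.0));
        }
        System.out.printf("allocate: %d calls, mean %.0f ns%n",
                timer.count("allocate"), timer.meanNanos("allocate"));
    }
}
```

The wrapper only adds two clock reads and two counter updates per call, which reflects the constraint described above: measuring the scheduler should not noticeably slow it down.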

Working on the simulator was a great opportunity for me to understand in detail the design and implementation of the YARN framework, including its event-driven architecture, the decoupling of resource management from applications, and so on. Another lesson was how to actually deliver the project. The first demo did not have the online tracking service; users could only analyze the metrics hours after the simulator finished. Eventually, I designed and implemented a web app for users to track the simulator in real time.

The video below presents a demo of the simulator and the real-time tracking feature:

Making Kerberos Security Testing Easier

My second project was to build a “mini KDC” (HADOOP-9848) for Kerberos security testing in the Hadoop ecosystem. Currently, lots of security test cases in Hadoop are disabled, as they require running a Kerberos Key Distribution Center (KDC). Setting up a KDC takes time and that process is hard to automate.

The MiniKDC builds an embedded KDC using Apache Directory Server, and allows the creation of principals and keytabs on the fly. Developers can start and stop a KDC directly inside their Java code. It fulfills a specific requirement in the Hadoop project, and can be integrated with several other projects. An important lesson for me here was how to design APIs. Simple but functional APIs can help developers easily integrate MiniKDC with other projects.

Learning How to Contribute

At Cloudera, I also learned how to get involved in, and contribute to, an open source project like Hadoop. I started with a simple HDFS JIRA to remove an unnecessary warning message from the logs. This JIRA helped me get familiar with the development protocol and the whole lifecycle, from creating a Hadoop issue, building Hadoop, and testing my changes to posting a patch for others to review.

After that, I worked on several JIRAs in different areas of Hadoop (Common, YARN, MapReduce). Working on these JIRAs gave me the opportunity to join and collaborate with the Hadoop open source community, where I could easily publish my own work and gain useful feedback from other contributors. Another benefit of interacting with the community is that you can have discussions with people from different companies, which is a great opportunity to collect requirements from different use cases.

Back to School

All in all, I had a great time working on these projects and interacting with all the brilliant people I’ve had the pleasure to meet at Cloudera. I learned so much this summer. I saw how a software company operates and learned the art and practice of open source software development. Cloudera has given me a more complete picture of the diversity of industry’s data and needs, which will also help me with my research when I return to school.

Read about 2013 summer intern Catherine Ray’s experiences here!