Apache Hadoop Beyond MapReduce, Part 1: Introducing Kitten
This week, a team of researchers at Google will be presenting a paper describing a system they developed that can learn to identify objects, including the faces of humans and cats, from an extremely large corpus of unlabeled training data. It is a remarkable accomplishment, both in terms of the system’s performance (a 70% improvement over the prior state of the art) and its scale: the system runs on over 16,000 CPU cores and was trained on 10 million 200×200 pixel images extracted from YouTube videos.
Doug Cutting has described Apache Hadoop as “the kernel of a distributed operating system.” Until recently, Hadoop has been an operating system that was optimized for running a certain class of applications: the ones that could be structured as a short sequence of MapReduce jobs. Although MapReduce is the workhorse programming framework for distributed data processing, there are many difficult and interesting problems (including combinatorial optimization problems, large-scale graph computations, and machine learning models that identify pictures of cats) that can benefit from a more flexible execution environment.
Hadoop 2.0 introduced a substantial redesign of the core resource scheduling and task tracking system that will allow developers to create entirely new classes of applications for Hadoop. Cloudera’s Ahmed Radwan has written an excellent overview of the architecture of the new resource scheduling system, known as YARN. Hadoop’s open-source foundation and its broad adoption by industry, academia, and government labs mean that, for the first time in history, developers can assume that a common platform for distributed computing will be available at organizations all over the world. They can also assume that there will be a market for applications that take advantage of that common platform to solve problems at scales that have never been considered before.
Kitten: For Developers Who Want to Play with YARN
Hadoop 2.0 ships with an example YARN application called Distributed Shell, which lets a user create some number of tasks on a Hadoop cluster and run a shell command in each of them. Distributed Shell is intended to be a simple example that illustrates what is required to create a YARN application. Although the YARN API is powerful and flexible, that power and flexibility come at a cost: the simple Distributed Shell application consists of over 1,600 lines of Java code that are primarily concerned with configuring parameters and executing RPCs to communicate with YARN’s resource scheduler.
In our experience, most distributed applications have a similar lifecycle: they are initialized with a set of resources, they run for a while and need to be monitored for results and/or failures, and they require some cleanup to release the resources they were using when they finish executing. We implemented these lifecycle patterns in Kitten, a set of Java libraries and Lua-based configuration files that handle configuring, launching, and monitoring YARN applications, allowing developers to spend most of their time writing the logic of an application, not its configuration. In the same way that Java libraries like Crunch encode common MapReduce patterns into tools that allow developers to solve problems without needing to worry about low-level implementation details, Kitten implements a series of patterns for configuring and managing the typical lifecycle of YARN applications without requiring developers to know every detail of the YARN APIs.
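To make the three-phase lifecycle concrete, here is a minimal sketch in plain Java. It is not Kitten's API and it does not call YARN at all: the names `ManagedApplication`, `runLifecycle`, and `RecordingApp` are hypothetical, invented only to illustrate the initialize, run-and-monitor, and cleanup pattern described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names come from Kitten or YARN.
final class LifecycleSketch {

    /** The three lifecycle phases, expressed as an assumed interface. */
    interface ManagedApplication {
        void acquireResources() throws Exception; // initialize with a set of resources
        boolean runAndMonitor() throws Exception; // run; return true on success
        void releaseResources();                  // cleanup when finished
    }

    /**
     * Drives an application through its lifecycle. Any exception is treated
     * as a failed run, and cleanup executes whether or not the run succeeded.
     */
    static boolean runLifecycle(ManagedApplication app) {
        try {
            app.acquireResources();
            return app.runAndMonitor();
        } catch (Exception e) {
            return false;
        } finally {
            app.releaseResources();
        }
    }

    /** Toy application that records which phases ran, optionally failing mid-run. */
    static final class RecordingApp implements ManagedApplication {
        final List<String> events = new ArrayList<>();
        private final boolean failWhileRunning;

        RecordingApp(boolean failWhileRunning) {
            this.failWhileRunning = failWhileRunning;
        }

        @Override
        public void acquireResources() {
            events.add("acquire");
        }

        @Override
        public boolean runAndMonitor() throws Exception {
            events.add("run");
            if (failWhileRunning) {
                throw new IllegalStateException("simulated task failure");
            }
            return true;
        }

        @Override
        public void releaseResources() {
            events.add("release");
        }
    }

    public static void main(String[] args) {
        RecordingApp healthy = new RecordingApp(false);
        System.out.println(runLifecycle(healthy) + " " + healthy.events);

        RecordingApp failing = new RecordingApp(true);
        System.out.println(runLifecycle(failing) + " " + failing.events);
    }
}
```

In this sketch, the successful and the failing application both pass through all three phases, so resources are released either way. A real YARN application would additionally have to negotiate containers with the resource scheduler and track remote tasks, which is the part a library such as Kitten aims to handle.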
Free as in Kittens
The source code for Kitten is available on GitHub, along with surprisingly ample documentation, all of which is released under the Apache 2.0 License. Kitten was developed against the experimental YARN module that is distributed with CDH4, and is intended for use by developers who are interested in developing distributed systems against the YARN APIs; we do not recommend running production jobs (MapReduce or otherwise) under YARN at the present time.
In the second part of this series, we will introduce BranchReduce, a framework for building distributed branch-and-bound solvers that we developed using Kitten.