Cloudera Engineering Blog · YARN Posts

How-to: Secure YARN Containers with Cloudera Navigator Encrypt

Learn how Cloudera Navigator Encrypt bring data security to YARN containers.

With the introduction of transparent data encryption in HDFS, we are now a big step closer toward a secure platform in the Apache Hadoop world. However, there are still gaps in the platform, including how YARN and its applications manage their cache. In this post, I’ll explain how Cloudera Navigator Encrypt fills that particular gap.

Use Case

Apache Spark Resource Management and YARN App Models

A concise look at the differences between how Spark and MapReduce manage cluster resources under YARN

The most popular Apache YARN application after MapReduce itself is Apache Spark. At Cloudera, we have worked hard to stabilize Spark-on-YARN (SPARK-1101), and CDH 5.0.0 added support for Spark on YARN clusters.

How Apache Hadoop YARN HA Works

Thanks to recent work upstream, YARN is now a highly available service. This post explains its architecture and configuration details.

YARN, the next-generation compute and resource management framework in Apache Hadoop, until recently had a single point of failure: the ResourceManager, which coordinates work in a YARN cluster. With planned (upgrades) or unplanned (node crashes) events, this central service, and YARN itself, could become unavailable.

Apache Hadoop YARN: Avoiding 6 Time-Consuming "Gotchas"

Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.

Here at Cloudera, we recently finished a push to get Cloudera Enterprise 5 (containing CDH 5.0.0 + Cloudera Manager 5.0.0) out the door along with more than 100 partner certifications.

Hello, Apache Hadoop 2.4.0

The community has voted to release Apache Hadoop 2.4.0.

Hadoop 2.4.0 includes myriad improvements to HDFS and MapReduce, including (but not limited to):

What’s New in Cloudera Manager 5?

Learn the new features and enhancements in Cloudera Manager 5, including support for YARN, management of third-party apps and frameworks, and more.

The response to the Oct. 2013 release of Cloudera Enterprise 5 Beta has been overwhelming, and Cloudera is busily working closely with several customers to incorporate their feedback.

Managing Multiple Resources in Hadoop 2 with YARN

An overview of some of Cloudera’s contributions to YARN that help support management of multiple resources, from multi resource scheduling in the Fair Schedule to node-level enforcement

As Apache Hadoop become ubiquitous, it is becoming more common for users to run diverse sets of workloads on Hadoop, and these jobs are more likely to have different resource profiles. For example, a MapReduce distcp job or Cloudera Impala query that does a simple scan on a large table may be heavily disk-bound and require little memory. Or, an Apache Spark (incubating) job executing an iterative machine-learning algorithm with complex updates may wish to store the entire dataset in memory and use spurts of CPU to perform complex computation on it.

Migrating to MapReduce 2 on YARN (For Operators)

Cloudera Manager lets you add a YARN service in the same way you would add any other Cloudera Manager-managed service.

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

Migrating to MapReduce 2 on YARN (For Users)

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

In this post, users of CDH (Cloudera’s distribution of Hadoop and related projects) who program MapReduce jobs will get a guide to the architectural and user-facing differences between MapReduce 1 (MR1) and MR2. (MR2 is the default processing framework in CDH 5, although MR1 will continue to be supported.) Operators/administrators can read a similar post designed for them here.

Terminology and Architecture

Apache Hadoop 2 is Here and Will Transform the Ecosystem

The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.

Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances, including (but not limited to):

Writing Hadoop Programs That Work Across Releases

In a fast-moving project like Apache Hadoop, there are always exciting new features introduced in each release. While it is tempting to make the most of these new features by upgrading to the latest release, users are often concerned about their code continuing to run.

In this post, you’ll get an overview of the the Hadoop API annotations and compatibility policies. Hadoop annotates specific APIs to be safe for use by end-users. By using these APIs, users can ensure their code works across a set of releases and be aware of what releases it might not work against.

Hadoop Releases and Compatibility

What I Learned During My Summer Internship at Cloudera, Part 2

The guest post below is from Wei Yan, a 2013 summer intern at Cloudera. In this post, he helpfully describes his personal projects from this summer. Thanks for your contributions, Wei!

As a Ph.D. student at Vanderbilt University, I work on the Apache Hadoop MapReduce framework, with a focus on optimizing data intensive computing tasks. Although I’m very familiar with MapReduce itself, my curiosity about the use cases for MapReduce and where it generally fits in the Big Data are drew me to Cloudera for the summer of 2013.

Improvements in the Hadoop YARN Fair Scheduler

Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” and dive into its new features and how to get them running on your cluster.

YARN/MR2 vs. MR1

YARN uses an updated terminology to reflect that it no longer just manages resources for MapReduce. From YARN’s perspective, a MapReduce job is an application. YARN schedules containers for map and reduce tasks to live in. What was referred to as pools in the MR1 Fair Scheduler has been updated to queue for consistency with the capacity scheduler. An excellent and deeper explanation is available here.

How Does it Work?

Apache Hadoop 2.0.3-alpha Released

Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements the more noteworthy changes include:

Apache Hadoop in 2013: The State of the Platform

For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.

In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.

MR2 and YARN Briefly Explained

With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?

This post aims to clarify these two terms.

What is YARN?