Category Archives: MapReduce

Migrating to MapReduce 2 on YARN (For Users)

Categories: General, Hadoop, MapReduce, YARN

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

Read more

Meet the Project Founder: Josh Wills

Categories: Data Science, Hadoop, MapReduce, Meet the Engineer

In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.

What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.

Read more

How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM

Categories: How-to, MapReduce, QuickStart VM

One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?

To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM. The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.

Read more

Improvements in the Hadoop YARN Fair Scheduler

Categories: Hadoop, MapReduce, YARN

Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is, how it fulfills that role, and what it means to be a YARN “scheduler.” You’ll also dive into its new features and see how to get them running on your cluster.

Read more

Cloudera ML: New Open Source Libraries and Tools for Data Scientists

Categories: Community, Data Science, General, Mahout, MapReduce, Tools

Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid, though.

Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s GitHub repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP,

Read more