Cloudera Blog · MapReduce Posts
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.
It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.
In the rest of this post, we shall first discuss the Client API, followed by configurations and testing considerations, and finally commenting on the new changes related to the Job History Server and Web Servlets. We will use the terms MR1 and MR2 to refer to MapReduce in Hadoop 1.0 and Hadoop 2.0, respectively.
This posted was originally posted to the Apache Software Foundation MRUnit blog.
The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they're deployed to a production system.
In its monthly meeting in May of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache MRUnit, thus graduating it from the Incubator. This is a significant milestone in the life of MRUnit, which has come a long way since its inception as a Hadoop Contrib project in HADOOP-5518 contributed by Aaron Kimball.
In Building and Deploying MR2 we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design.
Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.
MapReduce 2.0 (a.k.a. MRv2 or YARN):
The new architecture divides the two major functions of the JobTracker – resource management and job life-cycle management – into separate components:
Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.
Background: Adverse Drug Events
An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.
Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.
A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Apache Hadoop 0.23.
A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the applications individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).
In MR2, the JobTracker aggregated functionality is separated across two new components:
- Central Resource Manager (RM): Management of resources in the cluster.
- Application Master (AM): Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
As a data scientist at Cloudera, I work with customers across a wide range of industries that use Apache Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Apache Pig and Apache Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.
As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.
Today, we’re pleased to introduce Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy: