Cloudera Engineering Blog · MapReduce Posts

Apache Hive on Apache Spark: The First Demo

The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

How-to: Translate from MapReduce to Apache Spark

The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.

Venerable MapReduce has been Apache Hadoop‘s work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

The CDH software stack lets you use your tool of choice with the Parquet file format – - offering the benefits of columnar storage at each phase of data processing. 

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

The Truth About MapReduce Performance on SSDs

Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.

In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.

Getting MapReduce 2 Up to Speed

Thanks to the improvements described here, CDH 5 will ship with a version of MapReduce 2 that is just as fast (or faster) than MapReduce 1.

Performance fixes are tiny, easy, and boring, once you know what the problem is. The hard work is in putting your finger on that problem: narrowing, drilling down, and measuring, measuring, measuring.

Write MapReduce Jobs in Idiomatic Clojure with Parkour

Thanks to Marshall Bockrath-Vandegrift of advanced threat detection/malware company (and CDH user) Damballa for the following post about his Parkour project, which offers libraries for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.

Clojure is Lisp-family functional programming language which targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is frequently simpler and clearer even than Java’s.

Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications

Our thanks to Databricks, the company behind Apache Spark (incubating), for providing the guest post below. Cloudera and Databricks recently announced that Cloudera will distribute and support Spark in CDH. Look for more posts describing Spark internals and Spark + CDH use cases in the near future.

Migrating to MapReduce 2 on YARN (For Operators)

Cloudera Manager lets you add a YARN service in the same way you would add any other Cloudera Manager-managed service.

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

Migrating to MapReduce 2 on YARN (For Users)

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

In this post, users of CDH (Cloudera’s distribution of Hadoop and related projects) who program MapReduce jobs will get a guide to the architectural and user-facing differences between MapReduce 1 (MR1) and MR2. (MR2 is the default processing framework in CDH 5, although MR1 will continue to be supported.) Operators/administrators can read a similar post designed for them here.

Terminology and Architecture

Meet the Project Founder: Josh Wills

In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.

What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.

Older Posts