Tag Archives: hadoop mapreduce

How-to: Translate from MapReduce to Apache Spark

Categories: How-to MapReduce Spark

The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.

Venerable MapReduce has been Apache Hadoop's workhorse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing and batch-oriented ETL (extract-transform-load) operations.
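The contrast the post draws between the Mapper/Reducer API and Spark's chained RDD transformations can be sketched in plain Python (no Hadoop or Spark required — this is only an illustration of the two programming models, using a word count as the example):

```python
from itertools import groupby
from operator import itemgetter

lines = ["hadoop spark hadoop", "spark spark"]

# Mapper/Reducer style: an explicit map phase emitting (key, value)
# pairs, a shuffle (sort + group by key), then a reduce phase.
mapped = [(word, 1) for line in lines for word in line.split()]
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))
counts = {word: sum(v for _, v in pairs) for word, pairs in shuffled}

print(counts)  # {'hadoop': 2, 'spark': 3}
```

In real PySpark the same computation reads as one chain of RDD transformations, roughly `sc.parallelize(lines).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)` — the shuffle is implied by `reduceByKey` rather than spelled out, which is the core difference in how the two APIs are written.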

As Hadoop’s usage has broadened,

Read more

How-to: Convert Existing Data into Parquet

Categories: Hadoop Parquet

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)
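The advantage described above can be shown with a toy in-memory sketch (this is not Parquet itself, just the row-vs-column layout idea, with hypothetical field names):

```python
# Row-oriented layout: each record stored together; a query touching
# one column must still walk every field of every row.
rows = [
    {"id": 1, "country": "US", "bytes": 100},
    {"id": 2, "country": "DE", "bytes": 250},
    {"id": 3, "country": "US", "bytes": 50},
]
row_scan = sum(r["bytes"] for r in rows)

# Column-oriented layout: each column stored contiguously; the same
# query reads only the one column it needs, and runs of identical
# values (like the repeated "US") encode and compress well.
columns = {
    "id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "bytes": [100, 250, 50],
}
col_scan = sum(columns["bytes"])

assert row_scan == col_scan == 400
```

For a query that aggregates one column out of many, the columnar layout touches a fraction of the data the row layout does, which is where the performance gains for analytic queries come from.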

Last year, Cloudera, in collaboration with Twitter and others, released a new Apache Hadoop-friendly, binary, columnar file format called Parquet. (Parquet was recently proposed for the ASF Incubator.) In this post,

Read more

Native Parquet Support Comes to Apache Hive

Categories: Hive Impala Parquet

Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently, using techniques described in Google's original Dremel paper. (For more technical details on the Parquet format, read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)

Before discussing the Parquet Hive integration,

Read more

This Month in the Ecosystem (November 2013)

Categories: Community Hadoop

Welcome to our fifth edition of “This Month in the Ecosystem,” a digest of highlights from November 2013 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

With the holidays upon us, the news in November was sparse. Even so, the ecosystem never stops churning!

  • Continuuity Weave Proposed as an Apache Incubator Project
    Weave, an effort to make building new apps on top of YARN much easier for mainstream developers,

Read more

What I Learned During My Summer Internship at Cloudera, Part 2

Categories: Cloudera Life YARN

The guest post below is from Wei Yan, a 2013 summer intern at Cloudera. In this post, he helpfully describes his personal projects from this summer. Thanks for your contributions, Wei!

As a Ph.D. student at Vanderbilt University, I work on the Apache Hadoop MapReduce framework, with a focus on optimizing data-intensive computing tasks. Although I'm very familiar with MapReduce itself, my curiosity about the use cases for MapReduce, and where it generally fits in the Big Data arena, drew me to Cloudera for the summer of 2013.

Read more