Tag Archives: MapReduce

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Categories: Data Ingestion Flume Hadoop HBase Kafka Spark

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns,

Read more

How-to: Get Started with CDH on OpenStack with Sahara

Categories: Cloud Guest

The recent OpenStack Kilo release adds many features to the Sahara project, which provides a simple means of provisioning an Apache Hadoop (or Spark) cluster on top of OpenStack. This how-to, from Intel Software Engineer Wei Ting Chen, explains how to use the Sahara CDH plugin with this new release.

[Ed note: Cloudera does not provide support for Cloudera Enterprise on OpenStack. Cloudera customers should contact their representative with questions.]

Prerequisites

This how-to assumes that OpenStack is already installed.

Read more

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

Categories: Guest Spark

Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.

I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing),

Read more

How-to: Translate from MapReduce to Apache Spark (Part 2)

Categories: How-to MapReduce Spark

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization.

Apache Spark is rising in popularity as an alternative to MapReduce, in a large part due to its expressive API for complex data processing. A few months ago, my colleague, Sean Owen wrote a post describing how to translate functionality from MapReduce into Spark, and in this post, I’ll extend that conversation to cover additional functionality.

Read more

Text Mining with Impala

Categories: Guest Impala Use Case

Thanks to Torsten Kilias and Alexander Löser of the Beuth University of Applied Sciences in Berlin for the following guest post about their INDREX project and its integration with Impala for integrated management of textual and relational data.

Textual data is a core source of information in the enterprise. Example demands arise from sales departments (monitor and identify leads), human resources (identify professionals with capabilities in ‘xyz’), market research (campaign monitoring from the social web),

Read more