Cloudera Engineering Blog · Spark Posts

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.

I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk.

How-to: Translate from MapReduce to Apache Spark (Part 2)

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization.

Apache Spark is rising in popularity as an alternative to MapReduce, in a large part due to its expressive API for complex data processing. A few months ago, my colleague, Sean Owen wrote a post describing how to translate functionality from MapReduce into Spark, and in this post, I’ll extend that conversation to cover additional functionality.

How-to: Tune Your Apache Spark Jobs (Part 2)

In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect Spark job performance.

In this post, we’ll finish what we started in “How to Tune Your Apache Spark Jobs (Part 1)”. I’ll try to cover pretty much everything you could care to know about making a Spark program run fast. In particular, you’ll learn about resource tuning, or configuring Spark to take advantage of everything the cluster has to offer. Then we’ll move to tuning parallelism, the most difficult as well as most important parameter in job performance. Finally, you’ll learn about representing the data itself, in the on-disk form which Spark will read (spoiler alert: use Apache Avro or Apache Parquet) as well as the in-memory format it takes as it’s cached or moves through the system.

Tuning Resource Allocation

How Used Spark Streaming to Build a Near Real-Time Dashboard

Thanks to Sam Shuster, Software Engineer at, for the guest post below about his company’s use case for Spark Streaming, SparkOnHBase, and Morphlines.

Every year, the Super Bowl brings parties, food and hopefully a great game to appease everyone’s football appetites until the fall. With any event that brings in around 114 million viewers with larger numbers each year, Americans have also grown accustomed to commercials with production budgets on par with television shows and with entertainment value that tries to rival even the game itself.

How-to: Build Re-usable Spark Programs using Spark Shell and Maven

Set up your own, or even a shared, environment for doing interactive analysis of time-series data.

Although software engineering offers several methods and approaches to produce robust and reliable components, a more lightweight and flexible approach is required for data analysts—who do not build “products” per se but still need high-quality tools and components. Thus, recently, I tried to find a way to re-use existing libraries and datasets stored already in HDFS with Apache Spark.

Exactly-once Spark Streaming from Apache Kafka

Thanks to Cody Koeninger, Senior Software Engineer at Kixer, for the guest post below about Apache Kafka integration points in Apache Spark 1.3. Spark 1.3 will ship in CDH 5.4.

The new release of Apache Spark, 1.3, includes new experimental RDD and DStream implementations for reading data from Apache Kafka. As the primary author of those features, I’d like to explain their implementation and usage. You may be interested if you would benefit from:

How-to: Tune Your Apache Spark Jobs (Part 1)

Learn techniques for tuning your Apache Spark jobs for optimal efficiency.

(Editor’s note: Sandy presents on “Estimating Financial Risk with Spark” at Spark Summit East on March 18.)

Calculating CVA with Apache Spark

Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohammad Zubair, Professor of Computer Science at Old Dominion University, for this guest post that demonstrates how to easily deploy exposure calculations on Apache Spark for in-memory analytics on scenario data.

Since the 2007 global financial crisis, financial institutions now more accurately measure the risks of over-the-counter (OTC) products. It is now standard practice for institutions to adjust derivative prices for the risk of the counter-party’s, or one’s own, default by means of credit or debit valuation adjustments (CVA/DVA).

Download the Hive-on-Spark Beta

A Hive-on-Spark beta is now available via CDH parcel. Give it a try!

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Data Processing with Apache Crunch at Spotify

Many thanks to David Whiting of Spotify for allowing us to re-publish the following Spotify Labs post about its Apache Crunch use cases.

(Note: Since this post was originally published in November 2014, many of the library functions described have been added into crunch-core, so they’ll soon be available to all Crunch users by default.)

Older Posts