Cloudera Engineering Blog · Spark Posts
Thanks to Sam Shuster, Software Engineer at Edmunds.com, for the guest post below about his company’s use case for Spark Streaming, SparkOnHBase, and Morphlines.
Every year, the Super Bowl brings parties, food and hopefully a great game to appease everyone’s football appetites until the fall. With any event that brings in around 114 million viewers with larger numbers each year, Americans have also grown accustomed to commercials with production budgets on par with television shows and with entertainment value that tries to rival even the game itself.
Set up your own, or even a shared, environment for doing interactive analysis of time-series data.
Although software engineering offers several methods and approaches to produce robust and reliable components, a more lightweight and flexible approach is required for data analysts—who do not build “products” per se but still need high-quality tools and components. Thus, recently, I tried to find a way to re-use existing libraries and datasets stored already in HDFS with Apache Spark.
Thanks to Cody Koeninger, Senior Software Engineer at Kixer, for the guest post below about Apache Kafka integration points in Apache Spark 1.3. Spark 1.3 will ship in CDH 5.4.
The new release of Apache Spark, 1.3, includes new experimental RDD and DStream implementations for reading data from Apache Kafka. As the primary author of those features, I’d like to explain their implementation and usage. You may be interested if you would benefit from:
Learn techniques for tuning your Apache Spark jobs for optimal efficiency.
(Editor’s note: Sandy presents on “Estimating Financial Risk with Spark” at Spark Summit East on March 18.)
Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohammad Zubair, Professor of Computer Science at Old Dominion University, for this guest post that demonstrates how to easily deploy exposure calculations on Apache Spark for in-memory analytics on scenario data.
Since the 2007 global financial crisis, financial institutions now more accurately measure the risks of over-the-counter (OTC) products. It is now standard practice for institutions to adjust derivative prices for the risk of the counter-party’s, or one’s own, default by means of credit or debit valuation adjustments (CVA/DVA).
A Hive-on-Spark beta is now available via CDH parcel. Give it a try!
The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.
Many thanks to David Whiting of Spotify for allowing us to re-publish the following Spotify Labs post about its Apache Crunch use cases.
(Note: Since this post was originally published in November 2014, many of the library functions described have been added into crunch-core, so they’ll soon be available to all Crunch users by default.)
Authored by a substantial portion of Cloudera’s Data Science team (Sean Owen, Sandy Ryza, Uri Laserson, Josh Wills), Advanced Analytics with Spark (currently in Early Release from O’Reilly Media) is the newest addition to the pipeline of ecosystem books by Cloudera engineers. I talked to the authors recently.
Why did you decide to write this book?
Cloudera and Google are collaborating to bring Google Cloud Dataflow to Apache Spark users (and vice-versa). This new project is now incubating in Cloudera Labs!
“The future is already here—it’s just not evenly distributed.” —William Gibson
Cloudera and Intel engineers are collaborating to make Spark’s shuffle process more scalable and reliable. Here are the details about the approach’s design.
What separates computation engines like MapReduce and Apache Spark (the next-generation data processing engine for Apache Hadoop) from embarrassingly parallel systems is their support for “all-to-all” operations. As distributed engines, MapReduce and Spark operate on sub-slices of a dataset partitioned across the cluster. Many operations process single data-points at a time and can be carried out fully within each partition. All-to-all operations must consider the dataset as a whole; the contents of each output record can depend on records that come from many different partitions. In Spark,
reduceByKey are popular examples of these types of operations.