Cloudera and Google are collaborating to bring Google Cloud Dataflow to Apache Spark users (and vice-versa). This new project is now incubating in Cloudera Labs!
“The future is already here—it’s just not evenly distributed.” —William Gibson
For the past decade, a lot of the future has been concentrated at Google’s headquarters in Mountain View. Because of the scale of its operations, Google usually bumped up against the limitations of the current state of the art before anyone else, and had to come up with its own solutions to the problems it encountered. From time to time, it would publish those solutions, either as open source software projects like Guava or protocol buffers, or as research papers that would challenge and inspire the broader academic and open source communities. Open source projects like Apache Hadoop, Apache HBase, and Apache Parquet (incubating) were all inspired by research papers that Google published about its internal data management systems.
With the release of Cloud Dataflow, Google is leveraging its cloud computing infrastructure to provide a service that developers can use to execute their own batch and streaming data pipelines. Cloud Dataflow is a descendant of the FlumeJava batch processing engine (which served as inspiration for both Apache Crunch and Apache Spark, the new standard for data processing on Hadoop), extended to support stream processing using ideas from Google’s MillWheel project. Even better, Google has released the Dataflow SDK as an Apache-licensed project that can support alternative backends, and Cloudera was pleased to collaborate with our friends at Google on a version of Dataflow that runs on Apache Spark. This new Dataflow “runner,” which lets users target a Dataflow pipeline for execution on Spark, has now joined Cloudera Labs as an incubating project (as usual, for testing and experimentation only).
One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts. Although the lambda architecture, best represented by Twitter’s Summingbird project, has been the recommended approach for the last few years, Jay Kreps wrote a blog post in July 2014 that argued for a “kappa architecture” based on a streaming-oriented execution model built on top of Apache Kafka. Cloud Dataflow ends up between these two extremes: the streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations on pipelines that do not process streaming data. Crucially, the client APIs for the batch and stream processing engines are identical, so any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.
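To make the “write the logic once, run it in either mode” idea concrete, here is a toy analogue in plain Java. This is not the Dataflow SDK (which expresses pipelines as PTransforms over PCollections); it is just a minimal sketch of the principle, with a word-count function defined once and then applied both to a complete batch and to a simulated stream of arriving windows. All class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnifiedPipelineSketch {
    // Pipeline logic defined once: count words, independent of how the
    // data arrives. (A toy stand-in for a reusable Dataflow transform.)
    static Map<String, Long> countWords(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
            .filter(w -> !w.isEmpty())
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // "Batch" mode: the whole data set is available at once.
        List<String> batch = List.of("the future is here", "the future is distributed");
        Map<String, Long> batchCounts = countWords(batch.stream());

        // "Streaming" mode (simulated): the same logic applied to each
        // window of newly arrived lines, merged into a running total.
        Map<String, Long> running = new HashMap<>();
        List<List<String>> windows = List.of(
            List.of("the future is here"),
            List.of("the future is distributed"));
        for (List<String> window : windows) {
            countWords(window.stream())
                .forEach((word, count) -> running.merge(word, count, Long::sum));
        }

        // Both execution contexts produce the same result.
        System.out.println(batchCounts.equals(running));
    }
}
```

The point of the sketch is that only the driver code differs between the two modes; `countWords` itself never needs to change, which is the property the Dataflow client API provides at pipeline scale.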
You can begin constructing your own Dataflow pipelines for local execution by downloading the SDK and reading the getting started guide (see also the StackOverflow tag for Dataflow). Instructions for setting up the Spark runner for Dataflow are in the README in our GitHub repository, along with a simple example pipeline you can set up and run. Note that the Spark runner currently requires Apache Spark 1.2, which ships as part of CDH 5.3.0, and currently supports only batch pipelines while we work on extending Spark Streaming to support all of the windowing functionality provided by Dataflow.
Enjoy! To provide feedback about the new Spark runner for Dataflow, use our Cloudera Labs discussion forum.
Josh Wills is Cloudera’s Senior Director of Data Science.