Spark Dataflow Joins Google’s Dataflow SDK

Categories: Cloud Cloudera Labs General Spark

Spark Dataflow from Cloudera Labs is now part of Google’s New Dataflow SDK, which will be proposed to the Apache Incubator.

Spark Dataflow is an experimental implementation of Google’s Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills, and entered Cloudera Labs exactly a year ago. Since then, we’ve seen a number of contributions to the project, culminating in the recent addition of an implementation of streaming (running on Spark Streaming) by Amit Sela from PayPal.

The most notable feature of the Dataflow model is the way that it combines batch and streaming processing in a single API. This feature allows developers to reuse business logic in different pipelines regardless of whether they are batch or streaming pipelines. Another way of looking at this is in terms of a trade-off between latency and throughput: batch trades latency for throughput, whereas streaming trades throughput for latency. This trade-off has always been present in data processing systems, the difference is that previously developers had to make the trade-off very early in the design process by selecting the system (and therefore the API) they were going to use (e.g. MapReduce for throughput, Apache Storm for latency). With Dataflow the trade-off can be made much later in the process: by changing a few lines of code so that a pipeline reads from an unbounded source (like an Apache Kafka topic) rather than a bounded source (like an HDFS file). The processing logic can be the same in both cases.

Today, Google announced that it will propose the Dataflow SDK for incubation at the Apache Software Foundation (ASF). We are pleased to announce that the Spark Dataflow runner will form a part of the same incubating project. The Apache Flink Dataflow runner will also be contributed, which opens up the possibility of sharing code and tests between the SDK and the runners in the future. (In the context of Dataflow, a runner is a client-side abstraction that runs your pipeline for you, perhaps locally, using the local runner, or in a distributed fashion using one of the other runners, like the Cloud runner, or the Spark runner.)

Cloudera employees have worked with many projects at the ASF from the very beginning, and we recognize the value that being supported by an open and diverse community gives to a project. With Dataflow moving to the ASF, we look forward to seeing more people get involved in advancing the technology behind it.

Tom White is a data scientist at Cloudera, and a member of the Apache Hadoop PMC.