New in Cloudera Labs: Envelope (for Apache Spark Streaming)

Categories: Cloudera Labs Data Ingestion Kafka Kudu

As a warm-up to Spark Summit West in San Francisco (June 6-8),  we’ve added a new project to Cloudera Labs that makes building Spark Streaming pipelines considerably easier.

Spark Streaming is the go-to engine for stream processing in the Cloudera stack. It allows developers to build stream data pipelines that harness the rich Spark API for parallel processing, expressive transformations, fault tolerance, and exactly-once processing. But it requires a programmer to write code, and a lot of it is very repetitive!

For batch ETL in Apache Hadoop, this hasn’t been a requirement—most of us have been able to put together elaborate Apache Hive plus Apache Oozie pipelines without ever opening an IDE. Sometimes it does make sense to write ETL with code, but often the task at hand is straightforward enough that we would rather declare what we want to get done rather than code exactly how. Stream processing should be that way, too.

Envelope is a new Cloudera Labs project that aims to fill that gap. It’s a pre-developed Spark Streaming application that you can use to create your own streaming pipeline by simply providing a configuration file that defines what you want your pipeline to do. The focus of Envelope is on streams arriving into Hadoop that need some level of processing or transformation, and that then need to land in a storage layer for end users to access. In other words, your data is arriving as a stream and you want to do some work on it before people start using it. To build that from scratch with the Spark Streaming API would require a lot of plumbing code, but with Envelope you can skip that coding and just focus on what is individual about your pipeline.

What Can Envelope Do?

Envelope implements many aspects of a typical streaming pipeline:

  • Ingestion of a stream into the Spark Streaming job. An implementation is included for Apache Kafka.
  • Translation of a stream into typed records. This brings the data into a common format for simpler transformation. Implementations are included for delimited text, key-value pairs, and Apache Avro records.
  • Lookups of existing records in the storage layer. This is useful for enriching your stream.
  • SQL transformation of the stream data model into the storage data model. Simply provide the Spark SQL query and Envelope will run it for you over your stream.
  • Plan how to write the derived stream to the storage layer. Implementations are provided for plain appends, for upserts, and for tracking history using Type II modeling.
  • Write the planned changes to the storage table. An implementation is included for Apache Kudu (incubating).

Where Envelope provides implementations in the points above, you can do the same by providing your own class and referencing it in the pipeline configuration.

You can find more details on Envelope functionality in the project readme and on the available configurations in the project wiki.

How to Get Started

The easiest way to try out Envelope is to compile it and run one of the included example pipelines on your cluster. As at the time of writing, Envelope requires a CDH 5.5 or above cluster, and also Kafka 0.9 and Kudu 0.8 if using those components. Envelope is just another Spark Streaming application, so it doesn’t need to be installed in Cloudera Manager.

There are two examples in Envelope: a pipeline that reads in mock FIX financial trade messages and tracks the execution history of the order, and a pipeline that reads in traffic congestion readings and gives a time-smoothed aggregation of the traffic conditions. Envelope has data generators for these examples so that you can get up and running straight away. Both examples read data from Kafka and write out to Kudu.

You can find more details on the provided examples in the project readme.

Why Kafka and Kudu?

At Cloudera, we are excited about the possibilities of streaming pipelines that flow from Kafka to Kudu. When data lands in Kudu then it is immediately available for SQL analytics through Apache Impala (incubating). This is the vision of near-real-time analytics: process the data live and land it right into where analytics queries can be run. Building out an effective pipeline like this can be a powerful advantage when it comes to making decisions on events that are happening right now in your organization.

How to Get Help

You can reach out to us on the Cloudera Labs community forum where you can ask questions and give any feedback that comes to mind. As a fair warning, Envelope, like all Cloudera Labs projects, is not officially supported by Cloudera, and as a newly formed project it isn’t ready for production without complete and rigorous testing of your own.

Future Work

There are many additions that would make sense for Envelope, such as more translators, or other storage systems such as Apache HBase and Apache Solr. We have captured some of these on the project issues list. If you have made enhancements that you’d like to propose, feel free to send a pull request or add an issue with your comments.

Jeremy Beard is a Senior Solutions Architect at Cloudera.


Leave a Reply

Your email address will not be published. Required fields are marked *