An ingest pattern that we commonly see adopted by Cloudera customers is an Apache Spark Streaming application that reads data from Kafka. Streaming data continuously from Kafka has many benefits, such as the ability to gather insights faster. However, users must consider how to manage Kafka offsets so that their streaming application can recover from failures. In this post, we provide an overview of offset management and cover the following topics:
- Storing offsets in external data stores
- Not managing offsets
Overview of Offset Management
Spark Streaming's integration with Kafka allows users to read messages from one or more Kafka topics.
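To make the idea of offset management concrete, here is a minimal, library-free sketch in Python. It does not use the Spark or Kafka APIs: the `InMemoryOffsetStore` class, the `process_batch` function, and the dictionary standing in for a Kafka log are all hypothetical names invented for illustration. The sketch assumes offsets are tracked per topic and partition and are saved only after a batch has been processed, so a restarted application resumes from the last saved position.

```python
from typing import Dict, List, Tuple

# A (topic, partition) pair identifies one ordered log of messages.
TopicPartition = Tuple[str, int]


class InMemoryOffsetStore:
    """Hypothetical stand-in for an external offset store."""

    def __init__(self) -> None:
        self._next_offsets: Dict[TopicPartition, int] = {}

    def load(self, tp: TopicPartition) -> int:
        # With no saved offset, start from the beginning of the partition.
        return self._next_offsets.get(tp, 0)

    def save(self, tp: TopicPartition, next_offset: int) -> None:
        self._next_offsets[tp] = next_offset


def process_batch(
    log: Dict[TopicPartition, List[str]],
    store: InMemoryOffsetStore,
    tp: TopicPartition,
    max_messages: int,
) -> List[str]:
    """Read up to max_messages from the saved offset, then save the new offset."""
    start = store.load(tp)
    batch = log[tp][start:start + max_messages]
    processed = [message.upper() for message in batch]  # placeholder processing
    # Save only after processing succeeds, so a failure before this line
    # means the same messages are read again on the next attempt.
    store.save(tp, start + len(batch))
    return processed


# Simulated partition 0 of a topic named "events" (assumed data).
partition = ("events", 0)
simulated_log = {partition: ["a", "b", "c", "d", "e"]}
offset_store = InMemoryOffsetStore()

first_batch = process_batch(simulated_log, offset_store, partition, 3)
# A restarted application that reuses the same store resumes where it left off.
second_batch = process_batch(simulated_log, offset_store, partition, 3)
```

Because the offset is written after the batch is processed, this sketch leans toward reprocessing messages after a failure rather than skipping them. A real deployment would replace the in-memory dictionary with a durable store.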
With an ever-increasing number of IoT use cases on the CDH platform, security for such workloads is of paramount importance. This blog post describes how one can consume data from Kafka in Spark, two critical components for IoT use cases, in a secure manner.
The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allowed consumers to read data from a secure Kafka cluster.
Learn about the new Apache Flume and Apache Kafka integration (aka, “Flafka”) available in CDH 5.8 and its support for the new enterprise features in Kafka 0.9.
Over a year ago, we wrote about the integration of Flume and Kafka (Flafka) for data ingest into Apache Hadoop. Since then, Flafka has proven to be quite popular among CDH users, and we believe that popularity is based on the fact that in Kafka deployments,
Apache Hadoop is a proven platform for long-term storage and archiving of structured and unstructured data. Related ecosystem tools, such as Apache Flume and Apache Sqoop, allow users to easily ingest structured and semi-structured data without requiring the creation of custom code. Unstructured data, however, is a more challenging subset of data that typically lends itself to batch-ingestion methods. Although such methods are suitable for many use cases,
As a warm-up to Spark Summit West in San Francisco (June 6-8), we’ve added a new project to Cloudera Labs that makes building Spark Streaming pipelines considerably easier.
Spark Streaming is the go-to engine for stream processing in the Cloudera stack. It allows developers to build stream data pipelines that harness the rich Spark API for parallel processing, expressive transformations, fault tolerance, and exactly-once processing. But it requires a programmer to write code,