Category Archives: Kafka

Robust Message Serialization in Apache Kafka Using Apache Avro, Part 1

Categories: Avro CDH How-to Kafka

In Apache Kafka, Java applications called producers write structured messages to a Kafka cluster (made up of brokers). Similarly, Java applications called consumers read these messages from the same cluster. In some organizations, different groups are in charge of writing and managing the producers and consumers. In such cases, one major pain point can be coordinating the agreed-upon message format between producers and consumers.

This example demonstrates how to use Apache Avro to serialize records that are produced to Apache Kafka while allowing schemas to evolve and producer and consumer applications to be updated independently of one another.
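
The schema-evolution idea can be illustrated without any Kafka or Avro dependency. The sketch below is a plain-Python simplification, not the Avro library and not the code from the post. It assumes a hypothetical `User` record, uses JSON in place of Avro's binary encoding, and mimics only one Avro resolution rule: a reader fills in fields missing from the writer's data with the defaults declared in the reader's schema, and ignores writer fields it does not know.

```python
import json

# Hypothetical schemas, written as Avro-style JSON purely for illustration.
WRITER_SCHEMA_V1 = {
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"}],
}
READER_SCHEMA_V2 = {
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"},
               {"name": "email", "type": "string", "default": "unknown"}],
}

def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]        # value supplied by the writer
        elif "default" in field:
            resolved[name] = field["default"]    # reader-side default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # writer fields unknown to the reader are dropped

# A message from a producer that has not been upgraded yet (schema v1).
old_message = json.dumps({"id": 1, "name": "Ada"})
# The upgraded consumer (schema v2) can still read it.
user = resolve(json.loads(old_message), READER_SCHEMA_V2)
```

Because the new `email` field carries a default, the consumer can be upgraded before the producer; a new field without a default would make old messages unreadable for the new consumer.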


Scalability of Kafka Messaging using Consumer Groups

Categories: Data Ingestion Flume Kafka Use Case

Traditional messaging models fall into two categories: shared message queues and publish-subscribe models. Both have their own pros and cons, and neither could successfully handle big data ingestion at scale because of limitations in their design. Apache Kafka implements a publish-subscribe messaging model that provides fault tolerance and the scalability to handle large volumes of streaming data for real-time analytics. It was developed at LinkedIn in 2010 to meet the company's growing data pipeline needs. Apache Kafka fills the gaps that traditional messaging models left open.


Cloudera Enterprise 5.12 is Now Available

Categories: Altus CDH Cloud Cloudera Manager Cloudera Navigator Data Science Hue Impala Kafka Kudu

Cloudera is pleased to announce that Cloudera Enterprise 5.12 is now generally available (GA). The release includes enhancements for running in cloud environments (with broader ADLS support and improved AWS Spot Instance support), usability and productivity improvements for both data science and analytic workloads, as well as performance gains and self-service performance management across a range of workloads.

As usual, there are also a number of quality enhancements, bug fixes, and other improvements across the stack.


Offset Management For Apache Kafka With Apache Spark Streaming

Categories: CDH Kafka Spark

An ingest pattern that we commonly see adopted by Cloudera customers is an Apache Spark Streaming application that reads data from Kafka. Streaming data continuously from Kafka has many benefits, such as the ability to gather insights faster. However, users must consider how to manage Kafka offsets in order to recover their streaming application from failures. In this post, we provide an overview of offset management and cover the following topics:

  • Storing offsets in external data stores
    • Checkpoints
    • HBase
    • ZooKeeper
    • Kafka
  • Not managing offsets
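
The recovery role that stored offsets play can be sketched in a few lines. This is a minimal simulation under stated assumptions, not Spark Streaming or Kafka client code: a Python list stands in for one Kafka partition, a dict stands in for the external offset store (HBase, ZooKeeper, and so on), and `run_batch`, the group name, and the partition name are all hypothetical.

```python
# Hypothetical in-memory stand-ins: a list plays one Kafka partition and a
# dict plays the external offset store.
partition = [f"event-{i}" for i in range(10)]
offset_store = {}

def run_batch(group, topic_partition, batch_size, sink, fail=False):
    """Process one batch, then record the next offset to read."""
    # Resume from the saved offset, or from the beginning if none exists.
    start = offset_store.get((group, topic_partition), 0)
    batch = partition[start:start + batch_size]
    if fail:
        raise RuntimeError("simulated crash before the offset was saved")
    sink.extend(batch)                                   # "process" the batch
    offset_store[(group, topic_partition)] = start + len(batch)
    return len(batch)

processed = []
run_batch("app", "events-0", 4, processed)               # offsets 0-3
try:
    run_batch("app", "events-0", 4, processed, fail=True)
except RuntimeError:
    pass                                                 # the application restarts here
run_batch("app", "events-0", 4, processed)               # resumes at offset 4
```

After the restart, the application continues from the last saved offset instead of skipping or rereading the whole partition. In this sketch the offset is saved after the output is written, so a crash between those two steps would replay that batch; the output then needs to tolerate duplicates.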

Overview of Offset Management

Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics.


Reading data securely from Apache Kafka to Apache Spark

Categories: CDH Kafka Platform Security & Cybersecurity Sentry Spark

Introduction

With an ever-increasing number of IoT use cases on the CDH platform, security for such workloads is of paramount importance. This blog post describes how one can consume data from Kafka in Spark, two critical components for IoT use cases, in a secure manner.

The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allowed consumers to read data from a secure Kafka cluster.
