Category Archives: Kafka

How-to: Ingest Email into Apache Hadoop in Real Time for Analysis

Categories: Data Ingestion, Flume, Hadoop, Kafka, Search, Spark, Use Case

Apache Hadoop is a proven platform for long-term storage and archiving of structured and unstructured data. Related ecosystem tools, such as Apache Flume and Apache Sqoop, allow users to easily ingest structured and semi-structured data without requiring the creation of custom code. Unstructured data, however, is a more challenging subset of data that typically lends itself to batch-ingestion methods. Although such methods are suitable for many use cases,

Read more

New in Cloudera Labs: Envelope (for Apache Spark Streaming)

Categories: Cloudera Labs, Data Ingestion, Kafka, Kudu

As a warm-up to Spark Summit West in San Francisco (June 6-8), we’ve added a new project to Cloudera Labs that makes building Spark Streaming pipelines considerably easier.

Spark Streaming is the go-to engine for stream processing in the Cloudera stack. It allows developers to build stream data pipelines that harness the rich Spark API for parallel processing, expressive transformations, fault tolerance, and exactly-once processing. But it requires a programmer to write code,

Read more

Inside Santander’s Near Real-Time Data Ingest Architecture (Part 2)

Categories: HBase, Kafka, Use Case

Thanks to Pedro Boado and Abel Fernandez Alfonso from Santander’s engineering team for their collaboration on this post about how Santander UK is using Apache HBase as a near real-time serving engine to power its innovative Spendlytics app.

The Spendlytics iOS app is designed to help Santander’s personal debit and credit-card customers keep on top of their spending, including payments made via Apple Pay. It uses real-time transaction data to enable customers to analyze their card spend across time periods (weekly,

Read more

Building, Benchmarking, and Tuning Syslog Ingest Architecture at Vodafone UK

Categories: Flume, Hadoop, Kafka, Platform Security & Cybersecurity, Use Case

Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest nearly 1 million events per second. In this post, learn about the architecture and the performance-tuning techniques that got it there.

SIEM platforms provide a useful tool for identifying indicators of compromise across disparate infrastructure. The catch is, they’re only as accurate as the fidelity of the data involved, which is why Apache Hadoop is becoming such a valuable platform for that use case.

Read more

What’s New in Cloudera’s Distribution of Apache Kafka?

Categories: Kafka, Platform Security & Cybersecurity

Cloudera’s distribution of Kafka (now on release 2.0) is based on Apache Kafka 0.9 and includes various new features (especially for security and usability), enhancements, and bug fixes.

Kafka is rapidly gaining momentum in enterprise Apache Hadoop deployments and has become the de facto messaging bus in most Big Data technology stacks. During this period of rapid adoption (and since Cloudera began shipping Kafka in February 2015),

Read more