Category Archives: Data Ingestion

Scalability of Kafka Messaging using Consumer Groups

Categories: Data Ingestion, Flume, Kafka, Use Case

Traditional messaging models fall into two categories: shared message queues and publish-subscribe models. Both have their own pros and cons, but neither could successfully handle big data ingestion at scale due to limitations in its design. Apache Kafka implements a publish-subscribe messaging model that provides fault tolerance and the scalability to handle large volumes of streaming data for real-time analytics. Developed at LinkedIn in 2010 to meet the company's growing data pipeline needs, Kafka bridges the gaps that traditional messaging models left open.
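The scalability in the title comes from consumer groups. As a minimal sketch of the mechanism (the broker address, topic name, and group id below are placeholders, and it assumes a recent Kafka client, 2.x or later, on Scala 2.12+), every consumer started with the same group.id is assigned a disjoint subset of the topic's partitions, so throughput grows simply by starting more instances:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object GroupedConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // placeholder broker address
    props.put("group.id", "ingest-group")            // consumers sharing this id split the topic's partitions
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))   // placeholder topic

    // Each running instance that joins "ingest-group" receives its own subset
    // of partitions, so adding instances spreads the load across them.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach(r => println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
    }
  }
}
```

Partition assignment and rebalancing are handled by the broker-side group coordinator, so instances can be added or removed without any code changes.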

Read more

Bi-temporal data modeling with Envelope

Categories: CDH, Data Ingestion, Impala, Kudu, Spark

One of the most fundamental aspects a data model can convey is how something changes over time. This makes sense when considering that we build data models to capture what is happening in the real world, and the real world is constantly changing. The challenge is that it's not just that new things are occurring; existing things are changing too, and if in our data models we overwrite the old state of an entity with the new state, then we have lost information about the change.
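As a rough illustration of the bi-temporal idea (the field names here are ours, not Envelope's), each record carries two independent time ranges: one for when the fact was true in the real world, and one for when the data system knew about it.

```scala
import java.time.Instant

// Hypothetical bi-temporal record: the event-time range says when the address
// was true in the real world; the system-time range says when this version of
// the row was the one the pipeline believed.
case class CustomerAddress(
  customerId: Long,
  address: String,
  eventTimeFrom: Instant,    // valid-from in the real world
  eventTimeTo: Instant,      // valid-to in the real world (far-future sentinel if still current)
  systemTimeFrom: Instant,   // when this version was recorded
  systemTimeTo: Instant      // when this version was superseded (far-future sentinel if still current)
)
```

Instead of overwriting a row when the address changes, a new version is appended and the old version's ranges are closed off, so both the real-world history and the history of what we knew are preserved.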

Read more

Up and running with Apache Spark on Apache Kudu

Categories: CDH, Data Ingestion, Data Science, General, Hadoop, How-to, Impala, Kudu, Spark, Training, Use Case

After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and explain how to get up and running quickly, as Kudu is already a first-class citizen in Spark’s ecosystem.
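As a taste of that integration (the master address, table name, and column below are placeholders, and the exact read syntax varies across kudu-spark releases), a Kudu table can be loaded straight into a Spark DataFrame:

```scala
import org.apache.spark.sql.SparkSession

object KuduQuickstart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-quickstart").getOrCreate()

    // Load an existing Kudu table as a DataFrame; older kudu-spark releases use the
    // full source name "org.apache.kudu.spark.kudu" instead of the short "kudu" alias.
    val df = spark.read
      .options(Map(
        "kudu.master" -> "kudu-master:7051",            // placeholder master address
        "kudu.table"  -> "impala::default.my_table"))   // placeholder table name
      .format("kudu")
      .load()

    df.filter("amount > 100").show()                    // hypothetical column
  }
}
```

The same module also exposes a KuduContext for writing DataFrames back to Kudu with inserts, upserts, updates, and deletes.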

As the Apache Kudu development team celebrates the initial 1.0 release launched on September 19, and the most recent 1.2.0 version now GA as part of Cloudera’s CDH 5.10 release,

Read more

Achieving a 300% speedup in ETL with Apache Spark

Categories: Data Ingestion, General, Hadoop, HDFS, Spark

A common design pattern emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to the EDH, where they are unpacked, transformed into an optimal query format, and tucked away in HDFS for various EDH components to use. When these file dumps are large or arrive very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable;
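The "unpack and transform into a query-optimized format" step is typically only a few lines of Spark. The sketch below is a generic illustration of that step (the paths and the partition column are hypothetical), not the pipeline from the post:

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // Read the raw CSV dumps from the landing area (hypothetical path).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/landing/dumps/*.csv")

    // Rewrite them as Parquet, partitioned for efficient querying
    // (assumes the dumps carry an ingest_date column).
    raw.write
      .mode("overwrite")
      .partitionBy("ingest_date")
      .parquet("hdfs:///data/warehouse/events_parquet")

    spark.stop()
  }
}
```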

Read more

Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group

Categories: Data Ingestion, Guest, Hadoop

In this guest post, Skool's architects at BT Group explain the tool's origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose.

Read more