Tag Archives: Flume

How-to: Do Data Quality Checks using Apache Spark DataFrames

Categories: How-to Spark

Apache Spark’s ability to support data quality checks via DataFrames is progressing rapidly. This post explains the state of the art and future possibilities.

Apache Hadoop and Apache Spark make Big Data accessible and usable so we can easily find value, but that data has to be correct, first. This post will focus on this problem and how to solve it with Apache Spark 1.3 and Apache Spark 1.4 using DataFrames.

Read more

Deploying Apache Kafka: A Practical FAQ

Categories: CDH Kafka

This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise data hub.

Cloudera added support for Apache Kafka, the open standard for streaming data, in February 2015 after its brief incubation period in Cloudera Labs. Apache Kafka now is an integrated part of CDH, manageable via Cloudera Manager, and we are witnessing rapid adoption of Kafka across our customer base.

Read more

Inside Apache HBase’s New Support for MOBs

Categories: HBase

Learn about the design decisions behind HBase’s new support for MOBs.

Apache HBase is a distributed, scalable, performant, consistent key value database that can store a variety of binary data types. It excels at storing many relatively small values (<10K), and providing low-latency reads and writes.

However, there is a growing demand for storing documents, images, and other moderate objects (MOBs)  in HBase while maintaining low latency for reads and writes.

Read more

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Categories: Data Ingestion Flume Hadoop HBase Kafka Spark

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns,

Read more

Cloudera Enterprise 5.4 is Released

Categories: CDH Cloud Cloudera Manager

We’re pleased to announce the release of Cloudera Enterprise 5.4 (comprising CDH 5.4, Cloudera Manager 5.4, and Cloudera Navigator 2.3).

Cloudera Enterprise 5.4 (Release Notes) reflects critical investments in a production-ready customer experience through  governance, security, performance and deployment flexibility in cloud environments. It also includes support for a significant number of updated open standard components–including Apache Spark 1.3, Impala 2.2, and Apache HBase 1.0 (as well as unsupported beta releases of Hive-on-Spark data processing and OpenStack deployments).

Read more