Tag Archives: CDH

Offset Management For Apache Kafka With Apache Spark Streaming

Categories: CDH Kafka Spark

An ingest pattern we commonly see at Cloudera customers is an Apache Spark Streaming application that reads data from Kafka. Streaming data continuously from Kafka has many benefits, such as the ability to gather insights faster. However, users must manage Kafka offsets in order to recover their streaming applications from failures. In this post, we provide an overview of offset management and cover the following topics:

  • Storing offsets in external data stores
    • Checkpoints
    • HBase
    • ZooKeeper
    • Kafka
  • Not managing offsets

Overview of Offset Management

Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics.
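
To make the recovery concern concrete, here is a minimal sketch in plain Python rather than Spark or Kafka code. It assumes a hypothetical `InMemoryOffsetStore` standing in for an external store such as HBase or ZooKeeper, and a Python list standing in for one topic partition. The `process_batch` helper and every name in it are illustrative assumptions, not part of any Spark or Kafka API. The idea shown is that the offset is saved only after a batch has been fully processed, so a restart resumes from the last saved offset and may re-process some messages (at-least-once behavior).

```python
class InMemoryOffsetStore:
    """Hypothetical stand-in for an external offset store (e.g. HBase, ZooKeeper)."""

    def __init__(self):
        self._offsets = {}

    def load(self, topic, partition):
        # Next offset to read; 0 when nothing has been saved yet.
        return self._offsets.get((topic, partition), 0)

    def save(self, topic, partition, next_offset):
        self._offsets[(topic, partition)] = next_offset


def process_batch(log, store, topic, partition, batch_size, handler):
    """Read one batch starting at the saved offset, then save the new offset."""
    start = store.load(topic, partition)
    batch = log[start:start + batch_size]
    for message in batch:
        handler(message)  # may raise; the offset is then left unchanged
    # Saved only after the whole batch succeeded.
    store.save(topic, partition, start + len(batch))
    return len(batch)


# A list stands in for a single topic partition.
log = ["m0", "m1", "m2", "m3", "m4"]
store = InMemoryOffsetStore()
seen = []
fail_once = {"m3"}


def handler(message):
    if message in fail_once:
        fail_once.discard(message)
        raise RuntimeError("simulated failure on " + message)
    seen.append(message)


process_batch(log, store, "events", 0, 2, handler)  # handles m0 and m1
try:
    process_batch(log, store, "events", 0, 2, handler)  # m2 succeeds, m3 fails
except RuntimeError:
    pass  # simulated application crash

# After the "restart", resume from the last saved offset until the log is drained.
while process_batch(log, store, "events", 0, 2, handler):
    pass

print(seen)
print(store.load("events", 0))
```

In this sketch the message `m2` is handled twice, because the failed batch never saved its offset. That duplication is the trade-off of saving offsets after processing; saving them before processing would instead risk skipping messages on failure.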

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search

Categories: CDH How-to Search

In this guide, learn how to use Cloudera Search with Basis Technology’s Rosette® to perform fuzzy name searches in multiple languages and scripts.

Our thanks to the Basis Technology team (Jeanne Le Garrec, Hannah MacKenzie-Margulies, and Brian Sawyer) for their help in writing this how-to.

Cloudera Search, powered by Apache Solr, brings full-text, interactive search and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS, Apache HBase,

What’s New in Cloudera Director 2.2?

Categories: CDH Cloud Cloudera Manager Hadoop

This new release adds support for Amazon EBS volumes and the ability to diagnose cluster bootstrap errors quickly.

Cloudera Director provides a simple, reliable, enterprise-grade way to deploy, scale, and manage Apache Hadoop in the cloud of your choice. Cloudera Director enables you to deploy production-ready clusters for big data applications and successfully run workloads in the cloud.

Cloudera Director makes it easier for customers to:

  • Deploy clusters in line with patterns native to cloud infrastructure
  • Use an interface to define in one place the desired cluster specification all the way down to the operating system
  • Repeatedly and programmatically instantiate these cluster definitions
  • Adapt to the dynamic nature of cloud infrastructure

Cloudera Director 2.2 provides additional mechanisms to get that initial cluster definition right and the ability to diagnose errors and iterate quickly.

Progress Report: Hive-on-Spark Nears Production Readiness

Categories: Cloudera Labs Hive Spark

Contributors from Intel, Cloudera, and the rest of the community have been making strong progress on the Hive-on-Spark initiative. This post provides an update.

[Editor’s note (April 20, 2016): Hive-on-Spark is now GA/shipping starting in CDH 5.7.]

Since its inception about one year ago, the community initiative to make Apache Spark a data processing engine for Apache Hive (HIVE-7292) has attracted widespread interest from developers around the world and gone through phases of rapid development,

Docker is the New QuickStart Option for Apache Hadoop and Cloudera

Categories: CDH Ops and DevOps QuickStart VM Testing

Now there’s an even quicker “QuickStart” option for getting hands-on with the Apache Hadoop ecosystem and Cloudera’s platform: a new Docker image.

You might already be familiar with Cloudera’s popular QuickStart VM, a virtual image containing our distributed data processing platform. Originally intended as a demo environment, the QuickStart VM quickly evolved over time into quite a useful general-purpose environment for developers, customers, and partners. Today,
