Category Archives: CDH

Introducing S3Guard: S3 Consistency for Apache Hadoop

Categories: Altus CDH Cloud Hadoop

Synopsis

This article introduces a new Apache Hadoop feature called S3Guard. S3Guard addresses one of the major challenges of running Hadoop on Amazon’s Simple Storage Service (S3): eventual consistency. We outline the problem of S3’s eventual consistency, describe how it affects Hadoop workloads, and explain how S3Guard works.

Problem

Although Apache Hadoop supports using Amazon Simple Storage Service (S3) as a Hadoop filesystem, S3 behaves differently from HDFS. One of the key differences is the level of consistency provided by the underlying filesystem.
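To make the consistency gap concrete, here is a minimal toy simulation in Python. It does not use the real S3 or S3Guard APIs. `EventuallyConsistentStore`, `GuardedStore`, and the `listing_lag` parameter are hypothetical names invented for this sketch, and the "track writes in a consistent side record" approach is an assumption about the general idea, since the excerpt above does not describe S3Guard's mechanism.

```python
class EventuallyConsistentStore:
    """Toy object store (hypothetical): a newly written key shows up in
    listings only after `listing_lag` ticks of a logical clock."""

    def __init__(self, listing_lag=2):
        self.listing_lag = listing_lag
        self.clock = 0
        self.objects = {}  # key -> (data, logical time of the write)

    def tick(self, steps=1):
        # Advance the logical clock to simulate time passing.
        self.clock += steps

    def put(self, key, data):
        self.objects[key] = (data, self.clock)

    def list(self, prefix):
        # Only keys written at least `listing_lag` ticks ago are listed.
        return sorted(
            key
            for key, (_, written_at) in self.objects.items()
            if key.startswith(prefix)
            and self.clock - written_at >= self.listing_lag
        )


class GuardedStore:
    """Assumed illustration: record every write in a consistent side
    record and merge that record into each listing."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()  # stands in for a consistent metadata store

    def put(self, key, data):
        self.store.put(key, data)
        self.metadata.add(key)

    def list(self, prefix):
        known = {key for key in self.metadata if key.startswith(prefix)}
        return sorted(known | set(self.store.list(prefix)))


raw = EventuallyConsistentStore(listing_lag=2)
guarded = GuardedStore(raw)

guarded.put("out/part-0000", b"a")
guarded.put("out/part-0001", b"b")

stale = raw.list("out/")      # raw listing taken right after the writes
fresh = guarded.list("out/")  # listing merged with the side record

raw.tick(2)
later = raw.list("out/")      # raw listing once the lag has elapsed
```

In this sketch, a job that lists `out/` directly against the raw store right after writing would miss both files, while the guarded listing sees them immediately. A job on HDFS would never face this gap, which is the kind of difference the paragraph above refers to.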

Read more

Using Amazon S3 with Cloudera BDR

Categories: CDH Cloud Cloudera Manager HDFS Hive

More of you are moving to public cloud services for backup and disaster recovery purposes, and Cloudera has been enhancing the capabilities of Cloudera Manager and CDH to help you do that. Specifically, Cloudera Backup and Disaster Recovery (BDR) now supports backup to and restore from Amazon S3 for Cloudera Enterprise customers.

BDR lets you replicate Apache HDFS data between your on-premises cluster and Amazon S3, in either direction, with full fidelity (all file and directory metadata is replicated along with the data).

Read more

Quicker Insight into Apache Solr and Collection Health

Categories: CDH Cloudera Manager How-to Search

Successful cluster administration can be very difficult without a real-time view of the state of the cluster. Solr itself does not provide aggregated views of its state or any historical usage data, both of which are necessary to understand how the service is used and how it is performing. Knowing the throughput and capacity not only helps detect errors and troubleshoot issues, but is also useful for capacity planning.

Questions may arise, such as:

  • What is the size of my cluster and each collection?

Read more

implyr: R Interface for Apache Impala

Categories: CDH Data Science HBase HDFS Impala Kudu Tools

The new R package implyr enables R users to query Impala using dplyr.

Apache Impala (incubating) enables low-latency interactive SQL queries on data stored in HDFS, Amazon S3, Apache Kudu, and Apache HBase. With the availability of the R package implyr on CRAN and GitHub, it’s now possible to query Impala from R using the popular package dplyr.

dplyr provides a grammar of data manipulation,

Read more