Category Archives: HDFS

A Look at ADLS Performance – Throughput and Scalability

Categories: CDH Cloud Hadoop HDFS Performance

Overview

Azure Data Lake Store (ADLS) is a highly scalable cloud-based data store that is designed  for collecting, storing and analyzing large amounts of data, and is ideal for enterprise-grade applications.  Data can originate from almost any source, such as Internet applications and mobile devices; it is stored securely and durably, while being highly available in any geographic region.  ADLS is performance-tuned for big data analytics and can be easily accessed from many components of the Apache Hadoop ecosystem,

Read more

Using Amazon S3 with Cloudera BDR

Categories: CDH Cloud Cloudera Manager HDFS Hive

More of you are moving to public cloud services for backup and disaster recovery purposes, and Cloudera has been enhancing the capabilities of Cloudera Manager and CDH to help you do that. Specifically, Cloudera Backup and Disaster Recovery (BDR) now supports backup to and restore from Amazon S3 for Cloudera Enterprise customers.

BDR lets you replicate Apache HDFS data from your on-premise cluster to or from Amazon S3 with full fidelity (all file and directory metadata is replicated along with the data).

Read more

implyr: R Interface for Apache Impala

Categories: CDH Data Science HBase HDFS Impala Kudu Tools

New R package implyr enables R users to query Impala using dplyr.

Apache Impala (incubating) enables low-latency interactive SQL queries on data stored in HDFS, Amazon S3, Apache Kudu, and Apache HBase. With the availability of the R package implyr on CRAN and GitHub, it’s now possible to query Impala from R using the popular package dplyr.

dplyr provides a grammar of data manipulation,

Read more

Apache Solr Memory Tuning for Production

Categories: CDH HDFS Search

Configuring Apache Solr memory properly is critical for production system stability and performance. It can be hard to find the right balance between competing goals. There are also multiple factors, implicit or explicit, that need to be taken into consideration. This blog talks about some common tasks in memory tuning and guides you through the process to help you understand how to configure Solr memory for a production system.

For simplicity, this blog applies to Solr in Cloudera CDH5.11 running on top of HDFS.

Read more

HDFS Maintenance State

Categories: CDH HDFS

Introduction:

System maintenance operations such as updating operating systems, and applying security patches or hotfixes are routine operations in any data center. DataNodes undergoing such maintenance operations can go offline for anywhere from a few minutes to several hours. By design, Apache Hadoop HDFS can handle DataNodes going down. However, any uncoordinated maintenance operations on several DataNodes at the same time could lead to temporary data availability issues. HDFS currently supports the following features for performing planned maintenance activity:

  1. Rolling Upgrade
  2. Decommission
  3. HDFS supports using Maintenance State (Starting with CDH 5.11)

The rolling upgrade process helps to upgrade the cluster software without taking the cluster offline.

Read more