Category Archives: HDFS

Apache Solr Memory Tuning for Production

Categories: CDH HDFS Search

Configuring Apache Solr memory properly is critical for the stability and performance of a production system. Finding the right balance between competing goals can be hard, and several factors, both implicit and explicit, need to be taken into account. This blog post covers some common memory tuning tasks and walks you through the process of configuring Solr memory for a production system.

For simplicity, this blog post applies to Solr in Cloudera CDH 5.11 running on top of HDFS.

Read More

HDFS Maintenance State

Categories: CDH HDFS

Introduction:

System maintenance operations, such as updating operating systems and applying security patches or hotfixes, are routine in any data center. DataNodes undergoing such maintenance can go offline for anywhere from a few minutes to several hours. By design, Apache Hadoop HDFS can handle DataNodes going down. However, uncoordinated maintenance on several DataNodes at the same time can lead to temporary data availability issues. HDFS currently supports the following features for performing planned maintenance:

  1. Rolling Upgrade
  2. Decommission
  3. Maintenance State (starting with CDH 5.11)

The rolling upgrade process helps to upgrade the cluster software without taking the cluster offline.

Read More

HDFS DataNode Scanners and Disk Checker Explained

Categories: CDH Hadoop HDFS

As many of us know, data in HDFS is stored on DataNodes, and HDFS can tolerate DataNode failures by replicating the same data to multiple DataNodes. But what exactly happens if some DataNodes' disks are failing? This blog post explains some of the background work done on the DataNodes to help HDFS manage its data across multiple DataNodes for fault tolerance. In particular, we will explain block scanner, volume scanner,

Read More

Achieving a 300% speedup in ETL with Apache Spark

Categories: Data Ingestion General Hadoop HDFS Spark

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are unpacked, transformed into an optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or arrive very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable;

Read More

How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop

Categories: CDH Hadoop HDFS

HDFS now includes (shipping in CDH 5.8.2 and later) a comprehensive storage capacity-management approach for moving data across nodes.

In HDFS, the DataNode spreads data blocks across local filesystem directories, which are specified using dfs.datanode.data.dir in hdfs-site.xml. In a typical installation, each directory, called a volume in HDFS terminology, is on a different device (for example, a separate HDD or SSD).
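To make the volume layout concrete, here is a minimal Python sketch that reads dfs.datanode.data.dir from an hdfs-site.xml fragment and lists the configured volumes. The fragment, the /data/N/dfs/dn paths, and the parse_volumes helper are all hypothetical and only for illustration. The sketch assumes the comma-separated value format with an optional storage-type tag such as [SSD] in front of each directory, and treats untagged entries as DISK.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml fragment; the directory paths are illustrative only.
HDFS_SITE = """<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>[DISK]/data/1/dfs/dn,[DISK]/data/2/dfs/dn,[SSD]/data/3/dfs/dn</value>
  </property>
</configuration>"""


def parse_volumes(xml_text):
    """Return (storage_type, directory) pairs from dfs.datanode.data.dir.

    Hypothetical helper: entries without a [TYPE] tag are assumed to be DISK.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "dfs.datanode.data.dir":
            value = prop.findtext("value") or ""
            break
    else:
        # The property is not present in this configuration.
        return []

    volumes = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split an optional leading storage-type tag such as [SSD] from the path.
        match = re.match(r"^\[(\w+)\](.*)$", entry)
        if match:
            volumes.append((match.group(1).upper(), match.group(2).strip()))
        else:
            volumes.append(("DISK", entry))
    return volumes


volumes = parse_volumes(HDFS_SITE)
for storage_type, directory in volumes:
    print(storage_type, directory)
```

Each pair returned corresponds to one volume in the sense described above: one local directory, normally backed by its own device.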

When writing new blocks to HDFS,

Read More