Category Archives: HDFS

Achieving a 300% speedup in ETL with Apache Spark

Categories: Data Ingestion General Hadoop HDFS Spark

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are unpacked, transformed into an optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or arrive very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable;

Read more

How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop

Categories: CDH Hadoop HDFS

HDFS now includes (shipping in CDH 5.8.2 and later) a comprehensive storage capacity-management approach for moving data across nodes.

In HDFS, the DataNode spreads the data blocks into local filesystem directories, which can be specified using dfs.datanode.data.dir in hdfs-site.xml. In a typical installation, each directory, called a volume in HDFS terminology, is on a different device (for example, on separate HDD and SSD).
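As a rough illustration of how the volume list is expressed, the sketch below reads dfs.datanode.data.dir from an hdfs-site.xml fragment and splits its comma-separated value into individual volumes. The directory paths and the helper name datanode_volumes are assumptions made up for this example; they do not come from the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml fragment; the directory paths are invented for illustration.
HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
</configuration>
"""


def datanode_volumes(xml_text):
    """Return the volume directories listed in dfs.datanode.data.dir (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.datanode.data.dir":
            value = prop.findtext("value") or ""
            # The value is a comma-separated list; each entry is one volume.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


volumes = datanode_volumes(HDFS_SITE)
print(volumes)
```

Each entry in the resulting list corresponds to one volume in the sense described above.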

When writing new blocks to HDFS,

Read more

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science General HDFS Impala Kudu Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms that process Arrow data structures.
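To make the columnar idea concrete, here is a minimal sketch using only Python's standard array module. It is not Arrow's actual format or API; the sample records and the names rows, ids, scores, and value_at are assumptions chosen for illustration. It shows why a fixed-width, contiguous column gives constant-time access by index and why scanning one field touches only that field's buffer.

```python
from array import array

# Row-oriented layout: each record is its own object, so the values of a
# single field are scattered across many records. (Sample data is invented.)
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented layout: each field lives in one contiguous, fixed-width buffer.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
scores = array("d", (r[1] for r in rows))  # 64-bit floats


def value_at(column, i):
    """O(1) lookup: element i sits at byte offset i * column.itemsize."""
    return column[i]


# An analytic scan over one field reads a single contiguous buffer.
total = sum(scores)
print(value_at(scores, 1), total)
```

Because every element in a column has the same width, locating element i is a single multiplication rather than a walk through records, which is the property the bullet above refers to.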

Read more

Progress Report: Bringing Erasure Coding to Apache Hadoop

Categories: Hadoop HDFS Performance

Get an update on the progress of the effort to bring erasure coding to HDFS, including a report about fresh performance benchmark testing results.

About a year ago, the Apache Hadoop community began the HDFS-EC project to build native erasure coding support inside HDFS (currently targeted for the 2.9/3.0 release). Since then, we have designed and implemented basic functionalities in the first phase of the project under HDFS-7285, and have merged the changes to the Hadoop trunk.

Read more

New in CDH 5.5: Apache Parquet Usability Improvements

Categories: CDH HDFS Hive Impala Parquet Performance

Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.

Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.

Auto-Detection of HDFS Block Size

For example, you may have seen this warning: Read <some-big-number>

Read more