Category Archives: HDFS

Progress Report: Bringing Erasure Coding to Apache Hadoop

Categories: Hadoop HDFS Performance

Get an update on the effort to bring erasure coding to HDFS, including a report on new performance benchmark results.

About a year ago, the Apache Hadoop community began the HDFS-EC project to build native erasure coding support inside HDFS (currently targeted for the 2.9/3.0 release). Since then, we have designed and implemented basic functionality in the first phase of the project under HDFS-7285,

Read More

New in CDH 5.5: Apache Parquet Usability Improvements

Categories: CDH HDFS Hive Impala Parquet Performance

Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.

Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.

Auto-Detection of HDFS Block Size

For example,

Read More

DistCp Performance Improvements in Apache Hadoop

Categories: CDH Hadoop HDFS Performance Tools

Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.

DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown despite relatively slow performance.

In this post, we’ll provide a quick introduction to DistCp.

Read More

New in Cloudera Labs: Apache HTrace (incubating)

Categories: CDH Cloudera Labs HDFS Performance

Via a combination of beta functionality in CDH 5.5 and new Cloudera Labs packages, you now have access to Apache HTrace for doing performance tracing of your HDFS-based applications.

HTrace is a new Apache incubator project that provides a bird’s-eye view of the performance of a distributed system. While log files can provide a peek into important events on a specific node, and metrics can answer questions about aggregate performance,

Read More

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data

Categories: Hadoop HBase HDFS Impala Kudu Performance Spark

This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.

The set of data storage and processing technologies that define the Apache Hadoop ecosystem is expansive and ever-improving, covering a very diverse set of customer use cases in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop, making it faster,

Read More