Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. This post explains how it works. HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases […]
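To make the 50% figure concrete: with a Reed-Solomon (6,3) layout, for example, six data cells are stored alongside three parity cells, a 1.5x storage footprint versus 3x for triple replication. Below is a minimal Java sketch of applying such a policy, assuming a Hadoop 3 client (where erasure coding eventually shipped) and a cluster where the policy has been enabled; the directory path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EcPolicyExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster and that an
        // administrator has enabled the built-in RS-6-3-1024k policy.
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        Path dir = new Path("/data/cold");  // hypothetical directory
        dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");

        // New files under /data/cold are striped as 6 data + 3 parity cells:
        // 9 units stored per 6 units of data (1.5x), versus 3x for replication.
        System.out.println(dfs.getErasureCodingPolicy(dir));
    }
}
```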
This new feature gives Hadoop admins the ability to replace failed DataNode drives without unscheduled downtime. Hot swapping—the process of replacing system components without shutting down the system—is a common and important operation in modern, production-ready systems. Because disk failures are common in data centers, the ability to hot-swap hard drives is a supported […]
Having a good grasp of HDFS recovery processes is important when running Apache Hadoop in production or moving toward it. In the conclusion to this two-part post, pipeline recovery is explained. An important design requirement of HDFS is to ensure continuous and correct operations that support production deployments. For that reason, it’s important for operators to understand […]
Having a good grasp of HDFS recovery processes is important when running Apache Hadoop in production or moving toward it. An important design requirement of HDFS is to ensure continuous and correct operations to support production deployments. One particularly complex area is ensuring the correctness of writes to HDFS in the presence of network and node failures, where […]
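One of the recovery processes this first part covers is lease recovery, which a client can also trigger explicitly when a previous writer died without closing a file. A minimal sketch, assuming an HDFS client on the classpath; the file path is hypothetical. Note that recoverLease is asynchronous, hence the polling loop.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        // A file left open by a crashed writer (path is hypothetical).
        Path stale = new Path("/logs/app/part-0001");

        // Ask the NameNode to start lease recovery; the call returns true
        // once the file is closed and its last block length is finalized.
        boolean closed = dfs.recoverLease(stale);
        while (!closed) {
            Thread.sleep(1000);  // poll until recovery completes
            closed = dfs.recoverLease(stale);
        }
        System.out.println("Lease recovered; file is closed.");
    }
}
```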
Support for transparent, end-to-end encryption in HDFS is now available and production-ready (and shipping inside CDH 5.3 and later). Here’s how it works. Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to […]
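For a sense of the setup, here is a hedged Java sketch of creating an encryption zone via HdfsAdmin, assuming a Hadoop 2.6+ client, a configured KMS, and an already-created key; the NameNode URI, directory, and key name are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class EncryptionZoneExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes a KMS is configured and a key named "myKey" exists
        // (e.g. created beforehand with the `hadoop key` tool).
        HdfsAdmin admin = new HdfsAdmin(new URI("hdfs://namenode:8020"), conf);

        // Mark an empty directory as an encryption zone keyed by "myKey".
        // Files written under it are encrypted with per-file keys that are
        // themselves encrypted by the zone key, transparently to clients.
        admin.createEncryptionZone(new Path("/secure"), "myKey");
    }
}
```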
Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too. Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like file size, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the […]
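Setting and reading an extended attribute from Java is a two-call affair through the FileSystem API (available since Hadoop 2.4). A small sketch; the file path and attribute name below are illustrative.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XAttrExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/report.csv");  // hypothetical file

        // Attribute names are namespaced; "user." is the general-purpose
        // namespace available to file owners.
        fs.setXAttr(file, "user.origin",
                "nightly-etl".getBytes(StandardCharsets.UTF_8));

        byte[] value = fs.getXAttr(file, "user.origin");
        System.out.println(new String(value, StandardCharsets.UTF_8));
    }
}
```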
Understanding how checkpointing works in HDFS can make the difference between a healthy cluster and a failing one. Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source […]
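The trigger logic itself is simple. As a rough illustration (this is a sketch, not the actual NameNode source), a checkpoint fires when either a configured period has elapsed or enough edit-log transactions have accumulated, mirroring dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns.

```java
/**
 * Illustrative sketch of the standard checkpoint trigger. Defaults mirror
 * dfs.namenode.checkpoint.period (3600 seconds) and
 * dfs.namenode.checkpoint.txns (1,000,000 transactions).
 */
public class CheckpointTrigger {
    private final long periodMillis = 3600 * 1000L;
    private final long txnThreshold = 1_000_000L;

    boolean shouldCheckpoint(long millisSinceLastCheckpoint,
                             long uncheckpointedTxns) {
        // A checkpoint merges the on-disk fsimage with the accumulated edit
        // log; firing on either condition bounds both NameNode restart time
        // and edit-log growth.
        return millisSinceLastCheckpoint >= periodMillis
                || uncheckpointedTxns >= txnThreshold;
    }

    public static void main(String[] args) {
        CheckpointTrigger t = new CheckpointTrigger();
        // 10 minutes elapsed but 1.2M transactions pending: triggers.
        System.out.println(t.shouldCheckpoint(10 * 60 * 1000L, 1_200_000L));
    }
}
```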
With HDP 1.3 and HDP 2.0 Beta, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors. HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system and are: Performant and Reliable: […]
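In client code, taking a snapshot is two calls against DistributedFileSystem, sketched below; the directory and snapshot name are made up, and allowSnapshot requires administrator privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        Path dir = new Path("/warehouse/sales");  // hypothetical subtree

        dfs.allowSnapshot(dir);  // admin step: make the directory snapshottable
        Path snap = dfs.createSnapshot(dir, "before-cleanup");

        // The snapshot is a read-only view under
        // /warehouse/sales/.snapshot/before-cleanup; no data blocks are
        // copied, which is why creation is fast regardless of subtree size.
        System.out.println("Created " + snap);
    }
}
```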
A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s […]
Motivation Apache Hadoop provides a high-performance native protocol for accessing HDFS. While this is great for Hadoop applications running inside a Hadoop cluster, users often want to connect to HDFS from the outside. For example, some applications have to load data into and out of the cluster, or interact with the data stored […]
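One such outside path is the WebHDFS REST API, which any plain HTTP client can use. Below is a minimal Java sketch of reading a file; the host, user, and path are placeholders, and 50070 was the default NameNode HTTP port in Hadoop 2.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsOpen {
    public static void main(String[] args) throws Exception {
        // WebHDFS read: op=OPEN on the NameNode, which redirects the client
        // to a DataNode serving the file. user.name works for clusters
        // using simple (non-Kerberos) authentication.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
                + "/tmp/hello.txt?op=OPEN&user.name=hdfs");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);  // follow NameNode -> DataNode

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```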