Category Archives: HDFS

Project Rhino Goal: At-Rest Encryption for Apache Hadoop

Categories: HBase HDFS Platform Security & Cybersecurity

An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.

Encryption is a key requirement for many privacy- and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).

Although network encryption has been provided in the Apache Hadoop platform for some time (since Hadoop 2.0.2-alpha/CDH 4.1), at-rest encryption,

Read More

How-to: Use Kite SDK to Easily Store and Configure Data in Apache Hadoop

Categories: HBase HDFS How-to Kite SDK

Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.

Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level;

Read More

A Guide to Checkpointing in Hadoop

Categories: Hadoop HDFS Ops and DevOps

Understanding how checkpointing works in HDFS can make the difference between a healthy cluster and a failing one.

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.

In this post, I’ll explain the purpose of checkpointing in HDFS,

Read More

Apache Hadoop 2.3.0 is Released (HDFS Caching FTW!)

Categories: Community Hadoop HDFS Impala

Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.

The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):

  • In-memory caching for HDFS, including centralized administration and management
  • Groundwork for future support of heterogeneous storage in HDFS
  • Simplified distribution of MapReduce binaries via the YARN Distributed Cache

You can read the release notes here.

Read More

Apache Hadoop 2 is Here and Will Transform the Ecosystem

Categories: Community Hadoop HDFS YARN

The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.

Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances,

Read More