Author Archives: Yongjun Zhang

Hadoop Delegation Tokens Explained

Categories: CDH Hadoop HDFS Platform Security & Cybersecurity

Apache Hadoop’s security was designed and implemented around 2009, and has been stabilizing since then. However, due to a lack of documentation around this area, it’s hard to understand or debug when problems arise. Delegation tokens were designed and are widely used in the Hadoop ecosystem as an authentication method. This blog post introduces the concept of Hadoop Delegation Tokens in the context of Hadoop Distributed File System (HDFS) and Hadoop Key Management Server (KMS),

Read more

DistCp Performance Improvements in Apache Hadoop

Categories: CDH Hadoop HDFS Performance Tools

Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.

DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown in popularity despite relatively slow performance.

In this post, we’ll provide a quick introduction to DistCp.

Read more

Understanding HDFS Recovery Processes (Part 2)

Categories: Hadoop HDFS

Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop. In the conclusion to this two-part post, pipeline recovery is explained.

An important design requirement of HDFS is to ensure continuous and correct operations that support production deployments. For that reason, it’s important for operators to understand how HDFS recovery processes work. In Part 1 of this post, we looked at lease recovery and block recovery.

Read more

Understanding HDFS Recovery Processes (Part 1)

Categories: Hadoop HDFS

Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop.

An important design requirement of HDFS is to ensure continuous and correct operations to support production deployments. One particularly complex area is ensuring correctness of writes to HDFS in the presence of network and node failures, where the lease recovery, block recovery, and pipeline recovery processes come into play. Understanding when and why these recovery processes are called,

Read more