Category Archives: Hadoop

Small Files, Big Foils: Addressing the Associated Metadata and Application Challenges

Categories: Hadoop HDFS

Small files are a common challenge in the Apache Hadoop world, and when not handled with care, they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large data sets in the range of terabytes to petabytes. However, HDFS handles small files poorly, leading to inefficient NameNode memory utilization and RPC calls, degraded block scanning throughput, and reduced application-layer performance. In this blog post,

Read more

Partition Management in Hadoop

Categories: Hadoop Hive

Guest blog post written by Adir Mashiach

In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files, and describe my solution in detail.

A little background

In my organization, we keep a lot of our data in HDFS. Most of it is raw data, but a significant amount is the final product of many data enrichment processes.

Read more

YARN FairScheduler Preemption Deep Dive

Categories: Hadoop YARN

The multi-part blog post Untangling Apache Hadoop YARN provided an overview of how the YARN scheduler works. In this post we discuss technical details around how FairScheduler Preemption works and best practices to consider when configuring it.

We also present a recent overhaul of FairScheduler Preemption in CDH 5.11 that attempts to address a number of issues documented in YARN-4752.

Definitions

Before we begin,

Read more

Deploy Cloudera EDH Clusters Like a Boss Revamped – Part 2

Categories: CDH Hadoop HDFS

In Part 1: Infrastructure Considerations of this three-part revamped series on deploying clusters like a boss, we provided a general explanation of how nodes are classified, along with the disk layout configurations and network topologies to think about when deploying your clusters.

In this Part 2: Service and Role Layouts segment of the series, we take a step higher up the stack, looking at the various services and roles that make up your Cloudera Enterprise deployment.

Read more

Hadoop Delegation Tokens Explained

Categories: CDH Hadoop HDFS Platform Security & Cybersecurity

Apache Hadoop’s security was designed and implemented around 2009, and has been stabilizing since then. However, due to a lack of documentation around this area, it’s hard to understand or debug when problems arise. Delegation tokens were designed and are widely used in the Hadoop ecosystem as an authentication method. This blog post introduces the concept of Hadoop Delegation Tokens in the context of Hadoop Distributed File System (HDFS) and Hadoop Key Management Server (KMS),

Read more