Category Archives: Hadoop

YuniKorn: a universal resource scheduler

Categories: Cloud Hadoop YARN

Hello world, it’s been a while!

We are excited to announce the open-sourcing of one of the new projects we’ve been working on behind the scenes at the intersection of big-data and computation platforms: YuniKorn!

YuniKorn is a new standalone universal resource scheduler responsible for allocating and managing resources for big-data workloads, including batch jobs and long-running services.

Let’s dive right in!

Introduction

YuniKorn is a light-weight,

Read more

Small Files, Big Foils: Addressing the Associated Metadata and Application Challenges

Categories: Hadoop HDFS

Small files are a common challenge in the Apache Hadoop world, and when not handled with care they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large data sets in the range of terabytes to petabytes. However, HDFS stores small files inefficiently, leading to wasteful NameNode memory utilization and RPC calls, degraded block-scanning throughput, and reduced application-layer performance. In this blog post,

Read more

Partition Management in Hadoop

Categories: Hadoop Hive

Guest blog post written by Adir Mashiach

In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files, and describe my solution in detail.

A little background

In my organization, we keep a lot of our data in HDFS. Most of it is raw data, but a significant amount is the final product of many data enrichment processes.

Read more

YARN FairScheduler Preemption Deep Dive

Categories: Hadoop YARN

The multi-part blog post Untangling Apache Hadoop YARN provided an overview of how the YARN scheduler works. In this post we discuss technical details around how FairScheduler Preemption works and best practices to consider when configuring it.

We also present a recent overhaul of FairScheduler Preemption in CDH 5.11, which attempts to address a number of issues documented in YARN-4752.

Definitions

Before we begin,

Read more

Deploy Cloudera EDH Clusters Like a Boss Revamped – Part 2

Categories: CDH Hadoop HDFS

In Part 1: Infrastructure Considerations of this three-part revamped series on deploying clusters like a boss, we gave a general explanation of how nodes are classified, along with the disk layout configurations and network topologies to consider when deploying your clusters.

In this Part 2: Service and Role Layouts segment of the series, we move a step up the stack, looking at the various services and roles that make up your Cloudera Enterprise deployment.

Read more