YARN FairScheduler Preemption Deep Dive

Categories: Hadoop YARN

The multi-part blog post Untangling Apache Hadoop YARN provided an overview of how the YARN scheduler works. In this post, we discuss the technical details of how FairScheduler Preemption works and the best practices to consider when configuring it.

We also present a recent overhaul of FairScheduler Preemption in CDH 5.11, which attempts to address a number of issues documented in YARN-4752.

Definitions

Before we begin,

Read more

Deploy Cloudera EDH Clusters Like a Boss Revamped – Part 3: Cloud Considerations

Categories: CDH

The previous two sections concentrated on infrastructure considerations and on service and role layouts for categories of workloads such as Analytic DB and Operational DB. Many of the concepts described there apply mainly to on-premises clusters, while others apply to clusters deployed either on-premises or in the cloud. This section concentrates on the considerations that apply only to cloud deployments.

At the time of this writing, Cloudera supports three Infrastructure as a Service (IaaS) platforms: Amazon Web Services (AWS),

Read more

Evaluating Partner Platforms

Categories: CDH Hardware How-to Performance

As a member of Cloudera’s Partner Engineering team, I evaluate hardware and cloud computing platforms offered by commercial partners who want to certify their products for use with Cloudera software. One of my primary goals is to make sure that these platforms provide a stable and well-performing base upon which our products will run, a state of operation that a wide variety of customers performing an even wider variety of tasks can appreciate.

Read more

New in Cloudera Enterprise 6.0: Analytic Search

Categories: CDH Search

It has been a long and patient wait for Apache Hadoop 3.0 to mature. A major new version of the storage layer obviously impacts all our integrated components, including Apache Solr and all our integrations with the rest of the platform, commonly referred to as Cloudera Search. Since our customers’ Search deployments are so often mission critical, we’ve made sure to take time to do extensive integration testing and focus on the upgrade experience.

Now the moment has finally come to announce Solr 7.0 in Cloudera Search, available as of our new major release,

Read more

Scalability of Kafka Messaging using Consumer Groups

Categories: Data Ingestion Flume Kafka Use Case

Traditional messaging models fall into two categories: shared message queues and publish-subscribe models. Each has its own pros and cons, but neither could handle big data ingestion at scale because of limitations in its design. Apache Kafka implements a publish-subscribe messaging model that provides fault tolerance and the scalability to handle large volumes of streaming data for real-time analytics. It was developed at LinkedIn in 2010 to meet the company's growing data pipeline needs. Apache Kafka fills the gaps that traditional messaging models left open.

Read more