Category Archives: Use Case

Offheap Read-Path in Production – The Alibaba story

Categories: Hadoop HBase Performance Use Case

This article is syndicated with permission from the Apache HBase blog and highlights a collaboration between our partners at Intel and Alibaba engineering in time for “Singles Day“, the biggest shopping day on the net. For more on HBase, mark your calendars! On June 12th, 2017 the Apache HBase community will be hosting their annual HBaseCon.

Introduction

HBase is the core storage system in Alibaba’s Search Infrastructure.

Read more

How To Set Up a Shared Amazon RDS as Your Hive Metastore

Categories: Cloud Hadoop Hive How-to Impala Spark Use Case

Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model is ideal for clusters where each cluster contains the data locally along with the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (like Amazon S3), making it possible for the data to live across multiple clusters and beyond any cluster’s lifespan. In this scenario clusters need to regenerate and coordinate metadata for the underlying shared data individually.

Read more

Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0

Categories: CDH Data Science Hadoop Spark Use Case

We posted several blog posts about sparklyr (introduction, automation), which enables you to analyze big data leveraging Apache Spark seamlessly with R. sparklyr, developed by RStudio, is an R interface to Spark that allows users to use Spark as the backend for dplyr, which is the popular data manipulation package for R.

If you are interested in sparklyr, you can learn how to use it with the official document,

Read more

Up and running with Apache Spark on Apache Kudu

Categories: CDH Data Ingestion Data Science General Hadoop How-to Impala Kudu Spark Training Use Case

After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and explain how to get up and running quickly, as Kudu is already a first-class citizen in Spark’s ecosystem.

 

As the Apache Kudu development team celebrates the initial 1.0 release launched on September 19, and the most recent 1.2.0 version now GA as part of Cloudera’s CDH 5.10 release,

Read more

Resource Management for Apache Impala (incubating)

Categories: CDH Cloudera Manager Hadoop Impala Ops and DevOps Use Case

Apache Impala (incubating) includes several features that allow you to restrict or allocate resources so as to maximize stability and performance for your Impala workloads. You can limit both CPU and memory resources used by Impala to manage and prioritize jobs on CDH clusters. This blog post describes the techniques a typical Impala deployment can use to manage its resources.

Static Service Pools

Static service pools isolate services from one another, so that a high load on one service has limited impact on other services.

Read more