Category Archives: Ops and DevOps

How-to: Use Vagrant to Set Up a Virtual Hadoop Cluster (updated for CDH 5)

Categories: CDH, Cloudera Manager, Guest, Ops and DevOps

This guest post, which is now updated for CDH 5, comes to us from David Greco.

Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.

Vagrant provides a very easy-to-use, Ruby-based internal DSL that allows the user to define one or more virtual machines together with their configuration parameters.

Read more

How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager

Categories: CDH, Cloud, Cloudera Manager, How-to, Impala, Ops and DevOps

Editor’s Note (added Feb. 25, 2015): For releases beyond 4.5, Cloudera recommends the use of Cloudera Director for deploying CDH in cloud environments.

Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible (for testing and development purposes only,

Read more

Exploring Compression for Hadoop: One DBA’s Story

Categories: General, Guest, HDFS, Ops and DevOps

This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).

Most Western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.

Sometimes solving technical problems is similar to navigating a city without many street names: Once you arrive at the desired location,

Read more

How-to: Automate Your Cluster with Cloudera Manager API

Categories: Cloudera Manager, Hadoop, How-to, MapReduce, Ops and DevOps, Tools

API access was a new feature introduced in Cloudera Manager 4.0 (download the free edition here). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
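To make "programmatic access" concrete, here is a minimal Python sketch of how a client might prepare a call to the CM API. It is not taken from the full article. The host name, the `admin`/`admin` credentials, the default port 7180, the `v1` API version, and the helper name `build_cm_request` are all assumptions for illustration. The sketch only builds the request and does not send it.

```python
import base64
import urllib.request


def build_cm_request(host, path, user, password, port=7180, api_version="v1"):
    """Build (but do not send) a GET request for a Cloudera Manager API path.

    Hypothetical helper: the port and API version defaults are assumptions.
    """
    url = f"http://{host}:{port}/api/{api_version}/{path.lstrip('/')}"
    # HTTP basic authentication: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder host and credentials; replace with values for a real deployment.
request = build_cm_request("cm-host.example.com", "clusters", "admin", "admin")
print(request.get_method(), request.full_url)
```

Against a reachable Cloudera Manager server, passing the request to `urllib.request.urlopen` would be expected to return a JSON document. Keeping construction separate from sending makes it possible to inspect the URL and headers first.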

Cloudera Manager API Basics

The CM API is an HTTP REST API,

Read more

What Do Real-Life Apache Hadoop Workloads Look Like?

Categories: CDH, Hadoop, HBase, HDFS, Hive, MapReduce, Oozie, Ops and DevOps, Pig, Testing, Use Case

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces.

Read more