Category Archives: CDH

Meet the Engineer: Eric Sammer

Categories: CDH Cloudera Manager HBase Meet the Engineer

In this installment of “Meet the Engineer”, we meet with Eric Sammer (invariably known as just plain “Sammer”), Apache committer and author of the upcoming O’Reilly book, Hadoop Operations.

What do you do at Cloudera, and in which Apache project are you involved?

I’ve been lucky enough to be part of a few different teams at Cloudera since I joined. Almost three years ago,


What Do Real-Life Apache Hadoop Workloads Look Like?

Categories: CDH Hadoop HBase HDFS Hive MapReduce Oozie Ops and DevOps Pig Testing Use Case

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces.


How-to: Develop CDH Applications with Maven and Eclipse

Categories: CDH How-to Tools

Learn how to configure a basic Maven project that can build applications against CDH.

Apache Maven is a build automation tool for Java projects. Because nearly all of the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that can build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.


Apache Hadoop on Your PC: Cloudera’s CDH4 Virtual Machine

Categories: CDH Hadoop Training

Today ZDNet has very helpfully published a guide to downloading, configuring, and using Cloudera’s Demo VM for CDH4 (available in three flavors, but in this case the VMware version). As the author, Andrew Brust, explains, the VM contains a “pre-built, training-appropriate, 1-node Apache Hadoop cluster” (on top of CentOS). Perhaps most important for bootstrappers, it’s free.

You can download the VM here, and a Hadoop tutorial is available here. Together, they will go a long way toward jump-starting your explorations.


Process a Million Songs with Apache Pig

Categories: CDH Community MapReduce Pig

The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I found an interesting dataset called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo,
