Category Archives: CDH

Process a Million Songs with Apache Pig

Categories: CDH Community MapReduce Pig

The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

I recently found an interesting dataset, the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo,


Cloudera Software Engineer Eli Collins on Apache Hadoop and CDH4

Categories: CDH Hadoop HBase HDFS MapReduce Pig

In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, the main differences between MapReduce 1.0 and 2.0, and Hadoop use cases. It’s a great primer for people who are relatively new to Hadoop.


CDH3 update 5 is now available

Categories: Avro CDH Community Flume General Hadoop HBase HDFS Hive MapReduce Oozie Pig Sqoop ZooKeeper

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following:

  • Apache Flume 1.2.0 – Provides a durable file channel and many more features than the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for the Avro data format.


HttpFS for CDH3 – The Apache Hadoop FileSystem over HTTP

Categories: CDH General HDFS

HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy so, unlike WebHDFS, it does not require clients to be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API.
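Because HttpFS speaks the WebHDFS REST API, a client only needs to build WebHDFS-style URLs and point them at the gateway rather than at individual cluster nodes. The following is a minimal sketch of that idea, not an official client: the helper name `httpfs_url`, the host `httpfs.example.com`, and the user `alice` are hypothetical, and port 14000 is assumed to be the gateway's default HTTP port.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style REST URL aimed at an HttpFS gateway.

    host, port: address of the gateway (hypothetical values in this sketch).
    path:       absolute file system path, e.g. "/user/alice".
    op:         WebHDFS operation name, e.g. "LISTSTATUS" or "OPEN".
    user:       sent as the user.name query parameter (pseudo authentication).
    """
    query = urlencode({"op": op, "user.name": user, **params})
    # Every request goes to the single gateway address; the client never
    # contacts other machines in the cluster directly.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Example: URL for listing a directory through the gateway.
print(httpfs_url("httpfs.example.com", "/user/alice", "LISTSTATUS", "alice"))
```

The resulting URL can be fetched with any HTTP client. Only the gateway's host and port have to be reachable from outside the firewall.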


Experimenting with MapReduce 2.0

Categories: CDH General MapReduce

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.
