Category Archives: CDH

HttpFS for CDH3 – The Apache Hadoop FileSystem over HTTP

Categories: CDH, General, HDFS

HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy, so unlike WebHDFS it does not require clients to be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API.
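As a rough sketch of what this looks like from a client, the Java snippet below lists a directory through an HttpFS gateway using the WebHDFS-compatible REST API and only the JDK’s HttpURLConnection. The host name, path, and user name are placeholders, and the port assumes the HttpFS default of 14000.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: list an HDFS directory through an HttpFS gateway via the
// WebHDFS-compatible REST API. Host, path, and user name are placeholders;
// 14000 is the default HttpFS port.
public class HttpFsListStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://httpfs-host.example.com:14000/webhdfs/v1/user/alice"
                + "?op=LISTSTATUS&user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response is the same JSON FileStatuses document WebHDFS returns.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Because the API is WebHDFS-compatible, pointing the same request at a WebHDFS-enabled NameNode instead of the gateway should yield the same response; the practical difference is that with HttpFS the client only needs network access to the gateway host.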

Read more

Experimenting with MapReduce 2.0

Categories: CDH, General, MapReduce

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.
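As a small, hedged sketch of one piece of that migration, the snippet below shows the configuration change that routes an existing job written against the org.apache.hadoop.mapreduce API to the MapReduce 2.0 (YARN) runtime: setting mapreduce.framework.name to yarn. The class name is made up, and the mapper, reducer, and input/output setup are elided because they are unchanged from MapReduce 1.0.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: submit an existing org.apache.hadoop.mapreduce job to the
// MapReduce 2.0 (YARN) runtime. Job internals are omitted because they
// do not change; only the framework selection does.
public class SubmitToMr2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "yarn" selects the MapReduce 2.0 (YARN) runtime for this submission.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "existing job on MR2");
        job.setJarByClass(SubmitToMr2.class);
        // job.setMapperClass(...); job.setReducerClass(...);
        // FileInputFormat.addInputPath(job, ...); FileOutputFormat.setOutputPath(job, ...);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```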

Read more

Watching the Clock: Cloudera’s Response to Leap Second Troubles

Categories: CDH, Cloudera Manager, Community, General, Hadoop, Support, Testing

At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second, coupled with the Amazon Web Services outage, would make this Cloudera’s busiest support weekend to date.

Since Hadoop is written in Java and closely interoperates with the underlying OS,

Read more

Apache HBase Write Path

Categories: CDH, General, HBase

Apache HBase is the Hadoop database, and it is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable once created. So you may ask: how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase —
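As a taste before the details, here is a minimal, illustrative client write using the HBase Java API (the table, row, and column names are made up). Each such Put enters the write path the post describes: it is appended to the write-ahead log and buffered in the region server’s MemStore, and reaches an immutable HFile on HDFS only when the MemStore is flushed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a single client-side write. The table, row, column family,
// qualifier, and value below are placeholders.
public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");
        try {
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("hello"));
            // Server side, this edit is logged to the write-ahead log and held
            // in the MemStore; it is written to an immutable HFile only when
            // the MemStore flushes.
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```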

Read more

The Elephant in the Enterprise

Categories: CDH, Community, General

On Tuesday, June 12th, The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to a standard component of today’s enterprise computing fabric. The lively discussion was moderated by Cade Metz, Editor, Wired Enterprise.

Panelists included:

Michael Driscoll, CEO, Metamarkets
Andrew Mendelsohn, SVP, Oracle Server Technologies
Mike Olson, CEO, Cloudera
Jay Parikh,

Read more