Author Archives: Jeff Hammerbacher

CDH3b2 Release Recap

Categories: General

Just over a month ago, our CEO, Mike Olson, announced the availability of Cloudera’s Distribution for Hadoop (beta 2), or CDH3b2. As Charles, our head of Product Management, explained in a subsequent blog post, this release of CDH removes a lot of the complexity we’ve seen organizations encounter when deploying Hadoop within an existing data management infrastructure.

By packaging Hadoop core together with a suite of additional projects for data collection,


Introducing Cloudera Desktop

Categories: General

Today at Hadoop World NYC, we’re announcing the availability of Cloudera Desktop, a unified and extensible graphical user interface for Hadoop. The product is free to download and can be used with either internal clusters or clusters running on public clouds.

At Cloudera, we’re focused on making Hadoop easy to install, configure, manage, and use for all organizations. While there exist many utilities for developers who work with Hadoop,


Sending Files to Remote Task Nodes with Hadoop MapReduce

Categories: Hadoop MapReduce

It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
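The lookup-table scenario above can be sketched in plain Java. This is only an illustration of the task-side pattern (parse the side file once, then consult it for every record); it does not call Hadoop's DistributedCache API, and the class name `LookupJoinSketch`, the tab-separated file format, and the `UNKNOWN` default are all assumptions made for the example. In a real job the reader would be opened on the local copy of the file that the distributed cache places on each task node.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Hadoop and not taken from the original post.
public class LookupJoinSketch {

    // Parse "key<TAB>value" lines into an in-memory table.
    // A task would do this once, before it processes any records.
    static Map<String, String> loadLookup(BufferedReader reader) throws IOException {
        Map<String, String> table = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                table.put(parts[0], parts[1]);
            }
            // Lines without a tab are skipped (an assumption for this sketch).
        }
        return table;
    }

    // Stand-in for the per-record map step: append the looked-up value to each record.
    static List<String> mapRecords(List<String> records, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(record + "\t" + table.getOrDefault(record, "UNKNOWN"));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory string stands in for the cached file on the task node.
        String cachedFile = "us\tUnited States\nfr\tFrance\n";
        Map<String, String> table = loadLookup(new BufferedReader(new StringReader(cachedFile)));
        for (String line : mapRecords(Arrays.asList("fr", "us", "xx"), table)) {
            System.out.println(line);
        }
    }
}
```

Loading the table once per task, rather than once per record, is the reason the file needs to be present on the task node before execution begins.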

The DistributedCache was introduced in Hadoop 0.7.0;


Thrift, Scribe, Hive, and Cassandra: Open Source Data Management Software

Categories: General

Apache Hadoop exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the Apache 2.0 license:

    • Thrift: A cross-language RPC framework that powers many of Facebook’s services, including search, ads, and chat. Among other things, Thrift defines a compact binary serialization format that is often used to persist data structures for later analysis.
