Building a distributed concurrent queue with Apache ZooKeeper

Categories: ZooKeeper

In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution –

Cloudera’s Distribution for Apache Hadoop: Making Hadoop Easier for a Sysadmin

Categories: Hadoop

A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.

Perhaps the most useful features in our distribution, at least for sysadmins, are the RPM packages and init scripts. RPMs are the standard way of installing software on Red Hat-family Linux distributions (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install,

Sending Files to Remote Task Nodes with Hadoop MapReduce

Categories: Hadoop MapReduce

It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
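To make the lookup-table scenario concrete, here is a minimal sketch in plain Python of the pattern the cache supports: parse a side file once when a task starts, then consult it for every record. It does not use Hadoop's API. The names `load_lookup_table` and `LookupMapper`, the comma-separated table format, and the tab-separated record format are all assumptions made for illustration, and an in-memory string stands in for the file that the framework would have copied to the task node.

```python
import csv
import io


def load_lookup_table(lines):
    """Parse 'key,value' lines into a dict. Runs once per task, before any records."""
    return {row[0]: row[1] for row in csv.reader(lines) if len(row) >= 2}


class LookupMapper:
    """Hypothetical stand-in for a map task; this is not a Hadoop class."""

    def __init__(self, cached_file):
        # In a real job the file would already sit on the task node's local
        # disk. Here any iterable of text lines works (assumption).
        self.lookup = load_lookup_table(cached_file)

    def map(self, record):
        # Assumed record format: "<code>\t<amount>".
        code, amount = record.split("\t")
        # Replace the code with its looked-up name; unknown codes get a marker.
        return self.lookup.get(code, "UNKNOWN"), int(amount)


# Simulate the cached lookup file and a handful of input records.
cached_file = io.StringIO("us,United States\nde,Germany\n")
mapper = LookupMapper(cached_file)
results = [mapper.map(r) for r in ["us\t3", "de\t5", "fr\t1"]]
```

The table is parsed in the constructor rather than inside `map`, so the cost of reading the file is paid once per task instead of once per record.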

The DistributedCache was introduced in Hadoop 0.7.0;

