Apache Hadoop moves fast. Users often find that they need to upgrade after just a few months. Upgrading can be a daunting task, especially if you are several versions behind. We’ve been working with Rackspace for a while now, and they recently embarked on an upgrade from Hadoop 0.15.3 to Cloudera’s Distribution for Hadoop based on 0.18.3. Stu Hood, Search Team Technical Lead at Rackspace, was kind enough to document their experience, and we’re happy to share it with you here.
In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution –
A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.
Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts. RPMs are the standard way of installing software on Red Hat-based Linux distributions (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install,
It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
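The lookup-table scenario above can be sketched in plain Python. This is only an illustration of the pattern (parse the side file once per task, then consult it for every record), not Hadoop's API: the function names `load_lookup` and `map_records`, the tab-separated file layout, and the `UNKNOWN` default are all assumptions made for the example, and an in-memory string stands in for the file that the distributed cache would have copied to the task node.

```python
import csv
import io


def load_lookup(handle):
    """Parse a tab-separated lookup table (key<TAB>value) into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def map_records(records, lookup, default="UNKNOWN"):
    """Yield (record_id, resolved_value) for each 'id<TAB>code' record."""
    for line in records:
        record_id, code = line.rstrip("\n").split("\t", 1)
        # Codes missing from the lookup table fall back to the default.
        yield record_id, lookup.get(code, default)


# Stand-in for the cached file (hypothetical contents). In a real job this
# would be a local copy of the file on the task execution node.
cached_file = io.StringIO("US\tUnited States\nDE\tGermany\n")

# The table is parsed once, before any records are processed.
lookup = load_lookup(cached_file)

records = ["1\tUS\n", "2\tFR\n", "3\tDE\n"]
results = list(map_records(records, lookup))
```

Loading the table before the record loop is the point of the pattern: the parsing cost is paid once per task rather than once per record.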
The DistributedCache was introduced in Hadoop 0.7.0;