Tag Archives: python

State of the Elephant 2008

Categories: Community General Hadoop

It’s a new year, the time when we take a moment to look back at the previous one, and forward to what might be coming next. In the world of Hadoop a lot happened in 2008.

Organization

At the beginning of the year, Hadoop was a sub-project of Lucene. In January, Hadoop became a Top Level Project at Apache, in recognition of its success and diversity of community. This allowed sub-projects to be added,

Read more

Sending Files to Remote Task Nodes with Hadoop MapReduce

Categories: Hadoop MapReduce

It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.

The DistributedCache was introduced in Hadoop 0.7.0;

Read more

Installing Scribe For Log Collection

Categories: Data Ingestion

Scribe is a newly released log collection tool that dumps log files from various nodes in a cluster to Scribe servers, where the logs are stored for further use.  Facebook describes their usage of Scribe by saying, “[Scribe] runs on thousands of machines and reliably delivers tens of billions of messages a day.”  It turns out that Scribe is rather difficult to install, so the hope of this post is to help those of you attempting to install Scribe.

Read more