Tag Archives: Hive

Sending Files to Remote Task Nodes with Hadoop MapReduce

Categories: Hadoop MapReduce

It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.

The DistributedCache was introduced in Hadoop 0.7.0;

Read more

Thrift, Scribe, Hive, and Cassandra: Open Source Data Management Software

Categories: General

Apache Hadoop exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the Apache 2.0 license:

    • Thrift: A cross-language RPC framework that powers many of Facebook’s services, include search, ads, and chat. Among other things, Thrift defines a compact binary serialization format that is often used to persist data structures for later analysis.

    Read more