Tag Archives: log

Job Scheduling in Apache Hadoop

Categories: Hadoop MapReduce

(guest blog post by Matei Zaharia)

When Apache Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users.


Configuring and Using Scribe for Hadoop Log Collection

Categories: Data Ingestion

As promised in my post about installing Scribe for log collection, I’m going to cover how to configure and use Scribe to collect Hadoop logs. In this post I’ll describe how to create the Scribe Thrift client for use in Java, add a new log4j Appender to Hadoop, configure Scribe, and collect logs from each node in a Hadoop cluster. At the end of the post, I will link to all source and configuration files mentioned in this guide.


Installing Scribe For Log Collection

Categories: Data Ingestion

Scribe is a newly released log collection tool that sends log files from the nodes in a cluster to Scribe servers, where the logs are stored for further use. Facebook describes its use of Scribe by saying, “[Scribe] runs on thousands of machines and reliably delivers tens of billions of messages a day.” Scribe turns out to be rather difficult to install, so this post is meant to help those of you attempting it.


Thrift, Scribe, Hive, and Cassandra: Open Source Data Management Software

Categories: General

Apache Hadoop exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the Apache 2.0 license:

    • Thrift: A cross-language RPC framework that powers many of Facebook’s services, including search, ads, and chat. Among other things, Thrift defines a compact binary serialization format that is often used to persist data structures for later analysis.
