Configuring and Using Scribe for Hadoop Log Collection

Categories: Data Ingestion

As promised in my post about installing Scribe for log collection, I’m going to cover how to configure and use Scribe for the purpose of collecting Hadoop logs.  In this post I’ll describe how to create the Scribe Thrift client for use in Java, add a new log4j Appender to Hadoop, configure Scribe, and collect logs from each node in a Hadoop cluster. At the end of the post, I will link to all source and configuration files mentioned in this guide.

What’s the advantage of collecting Hadoop’s logs?
Collecting Hadoop’s logs in to one single log file is very convenient for debugging and monitoring purposes. Essentially, what Scribe lets you do is tail -f your entire cluster, which is really cool to see :). Having one single log file on one, or even just a few, machines makes log analysis insanely simpler. Making log analysis simpler means the creation of monitoring and reporting tools just got a lot easier as well. Let’s get started …

Create the Scribe Thrift Client for Java
In order to use Scribe with Java, you need to use Thrift to create a Scribe client stub, which is a collection of generated Java files to be used by our custom log4j Appender, which will be covered later. For now, cd in to your Scribe distribution directory (I used trunk), and then cd in to the ‘if’ directory.  Now you’re in $SCRIBE_DIST/if.  Edit scribe.thrift by changing the ‘include’ line that includes fb303.thrift to the following:

Now cd in to your Thrift distribution and then in to contrib/fb303/if, which will bring you to $THRIFT_DISTRIBUTION/contrib/fb303/if. Edit fb303.thrift by changing the ‘include’ line that includes reflection_limited.thrift to the following:

Once you’ve edited each of these thrift files, return back to $SCRIBE_ROOT/if and run:

This will most likely present you with a bunch of deprecation warnings that you should ignore. The important thing is that you now have a folder called ‘gen-java’ that contains three Java classes, which all encapsulate the Scribe Thrift Java client.

The last steps are to create Thrift and fb303 jars. Let’s start with Thrift. cd to $THRIFT_HOME/lib/java and simply run ‘ant’ to build libthrift.jar. Once ant finishes with no errors, libthrift.jar should appear in the folder you ran ant from. As for fb303, cd to $THRIFT_HOME/contrib/fb303/java and run ‘ant’ to build libfb303.jar. Once ant finishes with no errors, libfb303.jar should appear in $THRIFT_HOME/contrib/fb303/java/build/lib, not in the same directory that ant was ran from. Copy libthrift.jar and libfb303.jar to $HADOOP_HOME/lib, where they can be used by our custom log4j Appender.

Create a custom log4j Appender to write logs to Scribe
To create a custom Appender, you must extend org.apache.log4j.AppenderSkeleton and overwrite the following methods: activateOptions(), append(LoggingEvent event), close(), and requiresLayout(). Learn more about these methods’ signatures here. Reminder: I’ve posted a working Appender at the end of this post if you’d like to see the entire source code. activateOptions() should look something like this:

Note that you’ll need this in a try-catch block. The activateOptions() method initializes the Appender. append(LoggingEvent event) should look something like this:

Again, you’ll need this in a try-catch block. append(LoggingEvent event) actually does the logging. You can customize this message as much as you would like. One essential modification is to include the class that issued the log and the hostname of the machine that it occurred on. The issuing class is easy: look in to the getLocationInformation() method mentioned here. As for the hostname, I was unable to fetch the hostname programmatically (I kept getting a null hostname), so I had to choose an alternative solution. I’ll talk more about this in the next section.

Finally, close() should have a single line: this.transport.close(), and requiresLayout() should return true;.

Configure log4j to use your Appender
Now that we have an Appender that will write to Scribe, we need to tell log4j to use it. Edit $HADOOP_CONF/ (probably $HADOOP_ROOT/conf/, and add the following lines:

Then, ensure that “scribe” is in your “hadoop.root.logger” variable:

and also ensure that “scribe” is in your “log4j.rootLogger” variable:

Note: this last change may be redundant, but I did it anyway.

Essentially what we just did was define a new logging output, and we called it “scribe.” We told log4j what Appender to use for Scribe, and we told our Scribe Appender what the hostname of this node is (see more info below). We also told log4j the log format that we’re interested in.

The following line calls our Appender’s setHostname(String hostname) method, which is the (less-than-ideal) solution to getting the hostname.

This means that we need to modify our Appender to have a getter and setter for the “hostname” private field. They should be public methods, the setter taking a single argument of type String with no return type, and the getter taking no arguments, returning a String.

Configuring and running Scribe
So let’s reiterate what we just did. We built a Scribe Thrift client that we use in our custom log4j Appender. We then configured log4j to use this new Appender. Lastly, we just need to configure Scribe and get it running.

Working with the Scribe examples makes it very clear how one should configure Scribe. Take a look at $SCRIBE_DISTRIBUTION/examples/*.conf to get a sense of what does what. Your Scribe pipeline should be configured as follows:

Hadoop writes a log -> log4j sends this log to our Appender -> the Appender sends a Scribe message to -> the local Scribe server forwards the message to your central Scribe server on port 1464 -> the central Scribe server stores the log

Note that the local and central Scribe servers should be configured to have secondary storage. The local server should write secondary storage to its own disk, in a writable directory, and the central server should write secondary storage to /tmp, which is guaranteed to be a writable directory. Also note that you should choose a clever max_size value, one that will work conveniently with HDFS if you ever plan to write these logs to HDFS. In my example below, I set this value to 128000000 (128MB).

Once you’ve configured Scribe, run your central server on any node other than the NameNode, and run the local server on every node. Finally, bounce Hadoop and tail your central server’s log file. You’ll be able to see exactly what’s happening in your Hadoop cluster in real time. Pretty rad, huh?

This guide turned out to be much longer than I anticipated. There are a lot of files to configure, a decent amount of code to write, and lots of pieces to glue together. For your convenience, I’ve posted example source and configuration files below:

Scaling Scribe
The configuration above will not scale. First, Hadoop creates a decent amount of log data, so storing these on a local disk won’t work in the long run. Second, having one machine collect logs from many hundreds or thousands of computers will clearly not work either. In a very large cluster, many different “central” Scribe servers should be installed behind a virtual IP or a reverse proxy. In a setup like this, Scribe connections will be multiplexed to several servers, and Scribe servers can be added or removed without system failure.

If you choose to use a reverse proxy, you can poll the the fb303 counters of the central Scribe servers to perform simple load balancing. Note, however, that Scribe connections are persistent. That is, they are not created for each logged message. Instead client-server connections stay open for long periods of time, making complex control flow logic difficult.

If you have any questions, then please shoot us an email. Otherwise, expect another post soon that covers Hadoop+Scribe benchmarking.


3 responses on “Configuring and Using Scribe for Hadoop Log Collection