Analyzing Apache logs with Apache Pig

(guest blog post by Dmitriy Ryaboy)

A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Apache Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.

In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.

There are many software packages that can do this kind of analysis automatically for you on average-sized log files, of course. However, many organizations log so much data and require such custom analytics that these ordinary approaches cease to work. Hadoop provides a reliable method for scaling storage and computation; PigLatin provides an expressive and flexible language for data analysis.

Our log files are in Apache’s standard CombinedLogFormat. It’s a tad more complicated to parse than tab- or comma-delimited files, so we can’t just use the built-in PigStorage() loader. Luckily, there is already a custom loader in the PiggyBank built specifically for parsing these kinds of logs.

First, we need to get the PiggyBank from Apache. The PiggyBank is a collection of useful add-ons (UDFs) for Pig, contributed by the Pig user community. There are instructions on the Pig website for downloading and compiling the PiggyBank. Note that you will need to add pig.jar to your CLASSPATH environment variable before running ant.
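As a rough sketch, the build looks something like this (the repository URL and layout below are from memory and may have changed, so treat the instructions on the Pig site as authoritative):

# Check out the Pig source (Pig was a Hadoop subproject at the time).
svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk pig
cd pig

# Build pig.jar and put it on the classpath so the PiggyBank build can find it.
ant
export CLASSPATH=$CLASSPATH:$(pwd)/pig.jar

# Build the PiggyBank itself; this produces piggybank.jar under contrib/piggybank/java.
cd contrib/piggybank/java
ant

The resulting piggybank.jar is the file we register in the script below.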

Now, we can start our PigLatin script by registering the piggybank jarfile and defining aliases for the loader and UDF we will be using.

register /home/dvryaboy/src/pig/trunk/piggybank.jar;
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

By the way — the PiggyBank contains another useful loader, called MyRegExLoader, which can be instantiated with any regular expression when you declare it with a DEFINE statement. Useful in a pinch.
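For example, a simple space-delimited, three-column log could be loaded with something along these lines (the file name and field names are made up for illustration, and it is worth double-checking the exact package path against the PiggyBank source):

DEFINE SimpleLoader org.apache.pig.piggybank.storage.MyRegExLoader('([^ ]*) ([^ ]*) ([^ ]*)');
raw = LOAD 'three_column.log' USING SimpleLoader AS (ip, day, uri);

Each capturing group in the regular expression becomes one field of the resulting tuple.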

While we are working on our script, it may be useful to run in local mode, only reading a small sample data set (a few hundred lines). In production we will want to run on a different file. Moreover, if we like the reports enough to automate them, we may wish to run the report every day, as new logs come in. This means we need to parameterize the source data location. We will also be using a database that maps IPs to geographic locations, and we probably want to parameterize that as well.

%default LOGS 'access_log.small'
%default GEO 'GeoLiteCity.dat'

To specify a different value for a parameter, we can use the -param flag when launching the Pig script:

# pig -x mapreduce -f scripts/blogparse.pig -param LOGS='/mirror.cloudera.com/logs/access_log.*'

For mapping IPs to geographic locations, we use a third-party database from MaxMind.  This database maps IP ranges to countries, regions, and cities.  Since the data from MaxMind lists IP ranges, and our logs list specific IPs, a regular join won’t work for our purposes. Instead, we will write a simple script that takes a parsed log as input, looks up the geo information using MaxMind’s Perl module, and outputs the log with geo data prepended.

The script itself is simple. It reads in a tuple representing a parsed log record, checks the first field (the IP) against the database, and prints the data back to STDOUT:

#!/usr/bin/env perl

use warnings;
use strict;
use Geo::IP::PurePerl;

my ($path) = shift;
my $gi = Geo::IP::PurePerl->new($path);

while (<>) {
    chomp;
    if (/([^\t]*)\t(.*)/) {
        my ($ip, $rest) = ($1, $2);
        my ($country_code, undef, $country_name, $region, $city) = $gi->get_city_record($ip);
        print join("\t", $country_code || '', $country_name || '', $region || '',
                   $city || '', $ip, $rest), "\n";
    }
}

Getting this script into Pig is a bit more interesting. The Pig Streaming interface provides us with a simple way to ship scripts that will process data, and cache any necessary objects (such as the GeoLiteCity.dat file we downloaded from MaxMind).  However, when the scripts are shipped, they are simply dropped into the current working directory. It is our responsibility to ensure that all dependencies—such as the Geo::IP::PurePerl module—are satisfied. We could install the module on all the nodes of our cluster; however, this may not be an attractive option. We can ship the module with our script—but in Perl, packages are represented by directories, so just dropping the .pm file into cwd will not be sufficient, and Pig doesn’t let us ship directory hierarchies.  We solve this problem by packing the directory into a tarball, and writing a small Bash script called “ipwrapper.sh” that will set up our Perl environment when invoked:

#!/usr/bin/env bash
tar -xzf geo-pack.tgz
PERL5LIB=$PERL5LIB:$(pwd) ./geostream.pl $1

The geo-pack.tgz tarball simply contains geostream.pl and Geo/IP/PurePerl.pm.
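Assuming the module has been copied into a local Geo/IP/ directory next to the script, the tarball can be built with a one-liner:

tar -czf geo-pack.tgz geostream.pl Geo/IP/PurePerl.pm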

We also want to make the GeoLiteCity.dat file available to all of our nodes. It would be inefficient to simply drop the file in HDFS and reference it directly from every mapper, as this would cause unnecessary network traffic.  Instead, we can instruct Pig to cache a file from HDFS locally, and use the local copy.

We can express all of the above in a single DEFINE statement:

DEFINE iplookup `ipwrapper.sh $GEO`
    ship ('ipwrapper.sh')
    cache('/home/dvryaboy/tmp/$GEO#$GEO');

We can now write our main Pig script. The objective here is to load the logs, filter out obviously non-human traffic, and using the rest, calculate the distribution of downloads by country and by Apache project.

Load the logs:

logs = LOAD '$LOGS' USING LogLoader AS (remoteAddr, remoteLogname, user, time,
    method, uri, proto, status, bytes, referer, userAgent);

Filter out records that come from non-humans (Googlebot and such), that aren’t Apache-related, or that only check the headers and do not download any content.

 logs = FILTER logs BY bytes != '-' AND uri matches '/apache.*';

-- project just the columns we will need
logs = FOREACH logs GENERATE remoteAddr, DayExtractor(time) AS day, uri, bytes, userAgent;

-- The filtering function is not actually in the PiggyBank.
-- We plan on contributing it soon.
notbots = FILTER logs BY (NOT org.apache.pig.piggybank.filtering.IsBotUA(userAgent));

Get country information, group by country code, aggregate.

with_country = STREAM notbots THROUGH iplookup AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_groups = GROUP with_country BY country_code;

geo_uri_group_counts = FOREACH geo_uri_groups GENERATE group, COUNT(with_country) AS cnt, SUM(with_country.bytes) AS total_bytes;

geo_uri_group_counts = ORDER geo_uri_group_counts BY cnt DESC;

STORE geo_uri_group_counts INTO 'by_country.tsv'; 

The first few rows look like:

Country Hits Bytes
USA 8906 2.0458781232E10
India 3930 1.5742887409E10
China 3628 1.6991798253E10
Mexico 595 1.220121453E9
Colombia 259 5.36596853E8

At this point, the data is small enough to plug into your favorite visualization tools. We wrote a quick-and-dirty Python script to take logarithms and use the Google Chart API to draw this map:

Bytes by Country

This is pretty interesting. Let’s do a breakdown by US states.

Note that with the upcoming Pig 0.3 release, you will be able to have multiple stores in the same script, allowing you to re-use the loading and filtering results from earlier steps. With Pig 0.2, this needs to go in a separate script, with all the required DEFINEs, LOADs, etc.
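To give a feel for the 0.3 approach, here is an untested sketch of how the two reports might share one pipeline (it assumes the notbots and with_country aliases built earlier in this post):

-- hypothetical multi-store script: both reports reuse one LOAD/FILTER/STREAM pipeline
geo_groups = GROUP with_country BY country_code;
geo_counts = FOREACH geo_groups GENERATE group, COUNT(with_country) AS cnt, SUM(with_country.bytes) AS total_bytes;
STORE geo_counts INTO 'by_country.tsv';

us_only = FILTER with_country BY country_code == 'US';
state_groups = GROUP us_only BY state;
state_counts = FOREACH state_groups GENERATE group, COUNT(us_only) AS cnt, SUM(us_only.bytes) AS total_bytes;
STORE state_counts INTO 'by_state.tsv';

For now, the state breakdown below runs as its own script: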

 us_only = FILTER with_country BY country_code == 'US';

by_state = GROUP us_only BY state;

by_state_cnt = FOREACH by_state GENERATE group, COUNT(us_only.state) AS cnt, SUM(us_only.bytes) AS total_bytes;

by_state_cnt = ORDER by_state_cnt BY cnt DESC;

STORE by_state_cnt INTO 'by_state.tsv';

Theoretically, Apache selects an appropriate server based on the visitor’s location, so our logs should show a heavy skew towards California. Indeed, they do (recall that the intensity of the blue color is based on a log scale).

Bytes by US State

Now, let’s get a breakdown by project. To get a rough mapping of URI to Project, we simply get the directory name after /apache in the URI. This is somewhat inaccurate, but good for quick prototyping. This time around, we won’t even bother writing a separate script — this is a simple awk job, after all! Using streaming, we can process data the same way we would with basic Unix utilities connected by pipes.

 uris = FOREACH notbots GENERATE uri;

-- note that we have to escape the dollar sign for $3,
-- otherwise Pig will attempt to interpret this as a Pig variable.
project_map = STREAM uris THROUGH `awk -F '/' '{print \$3;}'` AS (project);

project_groups = GROUP project_map BY project;

project_count = FOREACH project_groups GENERATE group, COUNT(project_map.project) AS cnt;

project_count = ORDER project_count BY cnt DESC;

STORE project_count INTO 'by_project.tsv'; 

We can now take the by_project.tsv file and plot the results (in this case, we plotted the top 18 projects, by number of downloads).
Downloads by Project

We can see that Tomcat and Httpd dwarf the rest of the projects in terms of file downloads, and the distribution appears to follow a power law.

We’d love to hear how folks are using Pig to analyze their data. Drop us a line, or comment below!


9 Responses
  • David Gwartney / June 18, 2009 / 12:30 PM

    This is great information. It’s answered my question regarding streaming inside Pig that came up the other day.

    Thanks for the info.

    Dave

  • Joshua Barratt / June 28, 2009 / 11:35 AM

    Great article, have been looking at needing to process web logs on a large scale already and this is a very useful example.

    For the ‘packing and shipping a perl script’ part, check out Par::Packer.

    It allows you to turn a Perl script and all its dependencies into a single monolithic script file.

  • Mike / January 25, 2011 / 3:58 PM

    What’s the status of “org.apache.pig.piggybank.filtering” ? I don’t see it yet in piggybank, as of release 0.8

  • androm / July 11, 2011 / 2:02 PM

    Hi, I was able to run this in local mode, but when I tried to run this in Hadoop mode, I received the error msg "ERROR 2055: Received Error while processing the map plan: 'ipwrapper.sh GeoLiteCity.dat ' failed with exit status: 2"

    Any ideas? Thanks!

  • Dam / September 16, 2011 / 8:29 AM

    For my shipped script, I build an auto-extracting bash script this way (alongside geostream.pl and its dependencies).
    wrapper.sh.template (755 mode):
    #!/bin/bash
    sed '0,/^__ARCHIVE_BELOW__$/d' $0 | tar xj
    export PERL5LIB=$PERL5LIB:$(pwd)
    ./geostream.pl $1
    rm -rf ./*
    exit 0

    __ARCHIVE_BELOW__

    And then I build my script with this command:
    cp wrapper.sh.template wrapper.sh; tar cjf - Geo geostream.pl >> wrapper.sh

    Then I can use it in my Pig script:
    DEFINE iplookup `wrapper.sh $GEO`
    ship ('wrapper.sh')
    cache('/GeoIP/$GEO#$GEO');

    The data file /GeoIP/GeoLiteCity.dat is stored in HDFS and copied with cache.
