Apache Hadoop for Archiving Email

This post explores a specific use case for Apache Hadoop, one that is not commonly recognized but is gaining interest behind the scenes: converting, storing, and searching email messages on the Hadoop platform for archival purposes.

Most of us in IT and data center operations know the challenges of storing years of corporate mailboxes and giving users an interface to search them when needed. The sheer volume of messages, the structure and complexity of their content, the migration process, and the need for timely search results are the key points to address before embarking on an actual implementation. Practices vary: some organizations keep all email messages on production servers; others create backup dumps and store them on tape; and some have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails, both because of the critical information they hold and for legal compliance, investigations, and similar needs. That said, let’s look at how Hadoop could help make this process simpler, more cost effective, more manageable, and scalable.

Big files are ideal for Hadoop. It can store them, provide fault tolerance, and process them in parallel when needed. As such, the first step in the journey to an email archival solution is to convert your email messages to large files. In this case we will convert them to flat files called sequence files, composed of binary key-value pairs.  One way to accomplish this is to:

  • Put all the individual email message files into a folder in HDFS.
  • Use something like WholeFileInputFormat/WholeFileRecordReader (as described for small file conversion in Tom White’s book, Hadoop: The Definitive Guide) to read the contents of each file as the value and the file name as the key (see Figure 1: Message Files to Sequence File).
  • Use IdentityReducer (which outputs its input values directly) to write the records into sequence files.

(Note: Depending on where your data is and where the bandwidth bottlenecks are, simply spawning standalone JVMs to create sequence files directly in HDFS could also be an option.)
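To make the key-value idea concrete, here is a minimal plain-Java sketch of packing many small message files into one container of binary key-value records, much as a standalone JVM might. It is an illustration only: the class name KeyValuePacker and its length-prefixed layout are assumptions made for this sketch, not the real SequenceFile format, which you would write through Hadoop’s own API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a simplified length-prefixed record layout,
// NOT the on-disk format of a Hadoop SequenceFile.
class KeyValuePacker {

    // Writes a record count, then each entry as:
    // key (UTF string), value length (int), value bytes.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(container))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical file names and contents, for demonstration.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("msg-001.msg", "first message".getBytes(StandardCharsets.UTF_8));
        files.put("msg-002.msg", "second message".getBytes(StandardCharsets.UTF_8));
        byte[] container = pack(files);
        System.out.println(unpack(container).keySet());
    }
}
```

The point of the layout is the same as with sequence files: the file name travels with the content as the key, so thousands of small files collapse into one large file without losing track of which bytes belong to which message.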

If you are dealing with millions of files, one way of sharding (partitioning) them is to create sequence files by day, week, or month, depending on how many email messages your organization has. This limits the number of message files you need to put into HDFS at once to something more manageable, say 1-2 million at a time, which matters because every file in HDFS takes up NameNode memory. Once a batch is in HDFS and converted into sequence files, you can delete the original files and proceed to the next batch. Here is what a very basic map method in a Mapper class could look like; all it does is emit the filename as the key and the file’s binary content as the value.

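[Code listing, callout 1: the Mapper class. Not reproduced here; see the complete code sample linked at the end of this post.]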

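Here is what the main driver looks like:

[Code listing, callouts 2 to 5: the driver class. Not reproduced here; see the complete code sample linked at the end of this post.]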

Code Walkthrough:

  1. The mapper emits the filename as the key and the file content, as a BytesWritable, as the value.
  2. The driver sets the input path (where the message files are stored) and the output path (where the output sequence file will be saved).
  3. WholeFileInputFormat is used as the input format, which in turn uses WholeFileRecordReader to read each file’s content in its entirety. The content of the file is sent to the mapper as the value.
  4. This block enables compression. With gzip you could get a ratio of around 10:1, but a plain gzip file has to be processed as a whole. With lzo the ratio is about 4:1, but files can be split and processed in parallel. Note that inside a sequence file, compression is applied per record or per block, so the sequence file itself remains splittable.
  5. This is where we set the mapper’s output key to Text and its output value to BytesWritable, and set the mapper class we created and IdentityReducer as the reducer class. Since the reduce phase has no processing of its own to do, IdentityReducer works well in this case.
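The day/week/month batching described earlier can also be sketched in plain Java. This is a hypothetical helper, not part of the code sample: the name MonthlyBatcher, and the idea of keying each file by a received timestamp interpreted in UTC, are assumptions for illustration.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MonthlyBatcher {

    // Groups message file names by the UTC year-month of their timestamp.
    // Each resulting batch could then become one sequence file.
    static Map<YearMonth, List<String>> batchByMonth(Map<String, Instant> receivedAt) {
        Map<YearMonth, List<String>> batches = new TreeMap<>();
        for (Map.Entry<String, Instant> entry : receivedAt.entrySet()) {
            YearMonth month = YearMonth.from(entry.getValue().atZone(ZoneOffset.UTC));
            batches.computeIfAbsent(month, m -> new ArrayList<>()).add(entry.getKey());
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical file names and timestamps, for demonstration.
        Map<String, Instant> receivedAt = new LinkedHashMap<>();
        receivedAt.put("msg-001.msg", Instant.parse("2011-08-15T10:00:00Z"));
        receivedAt.put("msg-002.msg", Instant.parse("2011-09-01T08:30:00Z"));
        for (Map.Entry<YearMonth, List<String>> batch : batchByMonth(receivedAt).entrySet()) {
            System.out.println(batch.getKey() + " -> " + batch.getValue());
        }
    }
}
```

If a month still holds more files than you want in HDFS at one time, the same grouping works with a finer key such as the week or the day.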

Figure 1: Message Files to Sequence File

With compression turned on, you can get I/O as well as storage benefits.
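How much you save depends heavily on your data, so it is worth measuring on a sample of your own messages before picking a codec. The sketch below uses only the JDK’s java.util.zip gzip classes (not Hadoop’s codec classes) to compute a compression ratio; the class name GzipRatio and the sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRatio {

    // Compresses the given bytes with gzip.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip data back to the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Ratio of original size to compressed size (higher is better).
    static double ratio(byte[] raw) {
        return (double) raw.length / gzip(raw).length;
    }

    public static void main(String[] args) {
        // Repetitive sample text; real mail will compress differently.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            body.append("Quarterly sales report attached. Regards, Sales Team\n");
        }
        byte[] raw = body.toString().getBytes(StandardCharsets.UTF_8);
        System.out.printf("raw=%d gzip=%d ratio=%.1f:1%n",
                raw.length, gzip(raw).length, ratio(raw));
    }
}
```

Running this against a representative set of real messages, rather than the repetitive sample text, gives a more honest picture of the storage you can expect to save.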

So far, we have taken our message files and converted them into sequence files, which takes care of the conversion and storage portions. As for searching those files: if the email messages are Outlook messages, we can use Apache POI (the Java API for Microsoft Documents) to parse them into Java objects, search the contents of the messages, and output the results as needed.

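A quick example of code to perform the search and output the results looks like this, with the mapper first, followed by the driver class:

[Code listing, callouts 1 and 2: the search Mapper class. Not reproduced here; see the complete code sample linked at the end of this post.]

[Code listing, callout 3: the search driver class. Not reproduced here; see the complete code sample linked at the end of this post.]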

Code Walkthrough:

  1. BytesWritable comes in to the Mapper as the value from the sequence file and is converted to a MAPIMessage from Apache POI.
  2. The actual search is performed here. In this example, it just searches for emails that contain “sales” in the recipient email addresses. All of the sample emails have “sales” in a recipient address, so the results should include every one of them. Of course, this is where the bulk of the logic would go if you needed extensive search features, including parsing through various attachments. In this example we write the results (the complete email message) directly to the local filesystem so that they can be viewed in messaging applications such as Outlook.
  3. This is the job configuration, where the input format is set to SequenceFileInputFormat so that the job reads the sequence files created above. Since there is no mapper output, we set the output format to NullOutputFormat.
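The matching logic in step 2 can be sketched independently of Hadoop and POI. In the plain-Java example below, the Email class is a hypothetical stand-in for a parsed message, and the case-insensitive comparison is an assumption of this sketch; the original sample may match differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class RecipientSearch {

    // Hypothetical stand-in for a parsed message; a real job would take
    // these fields from the parser (for example POI's MAPIMessage).
    static final class Email {
        final String fileName;
        final List<String> recipients;

        Email(String fileName, List<String> recipients) {
            this.fileName = fileName;
            this.recipients = recipients;
        }
    }

    // True if any recipient address contains the term, ignoring case.
    static boolean matches(Email email, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String address : email.recipients) {
            if (address.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Returns the file names of all matching messages, in input order.
    static List<String> search(List<Email> emails, String term) {
        List<String> hits = new ArrayList<>();
        for (Email email : emails) {
            if (matches(email, term)) {
                hits.add(email.fileName);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up messages and addresses, for demonstration.
        List<Email> emails = Arrays.asList(
            new Email("msg-001.msg", Arrays.asList("Sales@example.com", "cfo@example.com")),
            new Email("msg-002.msg", Arrays.asList("hr@example.com")));
        System.out.println(search(emails, "sales"));
    }
}
```

In a MapReduce job, the body of matches would sit inside the map method, and each hit would be written out instead of collected in a list.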

In this example, we write the complete emails out to the local filesystem one at a time. As an alternative, we could pass the results to reducers and write all the results as Text into HDFS. Which approach to use depends on your needs.

In this post I have described how to convert email files into sequence files and store them in HDFS, and I have looked at how to search through them and output the results. Given the “simply add a node” scalability of Hadoop, it is very straightforward to add more storage as well as search capacity. Furthermore, Hadoop clusters are built using commodity hardware, the software itself is open source, and the framework makes it simple to implement specific use cases. This leads to an overall solution that is very cost effective compared to a number of existing software products that provide similar capabilities. The search portion of the solution, however, is very rudimentary. In part 2, I will look at using Lucene/Solr for indexing and searching in a more standard and robust way.

(You can get a complete code sample at: https://github.com/cloudera/emailarchive)

8 Responses
  • Julien Nioche / September 29, 2011 / 5:11 AM

    Interesting.

    Apache Tika wraps POI and can be used to parse emails with Hadoop (via Behemoth) as explained on http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html

    This should work for cases where the emails are not in the Outlook format. Behemoth can also be used for text analysis / NLP before indexing to Solr.

  • Alex Ott / October 01, 2011 / 4:37 AM

    +1 for use of Apache Tika for parsing documents & text extraction…

  • Mark Kerzner / January 04, 2012 / 10:21 AM

    Hi, Sunil,

    this one, as well as the next posts, are very interesting, but can you tell us if you are talking about a real-life use case? What does behind-the-scenes mean, as a custom archiving solution, as an investigative tool, or something else?

    Thank you,
    Mark

  • Jeremy / November 12, 2012 / 9:07 AM

    Where can I find part 2 of this post?
