Apache Hadoop for Archiving Email – Part 2

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this, we use Solr/Lucene-style indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Apache Lucene and Apache Solr

Apache Lucene is a mature, high-performance, full-featured Java API for indexing and searching that has been around since the late nineties; it supports field-specific indexing and searching, sorting, highlighting, and wildcard searches, to name only a few features. Everything in Lucene boils down to creating a Lucene document from an artifact such as an email message, HTML, PDF, XML, Word, or Excel file, whose contents are parsed and added to the document as name/value pairs. A number of libraries are available for extracting the actual content, depending on the artifact type; when extracting content from .msg email files, for instance, Apache Tika and Apache POI are useful.

Once the name/value pairs extracted from the email content have been added to the document and the document has been written with an IndexWriter, the indexing portion is taken care of. An IndexSearcher can then be used to search through the indexed contents, as illustrated below in Figure 1:

Figure 1: Indexing and Searching using Lucene.
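
To make that flow concrete, here is a minimal sketch against the Lucene 3.x API (the version, field names, and sample values are assumptions chosen for illustration): a document is built from extracted email fields, written with an IndexWriter, and then queried with an IndexSearcher.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneEmailExample {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();   // in-memory index, just for the example
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // Indexing: add one "email" as a document of name/value pairs
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35, analyzer));
        Document doc = new Document();
        doc.add(new Field("subject", "Quarterly results", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", "Please find the Q3 numbers attached.", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Searching: parse a query against the "content" field and print the hit count
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_35, "content", analyzer).parse("Q3");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("Matches: " + hits.totalHits);
        searcher.close();
      }
    }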

Apache Solr, on the other hand, is a Lucene-based full-text search server with XML, JSON, and HTTP APIs. It has a web admin interface and provides extensive caching, replication, and distributed search, as well as the ability to add custom plugins. Solr already bundles various parsing libraries, including Tika, POI, and TagSoup.

Figure 2 below illustrates the Solr components and deployment architecture:

Figure 2: Solr Components and Deployment Architecture.
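
For a sense of what talking to such a Solr deployment looks like from Java, here is a rough SolrJ sketch; the URL, field names, and the 3.x-era CommonsHttpSolrServer class are assumptions, and the fields used must exist in your Solr schema.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrEmailExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical Solr instance; adjust the URL to your own deployment
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Index one email over HTTP
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "msg-0001");
        doc.addField("subject", "Quarterly results");
        doc.addField("content", "Please find the Q3 numbers attached.");
        server.add(doc);
        server.commit();

        // Query it back
        QueryResponse response = server.query(new SolrQuery("subject:quarterly"));
        System.out.println("Found: " + response.getResults().getNumFound());
      }
    }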

Next, let’s explore how you can use both Solr and Lucene within the Hadoop environment for indexing and searching massive amounts of data:

First, you need to get data into HDFS, as covered in Hadoop for Archiving Email – Part 1. Once the data is there, you can run MapReduce jobs that create indexes in parallel and write them either to HDFS or to the Local File System. If an index is stored on the Local File System, simply serve it from there by pointing Solr at it; Solr can run in single, replicated, or distributed mode, depending on the size of the index to serve. However, if you need to make search available to only a small number of users, you can simply keep the index directly in HDFS and provide an interface for your users to access it there.

One tool that I find very handy for providing such an interface is Luke. Luke was built for development and diagnostic purposes, and can be used to search, display and browse the results of a Lucene index. With Luke, you can view documents, analyze results, copy and delete them, or optimize indexes that have already been built. The best part: you can easily operate on multipart indexes stored directly in HDFS, as illustrated in Figure 3:

Figure 3: Indexing and Searching within HDFS or Local Filesystem.

Having discussed design at a high level, let’s now dive deeper into the details of MapReduce for creating an index.
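
Before looking at the mapper itself, here is a rough sketch of a driver that could launch such an indexing job using the old mapred API. The input path, the assumption that the .msg files were packed into SequenceFiles (filename as key, raw bytes as value), and the EmailIndexMapper class and index.hdfs.path property are all hypothetical and tie into the mapper sketches below.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.lib.NullOutputFormat;

    public class IndexDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(IndexDriver.class);
        job.setJobName("email-index-builder");

        // Map-only job: each mapper builds and writes its own Lucene index
        job.setMapperClass(EmailIndexMapper.class);
        job.setNumReduceTasks(0);

        // Assumes the .msg files were packed into SequenceFiles (filename -> raw bytes)
        job.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/archive/emails"));

        // The mappers write the indexes themselves, so no MapReduce output is collected
        job.setOutputFormat(NullOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("index.hdfs.path", "/indexes/emails");   // read by the mapper's close()

        JobClient.runJob(job);
      }
    }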

Here is how the configure portion of each mapper could look:

  1. Initialize the analyzer and the index writer configuration.
  2. If writing to HDFS, use a RAMDirectory to hold the index as it is built; once complete, flush it to HDFS.
  3. If writing to a local file system, simply create an FSDirectory pointing at the target location.
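
Putting those steps together, a minimal sketch of such a configure method might look like the following, assuming Lucene 3.x, the old mapred API, and the hypothetical job properties used in the driver above (imports are omitted for brevity):

    public class EmailIndexMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {

      private JobConf conf;
      private IndexWriter writer;
      private RAMDirectory ramDir;   // non-null only when the index will be flushed to HDFS

      @Override
      public void configure(JobConf job) {
        this.conf = job;
        try {
          // 1. Initialize the analyzer and the index writer configuration
          Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
          IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35, analyzer);

          Directory dir;
          if (job.getBoolean("index.to.hdfs", true)) {
            // 2. Writing to HDFS: build the index in memory, flush it to HDFS in close()
            ramDir = new RAMDirectory();
            dir = ramDir;
          } else {
            // 3. Writing to the local file system: point an FSDirectory at the target location
            dir = FSDirectory.open(new File(job.get("index.local.path", "/tmp/email-index")));
          }
          writer = new IndexWriter(dir, iwc);
        } catch (IOException e) {
          throw new RuntimeException("Could not initialize the index writer", e);
        }
      }

      // map() and close() are sketched in the sections that follow
    }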

Having configured the mapper, let’s look at the map method:

Once the mapper is configured, each map call boils down to parsing the message content and adding it to the writer. The code does the following (a sketch follows the list):

  1. Create a Lucene document.
  2. Parse the incoming email content into a MAPIMessage.
  3. Extract the necessary fields and add them to the index. (In this example I only extract recipient, subject, and content. You can add more fields as necessary, and use HBase to store the content if needed.)
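
Continuing the hypothetical EmailIndexMapper above, the map method might look roughly like this; POI's MAPIMessage does the parsing, and messages that fail to parse (or lack one of the fields) are simply skipped and counted:

      @Override
      public void map(Text key, BytesWritable value,
                      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
          // Parse the raw .msg bytes into a MAPIMessage (Apache POI)
          MAPIMessage msg = new MAPIMessage(
              new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));

          // Create a Lucene document and add the extracted fields to it
          Document doc = new Document();
          doc.add(new Field("recipient", msg.getDisplayTo(), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("subject", msg.getSubject(), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("content", msg.getTextBody(), Field.Store.YES, Field.Index.ANALYZED));
          writer.addDocument(doc);
        } catch (Exception e) {
          // Count and skip messages that cannot be parsed or are missing a field
          reporter.incrCounter("EmailIndexing", "ParseFailures", 1);
        }
      }

A production version would handle missing fields individually rather than skipping the whole message.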

At this point, the index has been created – either in memory or the Local File System. The final task is to close the index to make it searchable, which can be done within the close method of the mapper, as demonstrated below:

  1. Close the created index.
  2. In the case of HDFS, walk through the index in memory and write it to HDFS.
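
For the same hypothetical mapper, the close method might look like the sketch below: it closes the writer and, in the HDFS case, copies every file of the in-memory index into a per-task directory. The index.hdfs.path property and the use of mapred.task.id for the directory name are assumptions.

      @Override
      public void close() throws IOException {
        writer.close();   // 1. Close (and commit) the index

        if (ramDir != null) {
          // 2. Walk through the in-memory index and write each of its files to HDFS
          FileSystem fs = FileSystem.get(conf);
          Path outDir = new Path(conf.get("index.hdfs.path"), conf.get("mapred.task.id"));
          byte[] buf = new byte[64 * 1024];

          for (String name : ramDir.listAll()) {
            IndexInput in = ramDir.openInput(name);
            FSDataOutputStream out = fs.create(new Path(outDir, name));
            long remaining = in.length();
            while (remaining > 0) {
              int chunk = (int) Math.min(buf.length, remaining);
              in.readBytes(buf, 0, chunk);
              out.write(buf, 0, chunk);
              remaining -= chunk;
            }
            out.close();
            in.close();
          }
          ramDir.close();
        }
      }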

The index should now have been created, either on the Local File System of each of the DataNodes or directly in HDFS. If it is on the Local File System, you can opt to make the directory part of the “www” directory and have Solr serve it from there. If it is in HDFS, you could load the index into a RAMDirectory within each mapper and search it there, use a tool like Luke to provide a search interface, or put a mechanism in place to copy it to the Local File System and point Solr at it.

Appending

Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.
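
As a rough illustration (Lucene 3.x API assumed, with analyzer and newDocument coming from sketches like the ones above), a local append looks like this; the commented-out addIndexes call is the kind of merge step you would use for the HDFS case once the new index is somewhere the writer can read:

    // Open the existing index in APPEND mode and add new documents to it
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35, analyzer)
        .setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/data/indexes/email")), iwc);   // hypothetical local path
    writer.addDocument(newDocument);

    // To merge a freshly built index (e.g. one copied out of HDFS) into this one instead:
    // writer.addIndexes(FSDirectory.open(new File("/data/indexes/email-new")));

    writer.close();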

SolrCloud and Katta

Since we are discussing fast search options, it also makes sense to touch on components like SolrCloud and Katta.

SolrCloud enables clusters of Solr instances to be created, with a central configuration, automatic load balancing, resizing, rebalancing and fail-over.

Katta serves indexes in a distributed manner similar to HDFS. It has built-in replication for fail-over and performance, is easy to integrate with Hadoop clusters, and has master fail-over. However, it does not provide real-time updates, nor is it an indexer; it is simply a serving tool for Lucene indexes.

In Part 3 of this series, I will cover ways to ingest such email messages and how to tie the steps involved into a workflow. In the meantime, drop us a line if you have any questions about storing email messages in Hadoop and indexing and searching them with Solr and Lucene.


5 Responses
  • Otis Gospodnetic / January 03, 2012 / 11:26 PM

    Nice post.
    Some comments:
    * SolrCloud is not mature yet, but it’s getting there. (see Sematext Blog for more info)
    * Katta is kind of abandoned.
    * ElasticSearch is another option for the indexing/search part of the story.
    * RAMDirectory is a bit evil, see https://issues.apache.org/jira/browse/LUCENE-3659 for more info and https://issues.apache.org/jira/browse/LUCENE-2292 for its possible replacement.

  • tiru / January 24, 2012 / 10:12 PM

    Nice post.
    instead of RAMDirectory can't we use hdfs for storing data (index)?
    another main point is how can we configure solr instance to hdfs is missing i think

  • Jason Rutherglen / February 15, 2012 / 4:51 PM

    There is no need to use RAMDirectory.

    SOLR-1301 addresses generating indexes in Hadoop and could be further improved.
