Tackling Large Scale Data in Government

This is a guest post provided by Booz Allen Hamilton data analysis consultant Aaron Cordova. Aaron specializes in large-scale distributed data processing systems.

Working within the U.S. federal government arena provides plenty of opportunities to encounter large-scale data analysis. Projects ranging from massive health studies to high-velocity network security events to new sources of imagery and video represent a huge increase in the amount of data that must be not only stored but processed quickly and efficiently. These challenges are at once a daunting and exciting chance to turn data into a positive impact for the country. The large-scale data processing technologies created and supported by Cloudera play a big part in answering that challenge.

Often our clients have an immediate need to analyze the data at hand, to discover patterns, reveal threats, monitor critical systems, and make decisions about the direction the organization should take. Several constraints are always present: the need to implement new analytics quickly enough to capitalize on new data sources, limits on the scope of development efforts, and the pressure to expand mission capability without an increase in budgets. For many of these organizations, the large data processing stack (which includes the simplified programming model MapReduce, distributed file systems, semi-structured stores, and integration components, all running on commodity class hardware) has opened up a new avenue for scaling out efforts and enabling analytics that were impossible in previous architectures.

We have found this new ecosystem to be remarkably versatile at handling various types of data and classes of analytics. When working to help solve clients’ large-scale data analysis problems, we first take a comprehensive look at their existing environment, the resources available, the nature of their data sources, and the immediate questions that must be answered of the data. Usually a large-scale data processing solution is composed of several pieces that together provide the desired capability, ranging from real-time tipping to vast index-and-query capabilities to periodic, deep analysis of all the data available. Constructing a solution almost always requires one or more new, highly scalable components from the large-scale data analysis software stack, integrated with conventional data storage and processing software. The ability to pick and choose which elements of the stack to include, backed by well-defined interfaces and in some cases interoperability standards, is essential to making the system work. This is a major reason we value the open source community and the concept of a large-scale data analysis ecosystem.

Perhaps the most exciting benefit of moving to these highly scalable architectures, however, is that once we’ve solved the immediate issues, often with a system that handles today’s requirements and can scale to 10x or more, new analytics and capabilities become remarkably easy to develop, evaluate, and integrate thanks to the speed and ease of MapReduce, Pig, Hive, and other technologies. More than ever, the large-scale data analysis software stack is proving to be a platform for innovation.

The response to the challenge of large-scale data analysis continues to evolve, and there is room for ongoing innovation. One example: as large-scale data systems, or clouds, become more numerous, integrating analysis across those clouds remains an area of open research. Even integrating data sources, existing systems, and delivery mechanisms within departments of the same enterprise can be a challenge and may require new solutions.

Recently, while researching the problem of large-scale biometric search, we at Booz Allen recognized the need for a highly scalable, low-latency method of performing fuzzy matching, i.e., returning the most similar items when there is no exact match, driven by growing databases of fingerprints and the requirement to identify individuals quickly. It was clear that Hadoop would provide a good platform on which to build a new distributed fuzzy-matching capability, for several reasons. The ability to run MapReduce over all stored biometrics would help in clustering the data to reduce the search space. The data distribution and replication features of HDFS would provide a reliable storage system with enough parallelism to support fast queries running on multiple machines. The result of our research is a system we developed called FuzzyTable.

FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built on Hadoop. It can use any matching algorithm that compares two items, often high-dimensional, and returns a similarity score. This makes it suitable not only for comparing fingerprints but also other biometric modalities, images, audio, and anything else that can be represented as a vector of features.

Fuzzy matching involves extracting features from the data of interest and running a given fuzzy-matching, distance, or similarity algorithm over the resulting feature vectors to produce a numerical score.
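As a concrete illustration of that scoring step, the sketch below computes a cosine similarity between two feature vectors. This is only a stand-in example: FuzzyTable’s matcher is pluggable, and the actual fingerprint-matching algorithm is not described in this post.

```python
# A minimal sketch of the scoring step: any function that takes two feature
# vectors and returns a similarity score could be plugged in; cosine
# similarity is used here purely as an illustrative stand-in.
import math

def cosine_similarity(a, b):
    """Return a similarity score in [-1, 1] for two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two toy feature vectors standing in for extracted fingerprint features.
print(cosine_similarity([0.2, 0.8, 0.1], [0.25, 0.75, 0.05]))
```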

Our work involved developing two major components – a clustering process that we use to reduce the total search space for each query, and a client-server system for performing fuzzy matching on demand and in parallel across a Hadoop instance.

In the first step, we use two clustering algorithms – canopy clustering and k-means – from the Apache Mahout project to assign each biometric to a bin. Each bin contains biometrics that are statistically similar, and each bin has a ‘mean biometric’ that represents an average of the biometrics it contains.
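As a rough sketch of this offline binning step, the snippet below uses scikit-learn’s KMeans purely as a stand-in for the Mahout canopy/k-means jobs described above; the feature matrix, cluster count, and dimensionality are all made up for illustration.

```python
# Offline binning sketch. Mahout's canopy clustering and k-means run over
# Hadoop in the real pipeline; scikit-learn's KMeans stands in here only to
# show how bins and their "mean biometrics" (centroids) come about.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((10_000, 128))        # toy stand-in for biometric feature vectors

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(features)
mean_biometrics = kmeans.cluster_centers_   # one "mean biometric" per bin
bin_of = kmeans.labels_                     # bin assignment for each stored biometric

# Group the stored biometrics by bin; in FuzzyTable each bin becomes a set of
# HDFS blocks spread across DataNodes.
bins = {b: features[bin_of == b] for b in range(kmeans.n_clusters)}
```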

When performing a query for the best match, we first find the bin the query biometric scores closest to by comparing it to the list of ‘mean biometrics’ produced by our k-means processing. This allows us to skip a large portion of the data set and search only the bin (or the small number of bins) containing the items most similar to the biometric in question.
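A hypothetical helper for that bin-selection step might look like the following; `mean_biometrics` is the array of centroids from the clustering step, and Euclidean distance is only one possible choice of comparison.

```python
# Query-time bin selection sketch (hypothetical helper, not FuzzyTable's API):
# compare the query vector to every "mean biometric" and keep the closest
# bin(s), so only a small fraction of the stored data has to be scanned.
import numpy as np

def closest_bins(query, mean_biometrics, n_bins=1):
    """Return the indices of the n_bins centroids nearest to the query vector."""
    distances = np.linalg.norm(mean_biometrics - query, axis=1)
    return np.argsort(distances)[:n_bins]

# e.g. bins_to_search = closest_bins(query_vector, mean_biometrics, n_bins=2)
```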

The FuzzyTable client then looks up the locations of all blocks, or chunks, that contain biometrics for that bin by querying the FuzzyTable master server. Blocks are not replicas; rather, each contains a different set of data (although using replication to increase the number of concurrent queries is possible). These blocks live on several HDFS DataNodes so that a single bin can be searched by several machines at once.

Finally, the FuzzyTable client submits the biometric query and the ID of the closest bin to the FuzzyTable query servers running alongside the HDFS DataNodes. The query servers sift through the blocks of the closest bin, comparing the query biometric to each stored biometric, and return the most similar matches. The client collects the matches from all of the query servers and displays a ranked list to the user.
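The scatter-gather shape of that query path can be sketched as follows. A thread pool stands in for the query servers colocated with the DataNodes, `similarity` is whatever pluggable scoring function is in use, and the block and ID layout is invented for illustration.

```python
# Scatter-gather sketch of the query path. In FuzzyTable the scans run inside
# query servers next to the HDFS DataNodes; a thread pool stands in for them
# here. Each block is assumed to be a list of (item_id, feature_vector) pairs.
from concurrent.futures import ThreadPoolExecutor
import heapq

def scan_block(block, query, similarity):
    """Score every stored biometric in one block against the query."""
    return [(similarity(query, item), item_id) for item_id, item in block]

def fuzzy_query(query, blocks_of_closest_bin, similarity, top_k=10):
    """Scan the blocks of the closest bin in parallel and merge the results."""
    with ThreadPoolExecutor() as pool:
        per_block = pool.map(lambda blk: scan_block(blk, query, similarity),
                             blocks_of_closest_bin)
    # Flatten the per-block candidate lists and keep the highest-scoring matches.
    candidates = [hit for hits in per_block for hit in hits]
    return heapq.nlargest(top_k, candidates)   # ranked list of (score, item_id)
```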

All of this takes place in a few seconds. Because Hadoop distributes our data automatically across multiple machines, we only had to write the code that runs comparisons on local data. HDFS exposes the locations of data blocks, making it possible to schedule each comparison on the machine that holds the data; this data locality is also the key to the speed of MapReduce.

The following graph shows how query times (in milliseconds) improved dramatically as we added machines. After about seven machines, our query times became dominated by fixed costs that did not increase with the number of machines. Our test cluster could easily store several times more data than our test set before we would see query times increase again.

[Graph: query time in milliseconds versus number of machines in the cluster]

Large-scale data stack components such as Hadoop and Mahout proved to be a perfect platform on which to create this new capability. Undoubtedly the trend of innovation will continue as we further explore the possibilities of these new reliable and easy-to-use software platforms that achieve scalability through parallel computation over distributed shared-nothing systems.

Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a century. Providing a broad range of services in strategy and organization, technology, operations, and analytics, Booz Allen is committed to delivering results that endure. To learn more, visit www.boozallen.com.

Aaron Cordova is a data analysis consultant at Booz Allen Hamilton, who specializes in large-scale distributed data processing systems.

11 Responses
  • Ben Standefer / November 23, 2010 / 12:19 PM

    I saw this talk at Hadoop World. It was really, really exciting, especially because I thought that at the end Aaron would announce that he’s open-sourced all of the stuff he got us so amped about. Aaron, are there any plans of open-sourcing the generic parts of this? I think it would be very interesting and there would be a lot of interest.

    -Ben Standefer

  • Kim Vogt / November 23, 2010 / 1:16 PM

    +1 for open sourcing all or parts of this

  • Aaron Cordova / November 23, 2010 / 2:58 PM

    Ben,

    Thanks for the interest. We are indeed working on making this project open source. Right now we’re trying to determine how to best do so. I’ll post updates as we near a decision.

    If you have a preference regarding licensing terms, etc., please share it and we will take that into consideration.

    Thanks,

    Aaron Cordova

  • Ben Standefer / November 23, 2010 / 3:51 PM

    Wow, that would be awesome Aaron. Thanks for the response. It all looks really awesome and I would love to take a closer look at it. Good luck on getting through the hoops to open source it. Let me know if I can help in any way.

    -Ben Standefer

  • Otis Gospodnetic / April 09, 2011 / 5:19 AM

    After having this tab open in my browser since November 2010 (it is now April 2011), I finally read this post. Good stuff. My immediate thoughts were:
    “Nice. This sounds very generic and flexible. I wonder if it’s open source.”

    I see you were working towards open-sourcing this and I wonder if that’s still happening?

    In terms of license – ASL v2 is nice and friendly.

    I also have a question:
    You mention fixed costs dominating the performance after 7 machines are involved. Could you please elaborate on what those fixed costs are and what the slowest part of the system is? (btw. open-sourcing might help you eliminate that bottleneck)

    Thanks.

  • Otis Gospodnetic / April 09, 2011 / 6:32 PM

    I’m also curious about that “mean biometric”. Could you please give an example of that so we can better understand what that looks like? Thanks.
