This post was contributed by Kurt Rohloff, a researcher in the Information and Knowledge Technologies group of Raytheon BBN Technologies, a wholly owned subsidiary of Raytheon Company.
Using Hadoop to Build a Scalable, Distributed Triple Store
The driving idea behind Semantic Web is to provide a web-scale information sharing model and platform. One of the singular advancements over the past several years in the Semantic Web domain has been the explosion of data available in semantic formats. Unfortunately, Semantic Web data processing technologies are deployed on a single (or a small number of) machine(s) at a time. This is fine when data is small, but current methodologies create horrible data processing and analysis bottlenecks. These scalability constraints are the biggest barriers in achieving the fundamentally web-scale Semantic Web vision of Tim Berners-Lee. More importantly, these limitations have hindered the broader adoption of Semantic Web technologies.
Some of my fellow scientists and engineers at BBN and I have started to address these scalability limitations in the Semantic Web by developing a work-in-progress cloud-based triple-store technology we call SHARD (Scalable, High-Performance, Robust and Distributed) that enables scalable data processing and analysis based on the Cloudera Distribution for Hadoop implementation of the MapReduce formalism. My cowokers and I found that this formalism enables a game-changing approach to scalable, robust and lightly coordinated parallel data processing that avoids some of the coordination bottlenecks endemic in distributed computing environments.
Our SHARD triple-store persists data as standard RDF triples and runs queries over this data using the standard SPARQL query-language. We deployed an early version of SHARD into Amazon EC2 and ran the standard LUBM triple-store benchmark. We found that SHARD already performs better than current industry-standard triple-stores for datasets on the order of a billion triples. A small example graph can be seen in Figure 1.
This small graph contains 7 triples Kurt lives in Cambridge, Kurt owns an object car0, car0 is a car, car0 was made by Ford, car0 was made in Detroit, Detroit is a city and Cambridge is a city.
The SPARQL query language is the standard query language for querying the triple data. SPARQL semantics are remarkably similar to the more well-known SQL. An example SPARQL query for the above graph data can be seen immediately below.
?person :owns ?car .
?car :a :car .
?car :madeIn :Detroit .
The above SPARQL query has three clauses and asks for all matches to the variable ?person such that ?person owns an entity represented by the variable ?car which is a car and was made in Detroit. The above query can be represented as a directed graph as seen in Figure 2.
Processing of SPARQL queries such as the one above consists of identifying which variables in the query clauses can be bound to nodes in the data graph such that the query clauses align with the data triples. An example of this alignment for our example query and data can be seen in Figure 3.
The Shard Triple Store
Our functional design goals for the SHARD triple-store are to:
1. Serve as a persistent store for triple data in RDF format.
2. Serve as a SPARQL endpoint to process SPARQL queries.
The SHARD triple-store is designed to persist graph data and process SPARQL queries. Input (data, queries) and output (query results) are passed to/from SHARD through the HDFS file system. We made this design decision with the understanding that the input data and output results are generally very large and not feasible to output directly to the user. In light of our experience with large-scale Semantic Web users, we think it is most efficient to save query results to disk for eventual use by users.
Triple Data Persistence
We persist data in SHARD in flat files in the HDFS file system such that each line of the triple-store text file represents all triples associated with a different subject. Consider the following exemplar line saved in SHARD from the LUBM domain that represents three triples associated with the entity subject Pub1:
Pub1 :author Prof0 :name “Pub1” a :Publication
This line represents that the entity Pub1 has an author entity Prof0, Pub1 has a name Pub1 and that Pub1 is a publication.
Although this approach to persisting triple data as flat text files is rudimentary as compared to other triple-store approaches, we found that it offers a number of important benefits for several general application domains. For one, this approach to saving files in HDFS brings a level of automated robustness to the triple-store that would normally be difficult to develop using other distributed file systems. The data is also stored in a simple, easy to read format that lends itself to easier drill-down diagnostics of query results returned by the triple-store. Most importantly, however, although this approach to storing triples is inefficient for query processing that requires the inspection of only a small number of triples, this approach is very efficient in the context of Hadoop for scanning over large sets of triples to respond to queries that will generate a large number of results.
Query Processing Overview
The query processing engine in SHARD is designed to iterate over the clauses in the queries using the triple data and incrementally attempt to bind query variables to literals in the triple data while satisfying all of the query constraints. Each step of the iteration consists of a MapReduce operation for a single clause in the query. A schematic overview of this iterative query binding process can be seen in Figure 4.
We made the design decisions that intermediate results of the SPARQL query operation are cached to disk to speed later similar queries. We found that this capability is very useful in practice when users make frequent, similar queries of the triple data.
The first map MapReduce step maps the triple data to a list of variable bindings which satisfy the first clause of the query. The key of the Map step is the list of variable bindings. The Reduce step removes duplicate results and saves them to disk with the variable bindings as the key.
The intermediate query binding steps continue to iteratively bind variables to literals as new variables are introduced by processing successive query clauses and/or filtering the previous bindings which cannot fit the new clauses. The intermediate steps perform a MapReduce operation over both the triple data and the previously bound variables which were saved to disk.
The ith intermediate Map step identifies all variables in the triple-data which satisfy the ith clause and saves this result with the key being any variables in the ith clause which appeared in previous clauses. The value of this Map step is the bindings of other variables not previously seen in the query clauses, if any. This iteration of the Map set also rearranges the results of the previous variable bindings saved to disk to the same of a variable key in the ith clause that appeared in previous clauses. The value of this key-value pair are the list of variable bindings which occurred in previous clauses but not in the ith clause.
The ith Reduce step runs a join operation over the intermediate results from the Map step by iterating over all pairs of results from the previous clause and the new clause with the same key assignment.
This iteration of map-reduce-join continues until all clauses are processed and variables are assigned which satisfy the query clauses. SHARD is designed to save intermediate results of the query processing to speed up the processing of similar later queries.
The final MapReduce step consists of filtering bound variable assignments to satisfy the SELECT clause of the SPARQL query. In particular, the Map step filters each of the bindings, and the Reduce step removes duplicates where the key value for both Map and Reduce are the bound variables in the SELECT clause.
PROOF OF CONCEPT EXPERIMENTATION
To test the performance of the SHARD triple-store design, we deployed an early version of SHARD onto an Amazon EC2 cloud environment of 20 XL compute nodes running the Cloudera distribution of Hadoop. The version of SHARD we deployed for evaluation supports basic SPARQL query functionality (without support for prefixes, optional clauses or results ordering) over full RDF data. Additionally, the deployed version of SHARD does not perform any query manipulation/reordering/etc normally done for increased performance by SPARQL endpoints and the deployed version of SHARD does not take advantage of any possible query caching made possible by our design choices.
We used the standard LUBM triple-store benchmark to evaluate the performance of SHARD. The LUBM benchmark creates artificial data about the publishing, coursework and advising activities of students and faculty in departments in universities. We used code from the LUBM benchmark to generate triple data for 6000 universities which is approximately 800 million triples to parallel the performance evaluations made in a previous triple-store comparison study.
After loading the triple data into the SHARD triple store, we evaluated the performance of SHARD in responding to queries 1, 9 and 14 of LUBM as was done in the previous triple-store study. Query 1 is very simple and asks for the students that take a particular course and returns a very small set of responses. Query 9 is relatively more complicated query with a triangular pattern of relationships – it asks for all teachers, students and courses such that the teacher is the adviser of the student who takes a course taught by the teacher. Query 14 is relatively simple as it asks for all undergraduate students (but the response is very large). Except for Query 1, SHARD performed better than other known technologies. Also, due to the inherent scalability of the Hadoop and HDFS approach in SHARD, the SHARD triple-store could potentially be used for extremely large datasets (trillions of triples) without requiring any specialized hardware as required for other monolithic triple-stores.
Development work is ongoing with SHARD. Based on our experience with the initial SHARD deployment, we have several short- and long-term activities to further improve performance and applicability.
First among the improvements to be made to SHARD is a more intelligent indexing capability to better optimize query performance on small queries. Additional performance of SHARD in a targeted production environment could be provided by using cached partial results, as outlined above. This will require additional capability for the triple store to track what partial results were previously cached and possibly to track which cached results could be thrown out to save disk space in the cloud (if this is a user concern.)