Implementing Temporal Graphs with Apache TinkerPop and HGraphDB

Categories: Graph Processing Hadoop HBase How-to

When most people think of Big Data, often they imagine loads of unstructured data. However, there is always some sort of structure or relationships within this data. Based on these relationships there are one or more representation schemes best suited to handle this type of data. A common pattern seen in the field is hierarchy/relationship representation. This form of representation is adept in handling scenarios like complex business models, chain of event or plans, chain of stock orders in banks, user relations in social media website etc. making it popular among financial institutions and social media world.

Storing and changing these hierarchical data sets can be a challenge, especially if data is in temporal form (time period attached to the data expresses when it was valid). One simple way to effectively store this data can be using graphs.  A lot of the time these hierarchical data sets are in the form of a tree which is just a specialized version of the graph. Storing data in this form has some inherent advantages such as easy representation / visualization of data, fast and well-known traversal algorithms proven across the industry, easy manipulation of parts of data without changing or duplicating the entire datasets.

This blog explains how to use an up and coming graph framework like Apache TinkerPop[1] over big data storage like Apache HBase[4] to simulate the hierarchical structures. HBase was chosen as a storage layer because of its fast retrieval and scalability. Additionally, HBase is also suitable for handling low response times demanded by online applications while still having Big Data storage capabilities.

Apache TinkerPop

What is Apache TinkerPop?
Apache TinkerPop is an open source graph computing framework composed of various interoperable components. TinkerPop uses simple and intuitive terms like Vertices (nodes of a graph representing data points) and Edges (connections between these vertices representing relationships between data points) to represent complex hierarchical data structure of nodes and relationships between them.

TinkerPop also comes with its own graph traversal language called Gremlin, which makes traversing nodes and searching for relations between them as easy as specifying vertices and looking for specific edges between them.
More details for Apache TinkerPop is available at: http://tinkerpop.apache.org/docs/current/reference/

As mentioned earlier Apache TinkerPop comprises interoperable components. The following diagram from Apache TinkerPop documentation[1] will show this in more detail.

Apache TinkerPop interoperable components

 

This interoperability means any underlying storage and processing engine can be used to support scalability and meet required SLAs with industry standard tools.
For instance, Apache Spark can be used as the graph processor which can take the load of processing large and complex graphs using standard Hadoop clusters.
Similarly, the underlying graph database can also be any storage platform like Apache HBase which supports fast updates and retrieval, together with storage of vast amounts of data.
This interoperability is all possible because of the provider APIs, which means if there is an implementation of these API for the required platform then it can be used.
There are various implementations for the provider API to support different underlying storages.

HGraphDB

One such implementation discussed in this blog is HGraphDB[2] implemented by Robert Yokota at Yammer. HGraphDB is an implementation of Apache TinkerPop 3 interface that uses Apache HBase as the underlying graph database. Like TinkerPop HGraphDB stores nodes and relationship between them as vertices and edges in HBase Tables.

HGraphDB uses a tall table approach to store graph data in Apache HBase. That is, each of the graph’s nodes and relationships between them is stored in one HBase row as vertex and edge. The Unique Key (row IDs) for these rows is the combination of VertexID and EdgeID. HGraphDB also stores “created” and “updated” time for all vertices and edges along with multiple user-defined property columns for better management and searching of vertices and edges. These user-defined properties are what makes storing and updating temporal graphs easier without duplicating the entire graph.

Apart from this, HGraphDB also provides indexes for faster traversal of graphs. These indexes are based on user-defined properties, one per index.

HGraphDB uses five tables in Apache HBase and their structure is explained in more detail below.

Vertex Table:

This table stores all the vertices. Columns and their represented values for this table are as follows:
Label: Text label assigned to vertices for multiple types of vertices
Created: Timestamp when a vertex was created
Updated: Timestamp of the last update for this vertex
Property #: User-defined property for this vertex and its associated value. There can be as many property columns as required

Example of one entry for a graph node is as below

vertex table

example of vertex table entry

Edge Table:

This table stores all the edges between graph nodes and weights or values (Property) assigned to these edges are stored as user defined property column. Columns and their represented values for this table are as follows:
Label: Text label assigned to vertices for multiple types of vertices
Created: Timestamp when an Edge was created
Updated: Timestamp of the last update for this Edge
From Vertex: Vertex where this edge originates from
To Vertex: Destination vertex for this edge
Property #: User defined property for this Edge and its associated value. There can be as many property columns as required

An example of one such edge entry is as below.

edge table

Example of edge table entry

Vertex Index table:

This table stores indexes of all the vertices for faster searching vertices based on their property values.

Columns and their represented values for this table are as follows:
Key: combination of vertex label, property for which this index is created, value of the property and vertex ID
Created at: Timestamp of creation of this index value
Vertex ID: ID of the vertex that is being indexed

Following is an example of the table:

vertex index table

Edge Index table:

Similarly, vertex index table, this table stores indexes for all edges for faster searching based on property values.
Columns and their represented values for this table are as follows:
Key: combination of vertex ID of the first vertex associated with this edge, direction of edge related to first vertex, property for which this index is created, label of this edge, value of the property and vertex ID for the second vertex associated with this edge
Created at: Timestamp of creation of this index value

Following is an example of same.

edge index table

Example of edge index table

Index Metadata table:

This table contains metadata related to the indexes created on each property and specifies whether the index is currently active or not.
Columns and their represented values for this table are as follows:
Key: A unique value
Created at: Timestamp of creation of this index metadata value
Updated at: Timestamp of the last update for this index metadata value
State: Current state of this index if it’s active or not

Following is an example of the table.

index metadata table

Example of index metadata table

More details and source code for HGraphDB is available at
https://github.com/rayokota/hgraphdb

Working Project

By this point you should have a good understanding of base components and their working, the rest of the blog will describe an example data generator and query module to show how to implement hierarchical structure using Apache HBase as the graph database. Choosing HBase here because of its scalability and fast response times.

Code for this project is available at
https://github.com/ashishtyagicse/TemporalGrapsUsingHBase-TinkerPop

This project has two main components

  • Data generator: Generates test data / graph
  • Query engine: Queries the generated data to test performance and functionality

Data Generator:

Code in Java class with name “datagen.java” generates multiple hierarchies in a form of a balanced binary tree. It persists the data into Apache HBase using HGraphDB.

Following is a call to HGraphDB for generating a graph in HBase

Parameters for graph generated can be controlled by constants file[3] which defines constants like number of nodes, suffixes and prefixes used for vertex and edge names etc.

Graph generated by the datagen has following property’s

  • Each vertex has a VertexID, a combination of prefix defined in constants file and node number
  • All Vertices have one user defined property called Created_on, a value of seconds from Epoch. This is a test property to demonstrate how to define properties of a vertex.

  • Same as vertices, edges have an EdgeID, a combination of prefix defined in constants file and edge number
  • Edges in this graph has two user defined property “Since” and “Till”
  • “Since” property  defines when an edge was created
  • “Till” property defines till when an edge was active
  • Together these two properties help  maintain the temporal nature of a graph
  • Apart from a static base graph, the datagen also generates some updates to the base graph which acts as temporal data and shows a different state of the graph at a given point of time.
  • These temporal data points are created by creating new edges and manipulating “Till” property of existing edges.

Running this datagen will generate required number of hierarchal data sets in HBase each with a binary tree/ graph complete with random temporal data.

Query Engine:

After data is generated in HBase by datagen, queryHierarchy java file code query’s graphs sitting in HBase using Apache TinkerPop.

Query engine in this code can get two types of data from graphs in HBase

  • Get the required root of a node at any given point in time
  • Get the entire hierarchy up to a given level as it appears at a given point in time

Query Root:

To get the all the nodes in path from a given node to its root at any given point in time the code performs the following series of steps

  • First get a graph instance. Use same HGraphDB command as a create graph but this time with create as false.
  • Once we have the graph instance we can then use Gremlin commands to find the requested vertex in question.

  • Next, find all the incoming edges of this vertex and check their since and till property to find the one edge that was active for the requested time.
  • This will give an active parent for the requested node at a specified time.

  • Once the parent is found, repeat the same process as above to get the entire path till root.

These simple setup demonstrated how to use edge properties to maintain and manipulate the same vertex for creating different hierarchies in time i.e. having temporal graphs.

Query Hierarchy:

Let’s say you would like to find information about full lifecycle of a stock order. (i.e. you need to reconstruct all parent orders for this order). In this case, a query for entire hierarchy is needed for that  day.

To get the entire hierarchy, code performs the following series of steps

  • Same as with root query, first get a graph instance. This can be done by same HGraphDB command as a create graph but this time with create as false.
  • Then find anyone random vertex of this graph that was active for given moment in time.
  • Once this vertex is found perform a Query root operation as described above to get the root for this vertex for that given moment in time.
  • After root for hierarchy is found for given time the entire hierarchy can be rebuild
  • Rebuilding the entire hierarchy tree is split in multiple threads using a threadpool. One vertex at a time is assigned to a thread from this pool for processing.

  • Within this thread, check all the out edges of this vertex and then find the once that were active at required point of time.
  • After edges are found check their out vertex and continue pushing those vertexes to thread pool until end of hierarchy is not reached.

Conclusion

Some of the takeaways from this blog are:

  • Common perception is that Big Data is unstructured but in most cases, there is some sort of structure to this data
  • In field, hierarchal/relationship representation is more common
  • Graphs are best for representation of hierarchical data
  • Graph frameworks like TinkerPop are best as they offer interoperable components
  • HBase is best Storage layer when scalability and online applications type response time are required
  • HGraphDB provides an easy implementation for using HBase as graph database for TinkerPop framework

In conclusion, handling temporal hierarchical data is tricky and can take a large amount of processing power, storage and subject expertise.

But with help of tools like Apache TinkerPop and HGraphDB, it can be done with relative ease while still taking advantage of scalable storage engines like Apache HBase and processing power of Apache spark on a Hadoop cluster. In addition, using TinkerPop framework and HGraphDB could be more productive as Gremlin style traversal language is more intuitive for people across the industry, and reduces the effort for building custom traversal logic.

Resources

  1. Apache TinkerPop
  2. HGraphDB
  3. Datagen and Query Engine
  4. Apache HBase

Please note, as Apache TinkerPop and HGraphDB are not tested or included with Cloudera CDH, further support or assistance with these Apache projects would be found in the Apache open source community and mailing lists for those projects.

 

Facebooktwittergoogle_pluslinkedinmailFacebooktwittergoogle_pluslinkedinmail

One response on “Implementing Temporal Graphs with Apache TinkerPop and HGraphDB

  1. sindhu

    This blog help me to understand about hadoop concept and then hadoop and big data course will provide best platform to our career.
    keep updating some informative blog like this.

Leave a Reply

Your email address will not be published. Required fields are marked *

Prove you're human! *