The following post, by Sarah Cannon of Digital Reasoning, was originally published on that company’s blog. Digital Reasoning has graciously permitted us to republish it here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8, one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
Previously we had developed solutions to tackle each concern individually. Apache Pig scripts, Apache Hive queries, and Apache Hadoop streaming all provided unbounded access to Synthesys but with the caveat of extended run times, while our own query language implementation offered real-time results but was limited in scope. Was it possible to have it all? Cloudera’s Impala project offered a worthwhile investigation.
What is Impala?
From our friends at Cloudera: “Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop”. Built for performance, Impala uses in-memory data transfers with its native query engine, allowing users to issue SQL queries against HDFS and Apache HBase and receive results in seconds. Impala fits right into the Hadoop ecosystem, making it simple to get started. If you’ve used Hive, the learning curve is even gentler: Impala shares Hive’s metastore and supports many of the same functions. Luckily, our recent work with Hive left us perfectly poised to take advantage of Impala’s great features.
Laying the Groundwork
In our previous release, Synthesys 3.7, we introduced Synthesys-Hive integration. Hive is a data warehousing tool that allows users to query distributed systems with a SQL-like language that projects structure onto the data. Hive queries generate MapReduce jobs, allowing data analysts to perform ETL operations on data in a Hadoop environment in ways that previously required in-depth knowledge of the Java MapReduce API. Our integration efforts made it possible to query Synthesys Knowledge bases with Hive whether the backend was HBase, Apache Accumulo, or Apache Cassandra, and we provided several default tables, views, and UDFs (user-defined functions) for getting started. This was a great first step toward making the output of Synthesys more accessible and addressing the flexibility concerns mentioned above. Users could now use SQL to explore the power of concept resolution in the entities table, view the author, date, and summary of each document through the document_metadata table, or discover significant relationships in the assertions table. Once we got a Synthesys-Hive environment up and running, adding Impala was simple. All we had to do was issue a single Hive query to pull data from a view on the backend into a materialized table in HDFS, and we were up and running with Impala!
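The post does not show that single Hive query, so the following is only a minimal sketch of what such a statement could look like, assuming a standard HiveQL CREATE TABLE AS SELECT. The names entities_view and entities_impala are hypothetical, as is the helper function itself. Depending on the Impala version, the metadata may also need to be refreshed before Impala sees a table created through Hive.

```python
def build_materialize_statement(view_name: str, table_name: str) -> str:
    """Build a HiveQL CREATE TABLE AS SELECT statement.

    Copies every row of a backend-backed view into a new table stored
    in HDFS. Both names are hypothetical placeholders, not Synthesys names.
    """
    for name in (view_name, table_name):
        # Accept only plain identifiers (letters, digits, underscores).
        if not name.replace("_", "").isalnum():
            raise ValueError(f"not a plain identifier: {name!r}")
    return f"CREATE TABLE {table_name} AS SELECT * FROM {view_name}"


# Hypothetical view and table names, for illustration only.
statement = build_materialize_statement("entities_view", "entities_impala")
print(statement)
```

The statement would be submitted through Hive, which runs it as a MapReduce job; once the table exists in HDFS, later reads can go through Impala instead.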
Real-Time Results
Impala quickly proved to be a robust solution to the performance/flexibility quandary. Now we can do things like run fuzzy searches on concepts,
browse assertions with ease,
ask for corpus summary statistics such as “How many categorized elements do I have, by category?”,
and even answer questions like “With whom has President Obama met most often?”
All of the above was available before with Hive, but with extended wait times between queries. Now anyone with a little bit of SQL know-how can get answers to their questions in an instant.
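To give a flavor of the kinds of queries described in this section, here is a rough, self-contained sketch. It uses Python's built-in sqlite3 module as a local stand-in for Impala, and the column layouts of the entities and assertions tables, along with every row of data, are invented for illustration. The real Synthesys schemas and the Impala SQL dialect may differ (for example, Impala uses concat() rather than the || operator).

```python
import sqlite3

# In-memory SQLite database as a stand-in for Impala (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (name TEXT, category TEXT);
CREATE TABLE assertions (subject TEXT, verb TEXT, object TEXT);
""")

# Made-up sample rows; none of this is real Synthesys output.
conn.executemany("INSERT INTO entities VALUES (?, ?)", [
    ("Barack Obama", "person"),
    ("Person A", "person"),
    ("Person B", "person"),
    ("City X", "location"),
    ("Org Y", "organization"),
])
conn.executemany("INSERT INTO assertions VALUES (?, ?, ?)", [
    ("Barack Obama", "met", "Person A"),
    ("Barack Obama", "met", "Person B"),
    ("Barack Obama", "met", "Person A"),
    ("Person A", "visited", "City X"),
])

# Fuzzy search on concepts: case-insensitive substring match.
fuzzy = conn.execute(
    "SELECT name FROM entities "
    "WHERE lower(name) LIKE '%' || lower(?) || '%' ORDER BY name",
    ("obama",),
).fetchall()

# "How many categorized elements do I have, by category?"
by_category = conn.execute(
    "SELECT category, COUNT(*) FROM entities "
    "GROUP BY category ORDER BY category"
).fetchall()

# "With whom has President Obama met most often?"
met_most = conn.execute(
    "SELECT object, COUNT(*) AS meetings FROM assertions "
    "WHERE subject = ? AND verb = 'met' "
    "GROUP BY object ORDER BY meetings DESC, object LIMIT 1",
    ("Barack Obama",),
).fetchone()

print(fuzzy)
print(by_category)
print(met_most)
```

The queries themselves are ordinary filters and GROUP BY aggregations, which is the kind of SQL the post says anyone with a little know-how can now run interactively.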
Impala’s Role in Synthesys 3.8
The latest release of Synthesys introduced the concept of Knowledge Objects: an aggregation of the important information connected to an entity. The Knowledge Objects workflow automatically writes out several Impala tables that are ready for querying immediately after ingestion. Thanks to Impala’s ODBC and JDBC connectors, Synthesys Knowledge bases can connect to a number of third-party visualization tools such as Tableau, Centrifuge, and Zoomdata. In just a few short months, Impala went from a new concept for us here at Digital Reasoning to an invaluable tool solving pertinent problems.
We even started using it within Synthesys’ latest user experience, Glance (see screenshot below). Synthesys Glance offers a comprehensive and intuitive user experience for exploring the output of Synthesys in real time. Users can search for concepts and discover relevant details, recent news, and key relationships. Search results are displayed in easy-to-understand graphics and icons representing business entities, geopolitical entities, locations, and persons. Impala drives the Glance query layer, using the tables generated by the Knowledge Objects analytic.
Thanks to Impala, we found that Digital Reasoning really can “have it all” in terms of a flexible and fast query engine.