Text Mining with Impala


Thanks to Torsten Kilias and Alexander Löser of the Beuth University of Applied Sciences in Berlin for the following guest post about their INDREX project and its integration with Impala for integrated management of textual and relational data.

Textual data is a core source of information in the enterprise. Example demands arise from sales departments (monitor and identify leads), human resources (identify professionals with capabilities in ‘xyz’), market research (campaign monitoring from the social web), product development (incorporate feedback from customers), and the medical domain (anamnesis).

In this post, we describe In-Database Relation Extraction (INDREX), a system that transforms text data into relational data with Impala (the open source analytic database for Apache Hadoop) and keeps the overhead for extraction, linking, and organization low. Read this paper for complete details about our approach and our implementation.

Introduction to INDREX

Currently, mature systems exist either for managing relational data (such as Impala) or for extracting information from text (such as the Stanford CoreNLP). As a result, the user must ship data between systems. Moreover, transforming textual data into a relational representation requires “glue” and development time to bind different system landscapes and data models seamlessly. Finally, domain-specific relation extraction is an iterative task. It requires the user to continuously adapt extraction rules and semantics in both the extraction system and the database system. As a result, many projects that combine textual data with existing relational data are likely to fail or become infeasible.

In contrast, INDREX is a single system for managing textual and relational data together. The figure below shows two possible system configurations for INDREX; both read and transform textual data into generic relations in an ETL approach and store results in base tables in the Parquet file format. Once generic relations are loaded, the INDREX user defines queries on base tables to extract higher-level semantics or to join them with other relational data.

The first configuration imitates state-of-the-art batch processing systems for text mining, such as Stanford CoreNLP, IBM’s System-T, or GATE. These systems use HDFS for integrating text data and loading relational data from an RDBMS. The user poses queries against HDFS-based data with proprietary query languages or by writing transformation rules with user-defined functions in languages like Pig Latin, Java, or JAQL.

The second configuration uses Impala and INDREX. This approach permits users to describe relation extraction tasks, such as linking entities in documents to relational data, within a single system and with SQL. To provide this functionality, INDREX extends Impala with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relations. As a result:

  • The user can use structured data from tables in Impala to adapt extraction rules to the target domain.
  • No additional system is needed for relation extraction.
  • INDREX can benefit from the full power of Impala’s built-in indexing, query optimization, and security model.

Figure 1 shows the initial ETL (Steps 1, 2, and 3) and compares two implementations. Both setups read text data from HDFS sequence files (Step 1) and run the same base linguistic operations (Step 2). The system writes the output, generic domain-independent relations, to highly compressed Parquet files in HDFS (Step 3). The user then refines these generic relations into domain-specific relations in an iterative process. One implementation uses MapReduce on Hadoop (Step 4.1) and writes results into HDFS sequence files (Step 5.1); the user analyzes the result from HDFS (Step 6) and may refine the query. INDREX also permits a much faster approach in Impala (Step 4.2), where the user inspects results (Step 5.2) and refines the query. This approach benefits from Impala’s optimizations for query workflows.

Data Model

Each document, such as an email or a web page, comprises one or more sequences of characters. For instance, we denote the interval of characters that makes up the previous sentence with the type document:sentence. We call a sequence of characters with an attached semantic meaning a span. Our data model is based on spans and permits the user (or the application) to map multiple spans to an n-ary relation. For example, the relation PersonCareerAge(Person, Position, Age) requires mapping spans to three attribute values. Our model is flexible enough to hold additional important structures for common relation extraction tasks, such as extracting document structures, shallow or deep syntactic structures, or structures for resolving entities within and across documents (see details in our paper).
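The span-based model can be sketched in a few lines of Python. This is only a minimal illustration, not the INDREX schema: the class names `Span` and `Annotation`, their fields, the helper `span_of`, the span type labels other than document:sentence, and the sample sentence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    """A character interval in a document plus a semantic type (hypothetical layout)."""
    doc_id: int
    begin: int      # inclusive character offset
    end: int        # exclusive character offset
    span_type: str  # e.g. "document:sentence" or "entity:person"

    def text(self, document: str) -> str:
        return document[self.begin:self.end]


@dataclass(frozen=True)
class Annotation:
    """One instance of an n-ary relation: one span per attribute."""
    relation: str
    spans: Tuple[Span, ...]


def span_of(doc_id: int, document: str, phrase: str, span_type: str) -> Span:
    """Locate the first occurrence of `phrase` and wrap it as a span."""
    begin = document.index(phrase)
    return Span(doc_id, begin, begin + len(phrase), span_type)


# Made-up sample sentence, not taken from the corpus.
DOCUMENT = "Jane Doe, 42, is a database engineer."

person = span_of(1, DOCUMENT, "Jane Doe", "entity:person")
position = span_of(1, DOCUMENT, "database engineer", "entity:position")
age = span_of(1, DOCUMENT, "42", "entity:age")

# PersonCareerAge(Person, Position, Age) maps three spans to three attributes.
person_career_age = Annotation("PersonCareerAge", (person, position, age))
values = tuple(s.text(DOCUMENT) for s in person_career_age.spans)
print(values)
```

Keeping only offsets in a span, rather than copies of the text, is one plausible design choice: several annotations can then overlap on the same characters without duplicating the document.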

In practice, either a batch-based ETL process or an ad hoc query may generate such spans.

The diagram above illustrates an example corpus of three documents. Each document consists of a single sentence. The figure shows the output after common language-specific annotations, such as tokenization, part-of-speech (POS) tagging, phrase chunking, dependencies, constituencies (between the phrases), OIE-relationship candidates, entities, and semantic relationships. Each horizontal bar represents an annotation with one span, and each arrow represents an annotation with multiple spans.

The semantic relationship PersonCareerAge is an annotation with three spans. The spans “Torsten Kilias” in document 26 and “Torsten” in document 27 are connected by a cross-document co-reference annotation. The query in the bottom-right corner shows a join between a table that contains the positions of persons in an organization and spans about relations of these persons that have been extracted from the three documents shown in the diagram.

Joining Text with Tables and Extracting Domain-Dependent Relations

Domain-dependent operations combine the base features described above with existing entities from structured data (also called entity linkage). The user may add domain-specific rules for extracting relationship types, attribute types, or attribute value ranges. In INDREX, these domain-specific operations are part of an iterative query process and are based on SQL. (Recall the example corpus, in which the user joins spans with persons and their positions from an existing table.)

The example query below in Impala demonstrates the extraction of the binary relationship type PersonCareer(Person, Position). After some initial training, writing these queries is as simple as writing SQL queries in Impala.
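As a rough illustration of entity linkage through a join (not the authors' actual query), the following sketch joins extracted person spans with an existing table of positions. It runs against an in-memory SQLite database as a stand-in for Impala, and the tables `spans` and `staff`, their columns, and all rows are hypothetical.

```python
import sqlite3

# SQLite stands in for Impala here; schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table of extracted spans (not the INDREX schema).
    CREATE TABLE spans (doc_id INTEGER, span_type TEXT, span_text TEXT);
    -- Hypothetical existing relational table with known positions.
    CREATE TABLE staff (person TEXT, job_title TEXT);

    INSERT INTO spans VALUES
        (1, 'entity:person', 'Jane Doe'),
        (2, 'entity:person', 'John Roe'),
        (2, 'entity:organization', 'Example Corp');
    INSERT INTO staff VALUES
        ('Jane Doe', 'database engineer'),
        ('Max Mustermann', 'analyst');
""")

# Link person spans from text to known persons and project
# PersonCareer(Person, Position) tuples.
person_career = conn.execute("""
    SELECT s.doc_id, s.span_text AS person, t.job_title AS position
    FROM spans AS s
    JOIN staff AS t ON t.person = s.span_text
    WHERE s.span_type = 'entity:person'
    ORDER BY s.doc_id
""").fetchall()

print(person_career)
```

Only persons that appear both in the text and in the table survive the inner join, which is the behavior an entity-linkage step of this kind would typically want.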

Aggregation and Group By Queries from Text Data

INDREX supports aggregations, such as MIN(), MAX(), SUM(), or AVG(), directly from text data and within Impala.

The query above shows the process of extracting a person’s age. In our simple example, the age is attached to the person as an apposition, set off by a comma. The query applies span and consolidation operators (explained in our paper) and finally shows a distribution of age information for persons appearing in the text.
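The apposition heuristic and the aggregation over its result can be sketched as follows. This is an assumption-laden illustration rather than the INDREX query: SQLite stands in for Impala, the `spans` table and its offsets are invented, and the apposition is approximated by requiring that a number span starts exactly two characters after a person span ends (the room taken by a comma and a space).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical span table with character offsets (not the INDREX schema).
    CREATE TABLE spans (
        doc_id INTEGER, span_begin INTEGER, span_end INTEGER,
        span_type TEXT, span_text TEXT
    );
    INSERT INTO spans VALUES
        (1,  0,  8, 'entity:person', 'Jane Doe'),   -- "Jane Doe, 42, ..."
        (1, 10, 12, 'number',        '42'),
        (2,  0,  8, 'entity:person', 'John Roe'),   -- "John Roe, 42, ..."
        (2, 10, 12, 'number',        '42'),
        (3,  0,  9, 'entity:person', 'Erika Roe'),  -- "Erika Roe, 35, ..."
        (3, 11, 13, 'number',        '35'),
        (4,  0,  7, 'entity:person', 'Max Moe'),    -- "Max Moe met 12 clients."
        (4, 12, 14, 'number',        '12');
""")

# Nested query: pair each person span with a number span that follows it
# after a gap of exactly two characters within the same document.
PERSON_AGE = """
    SELECT p.span_text AS person, CAST(n.span_text AS INTEGER) AS age
    FROM spans AS p
    JOIN spans AS n ON n.doc_id = p.doc_id
    WHERE p.span_type = 'entity:person'
      AND n.span_type = 'number'
      AND n.span_begin = p.span_end + 2
"""

# Outer query: distribution of persons per extracted age.
distribution = conn.execute(
    f"SELECT age, COUNT(*) FROM ({PERSON_AGE}) AS pa GROUP BY age ORDER BY age"
).fetchall()

# Plain aggregates over the same extracted relation.
youngest, oldest = conn.execute(
    f"SELECT MIN(age), MAX(age) FROM ({PERSON_AGE}) AS pa"
).fetchone()

print(distribution, youngest, oldest)
```

The fourth document shows why the offset condition matters: its number is not adjacent to the person, so it is not counted as an age.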

The query below shows a simple person-age extractor. In addition, it provides a distribution for age-group information in a corpus. The outer SELECT statement computes the aggregation for each age-group observation, grouped and ordered by age. The nested SELECT statement projects two strings: the person and the age from conditions in the WHERE clause that implement the heuristic of a syntactic apposition, in this case represented by the pattern


Experimental Results

As our dataset, we chose the Reuters Corpus Volume 1 (RCV1), a standard evaluation corpus in the information retrieval community. It contains approximately 800k news articles from 1997 and shows the Zipfian distributions characteristic of text data and derived annotations: only a few relations appear frequently across many documents, while most relations appear in only a few documents. (See our work here for a detailed analysis.) Querying such typical distributions for text data requires Impala’s optimizations for selective queries, parallel query execution, and data partitioning schemes. Overall, the raw corpus is approximately 2.5GB in size.

Our generic relation extraction stack created more than 2,500 annotations per document on average, or roughly 2 billion annotations for our 800k documents. After annotating the corpus with the Stanford CoreNLP pipeline and ClausIE, the annotated corpus reached a size of approximately 107GB. Overall, our linguistic base annotations increase the raw data by a factor of roughly 40, that is, between one and two orders of magnitude.
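The figures above can be checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted in this section, so the results are approximations; the variable names are ours.

```python
import math

# Rounded figures quoted in the text.
documents = 800_000
annotations_per_document = 2_500  # stated as "more than", so a lower bound
raw_gb = 2.5
annotated_gb = 107

total_annotations = documents * annotations_per_document
growth_factor = annotated_gb / raw_gb
orders_of_magnitude = math.log10(growth_factor)

print(f"{total_annotations:,} annotations (lower bound)")
print(f"growth factor: {growth_factor:.1f}x, "
      f"about {orders_of_magnitude:.2f} orders of magnitude")
```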

The testing compared two configurations, INDREX on Apache Pig and INDREX on Impala:

  • INDREX on Hadoop + Pig
    We ran Hadoop 2.3.0 and Pig 0.12 on a cluster with nine nodes. Each node has 24 AMD cores at 2.4 GHz, 256GB RAM, and 24 disks. Overall, we had 9 x 24 cores available. The cluster runs the Ubuntu 12.0 operating system and CDH 5.1. We assigned Hadoop up to 200 cores for Map tasks and 100 cores for Reduce tasks. Each task consumes up to 8GB RAM. Reduce tasks start once 90 percent of the Map tasks are complete.
  • INDREX on Impala
    We tested Impala version 1.2.3 on the same machines, where it can also use up to 200 cores. Impala uses a pipeline-based query processing approach and allocates main memory only on demand, for example to build in-memory hash tables that store intermediate results of an aggregation query. Profiling Impala revealed that the system did not use more than 10GB during query execution. We implemented INDREX on top of Impala as so-called macros, which are basically white-box UDF implementations.
  • Queries and measurements
    We investigated (1) whether INDREX permits the user to execute common query scenarios for text mining tasks, and (2) how INDREX performs for common query types. Our set of 27 benchmark queries included typical operations for text mining workflows, such as point and range selections at various attribute selectivities, local joins within the same sentence or the same document, joins with external domain data, and queries using user-defined table-generating functions (UDTs) or user-defined aggregation functions (UDAs), as well as global aggregation queries across individual documents. (See queries and details in the appendix of this paper.)

The outcome is that Impala outperforms Hadoop/Pig by nearly two orders of magnitude for text mining query workflows.

The left chart above shows query execution times for both systems over all queries. We observe a similar runtime for Pig Latin across all queries. Each query in Pig reads tuples from disk. Furthermore, Pig always executes a full table scan over these tuples and, by design, always stores query results back to the distributed file system.

For Impala, however, we observed query execution times that were nearly two orders of magnitude faster, because Impala applies various optimizations that also benefit our iterative text mining workflows. To understand this performance, see the details for each individual operation.

The right chart compares the aggregated runtime for individual operations, such as complex and local queries (including local joins), joins with external data, aggregations, and selections.


In a business setting, there are significant costs for information extraction, including the labor cost of developing or adapting extractors for a particular business problem, and the throughput required by the system. Our extensive evaluations show that INDREX can return results for most text mining queries within a few seconds in Impala. Most operations on text data are embarrassingly parallel, such as local and join operations on a single sentence or operations on a single document.

Future work on INDREX will include approaches that drastically simplify writing queries, such as semi-supervised learning approaches for entity-linkage tasks. Other directions are instant query refinements and instant results; see also our slide deck about INDREX.

INDREX is a project of the DATEXIS.COM research group of the Beuth University of Applied Sciences Berlin, Germany. The team conducts research at the intersection of Database Systems and Text-based Information Systems. Alexander Löser leads the DATEXIS group. His previous positions include HP Labs Bristol, the IBM Almaden Research Center, and the research division of SAP AG. Torsten Kilias is a second-year PhD student with Alexander and works in the area of scalable in-database text mining.