Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science Hive Impala Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

We are excited to release this effort, dubbed hs2client, as a new Apache-licensed open source project on GitHub. As you may guess from the name, this library implement the HiveServer2 Thrift API as a C++ library, with careful handling of result sets to allow languages like Python to access data at high performance.

The initial alpha preview release contains:

  • A C++ library, libhs2client, which provides a clean C++ API for the HiveServer2 Thrift API. This can be built and dynamically or statically linked in C/C++ applications with no direct exposure to Apache Thrift.
  • Python bindings, with optimized reads to pandas.DataFrame

To try out the library, you can install a dev build of the project right now with conda:

Optimizing for Speed and Memory Use

The Python bindings for libhs2client are careful about performance and memory in a few key ways:

  • Converting row batches directly into pandas-compatible NumPy arrays
  • Care with categorical data: string data is deduplicated while it’s being converted for pandas. In the future, it would be straightforward to add an option to return all string data as pandas.Categorical arrays.

hs2client Demo and Simple Benchmark

The initial API is oriented at modeling the HiveServer2 protocol closely. Connecting to a cluster yields a Service instance:

To execute queries, you open a session then use its execute method.

Queries are asynchronous by default.

Result sets can be fetched currently to pandas.DataFrame objects:

Compared with comparable database clients, we’ve optimized the fetch path to have excellent performance (IO speeds) and memory use.

This benchmark is being run on localhost, but it shows we can move over 110MB/s through the Thrift protocol and convert to a fully-formed pandas DataFrame.

hs2client Compared with impyla

In Ibis and other Python projects, we have been using impyla to execute queries and access result sets. impyla either uses Apache Thrift’s official Python implementation (on Python 2) or thriftpy (on Python 3) for interacting with the Impala or Hive Thrift service. Because of this, the main difference is the performance in fetching large result sets.

Here, I will perform an equivalent fetch using impyla via Ibis:

In this particular use case with an all-numeric table (which is ideal for deserialization performance), the speed difference is only 2x or a bit more.

hs2client Roadmap

The C++ library does not yet implement some important features that are needed to be a drop-in replacement for impyla or other Hive or Impala drivers:

  • SSL transport (with certificate verification)
  • SASL Thrift transport for secure (i.e. Kerberos) clusters, or insecure clusters using SASL
  • Note that many Hive clusters, even unsecured ones, use the SASL transport, so using the library with Hive right now may be a challenge until that is completed.

On the Python side, we must implement a DB API 2.0 compatibility layer, since currently data can only be fetched to pandas, not Python tuples as with most Python database drivers.

We of course welcome contributions from the community to build out some of these features.

Wes McKinney is a Software Engineer at Cloudera. He is the creator of Python’s pandas library and the Ibis project, a committer to Apache Arrow and Apache Parquet, and the author of Python for Data Analysis

Marcel Kornacker is the architect of Impala and its tech lead.

Matt Jacobs and Thomas Tauber-Marshall are Software Engineers working on Cloudera’s Impala team.

facebooktwittergoogle_pluslinkedinmailfacebooktwittergoogle_pluslinkedinmail

2 responses on “Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Leave a Reply

Your email address will not be published. Required fields are marked *