A New Python Client for Impala

The new Python client for Impala will bring smiles to Pythonistas!

As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But anything that takes me out of my Python workflow is a hassle, so it’s annoying that my main options for working with Impala are writing shell scripts, using the Impala shell, and/or transferring query results by reading and writing local files to disk.

To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard protocol as the ODBC/JDBC drivers. This RPC library is then wrapped in the commonly used database API specified in PEP 249 (“DB API v2.0”). Below you’ll find a quick tour of its functionality. Note that this is still a 0.x.y release: the PEP 249 client is beta, while the sklearn and udf submodules are pre-alpha.

Installation

Impyla works with Python 2.6 and 2.7 (Python 3 support is planned). It depends minimally on the Thrift Python package. Install it with pip:

 
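pip install impyla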

or clone the repository.

Querying Impala from Python

Start by connecting to the Impala instance and getting a cursor. (Learn more about the API by checking out PEP 249.)

 
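A minimal example (the hostname here is a placeholder for your own impalad):

from impala.dbapi import connect

conn = connect(host='my.impalad.example.com', port=21050)
cursor = conn.cursor()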

Note: make sure to set the port to the HS2 service, rather than the Beeswax service. The default in Cloudera-managed clusters is port 21050 for HS2. (The Impala shell defaults to port 21000, which is for Beeswax.)

You can now issue SQL queries to the impalad using the Cursor object.

 
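For example (the table name is hypothetical):

cursor.execute('SELECT * FROM my_table LIMIT 100')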

After running .execute(), Impala stores the result set on the server until it is fetched. Use the .fetchall() method to pull the entire result set over the network (though this could be a bad idea if the result set is large). Note that impyla pulls data from the server in a buffered fashion; the buffer size (in number of rows) can be controlled by setting the .arraysize property (as detailed in PEP 249).
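For example (the buffer size shown is arbitrary):

cursor.arraysize = 10000        # optional: rows pulled per round trip
results = cursor.fetchall()     # pulls the entire result set to the client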

For queries that have result sets, the resulting schema is available in the .description property.

 
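For example, using the same hypothetical table:

cursor.execute('SELECT * FROM my_table LIMIT 100')
print(cursor.description)    # one 7-tuple per column, as specified in PEP 249
results = cursor.fetchall()
print(cursor.fetchall())     # returns [] (see below)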

Notice that fetching the data removes it from the buffer on the server, so a second call to .fetchall() returns an empty list.

Finally, the Cursor object supports Python’s iterator interface:

 
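For example:

cursor.execute('SELECT * FROM my_table LIMIT 100')
for row in cursor:
    print(row)    # each row is a tuple of column values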

(The iterator interface still pulls data in a buffered fashion as described above.)

While it may seem like a bad idea to simply stream the result set over the network to your client, this is not actually a problem: you would generally only do this for smaller result sets or for Impala-computed digests (such as aggregates) of your data set. If you need to store a large result set, write it to HDFS with a CREATE TABLE AS SELECT statement instead.
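For example, a sketch with hypothetical table and column names:

cursor.execute('CREATE TABLE my_digest AS '
               'SELECT col_a, COUNT(*) AS cnt FROM my_large_table GROUP BY col_a')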

Getting Results as pandas DataFrames

In addition to iterating over the resulting tuples, you can easily convert the query results into a pandas DataFrame object (note that you need to have pandas installed for this to work):

 
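For example:

from impala.util import as_pandas

cursor.execute('SELECT * FROM my_table LIMIT 100')
df = as_pandas(cursor)    # fetches the entire result set into a DataFrame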

Be careful with this as well, as it also fetches the entire result set into client memory.

Preview of Functionality in Development

Two features that are currently in development are a scikit-learn API for in-database machine learning and a Python compiler to enable writing fast UDFs with Python. Note that this functionality is (pre)alpha.

With the introduction of UDFs in Impala, we have started to implement distributed machine learning algorithms to be executed by Impala. We have SGD-based implementations of SVMs, logistic regression, and linear regression, written in C++ as Impala UDAs. These functions are wrapped, in turn, with the scikit-learn API. For example, here is how a session would look:

 
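Here is a rough sketch of what such a session might look like. Because impala.sklearn is pre-alpha, the estimator name, import path, and fit/predict signatures below are illustrative assumptions rather than a stable API:

# Illustrative sketch only: the impala.sklearn API is pre-alpha and may differ
from impala.sklearn import LogisticRegression    # hypothetical import path

model = LogisticRegression()
# training runs as Impala UDAs on the cluster; no data is pulled to the client
model.fit(cursor, 'training_table', label='label_col')    # assumed signature
predictions = model.predict(cursor, 'test_table')         # assumed signature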

These functions are implemented in impala.sklearn, but are not supported yet.

One of the most exciting features in development is the Numba-based UDF compiler. Building UDFs for Impala currently requires writing C++ or Java code and registering the functions manually with the cluster. For many data analysts, writing C++/Java code is difficult, time-consuming, and error-prone. We are working on making the Numba LLVM-based Python compiler compatible with Impala. This would allow you to work from Python, express your UDFs using Python’s simple syntax, and have them run as fast as if you had written them in C++ to begin with. The module will also register your functions with Impala with a single function call. Here is an example:

 
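Here is a sketch of what that might look like. Since this work lives on a pre-alpha branch, the decorator, the type names, and especially the ship_udf call below are assumptions about the in-progress API and may change:

# Pre-alpha sketch: names and signatures may change as the udf branch evolves
from impala.udf import udf, ship_udf             # work-in-progress submodule
from impala.udf.types import FunctionContext, BigIntVal

# every Impala UDF takes a FunctionContext as its first argument;
# the decorator declares the Impala type signature (return type, then args)
@udf(BigIntVal(FunctionContext, BigIntVal))
def add_two(context, x):
    return x + 2

# compile with Numba and register the function with Impala in one call
# (cursor comes from the earlier connect() example; exact signature assumed)
ship_udf(cursor, add_two)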

This functionality is being developed in the impala.udf submodule on the udf git branch. If this were done in C++, the function would be substantially more complex, would have to be compiled as a separate step, and would have to be shipped and registered with Impala. The best part is that the performance of the function is as good as C++, because the Python interpreter is never involved on the execution side.

Stay tuned for more. And please provide feedback, in the form of use cases, bugs, or contributions!

Uri Laserson (@laserson) is a data scientist at Cloudera. Questions/comments are welcome (as are contributions to impyla).

10 Responses
  • Randy Carnevale / May 09, 2014 / 8:51 AM

    The pandas.DataFrame.to_sql() method will also let you push data into Impala if you use pandas’ legacy SQL support and the MySQL engine:

    import numpy as np
    import pandas as pd
    import impala.dbapi
    import impala.util

    conn = impala.dbapi.connect(host='localhost')
    cursor = conn.cursor()

    cursor.execute("DROP TABLE IF EXISTS test_df")

    df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
    df.to_sql(name="test_df", con=conn, flavor="mysql")

    cursor.execute("SELECT * FROM test_df")
    impala.util.as_pandas(cursor)

  • Helge / May 11, 2014 / 11:58 PM

    This is just awesome! Great work.

    There was some talk a while ago to make Pandas dataframe operations work out-of-core for big data. Not exactly sure what happened to that idea. Should work nicely with impala though. Any plans for work in that direction?

  • Niels Stender / July 31, 2014 / 6:34 AM

    Great work!

    I am running a Kerberos-enabled Hadoop cluster.

    Does impyla support Kerberos connections? I see use_kerberos and kerberos_service_name arguments in impyla.dbapi.connect, but no guidelines on how to use them.

    • Uri Laserson (@laserson) / July 31, 2014 / 9:45 AM

      impyla does support Kerberos connections (I’ve heard). The connection machinery is the same as for the regular impala-shell, so use the docs for impala-shell as a guideline.

  • Tomas / August 08, 2014 / 2:31 PM

    How do I choose the DB to query? Looks like the “connect” function does not take a ‘database’ argument.

    • Uri Laserson (@laserson) / August 08, 2014 / 5:10 PM

      Hi Tomas, the current database in use is part of the state of the connection. After you connect, and after you get a cursor, there will already be a default database in use. If you want to instead use the database my_db, you can run cur.execute("USE my_db") just like you would on the normal Impala shell. Alternatively, you can generally refer to tables in a fully-qualified fashion, like my_db.my_table.

  • Tomas / August 08, 2014 / 6:10 PM

    Thanks for the reply, that works. (Now I see that ‘database’ is *optional* in PEP 249.)

  • Michael / September 09, 2014 / 9:16 AM

    This brings the data to python, which will in general not scale. How does one execute python in the context of the data in Cloudera?

    • Uri Laserson (@laserson) / September 09, 2014 / 11:34 AM

      Hi Michael, so the impyla package will not bring any data back to Python unless you explicitly request it. impyla sends the SQL queries to Impala on the Hadoop cluster, and all the computation then takes place on the cluster. The UDFs get executed by Impala on the cluster, as do the sklearn/MADlib functions (which are just UDFs/UDAs).

  • JS / September 24, 2014 / 10:19 PM

    Just came across this post – thank you for making Impala so easy to use via Python. This solves some HUGE use cases.
