A New Python Client for Impala

Categories: Data Science, Impala

The new Python client for Impala will bring smiles to Pythonistas!

As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But anything that takes me out of my Python workflow is a hassle, so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, or transfer query results by reading and writing local files to disk.

To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard Impala protocol as the ODBC/JDBC drivers. This RPC library is then wrapped by the commonly-used database API specified in PEP 249 (“DB API v2.0”). Below you’ll find a quick tour of its functionality. Note this is still a 0.x.y release: the PEP 249 client is beta, while the sklearn and udf submodules are pre-alpha.


Impyla works with Python 2.6 and 2.7 (support for Python 3 is planned for the future). It depends minimally on the Thrift Python package. Install it with pip:
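For example, assuming pip is available:

```shell
pip install impyla
```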


or clone the repository.

Querying Impala from Python

Start by connecting to the Impala instance and getting a cursor. (Learn more about the API by checking out PEP 249.)
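A minimal connection looks like this (the hostname is a placeholder for your own impalad, and this assumes a running Impala instance):

```python
from impala.dbapi import connect

# connect to the HiveServer2 port of an impalad
conn = connect(host='my.impalad.example.com', port=21050)
cursor = conn.cursor()
```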


Note: make sure to set the port to the HS2 service, rather than the Beeswax service. The default in Cloudera-managed clusters is port 21050 for HS2. (The Impala shell defaults to port 21000, which is for Beeswax.)

You can now issue SQL queries to the impalad using the Cursor object.
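For example, using the cursor obtained above (the table name is illustrative):

```python
# run a query; the result set is held on the server until fetched
cursor.execute('SELECT * FROM my_table LIMIT 100')

# optionally tune how many rows are pulled per buffered fetch (PEP 249)
cursor.arraysize = 1000

results = cursor.fetchall()  # pulls the entire result set to the client
```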


After you run .execute(), Impala stores the result set on the server until it is fetched. Use the .fetchall() method to pull the entire result set over the network (though this could be a bad idea if the result set is large). Note that impyla pulls data from the server in a buffered fashion; the buffer size (in number of rows) can be controlled by setting the .arraysize property (as detailed in PEP 249).

For queries that have result sets, the resulting schema is available in the .description property.
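For example (again with an illustrative table name):

```python
cursor.execute('SELECT * FROM my_table LIMIT 100')

# a sequence of 7-item tuples, one per column:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)
print(cursor.description)

rows = cursor.fetchall()   # drains the server-side buffer
cursor.fetchall()          # a second call now returns an empty list
```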


Notice that fetching the data removes it from the buffer on the server, so a second call to .fetchall() returns an empty list.

Finally, the Cursor object supports Python’s iterator interface:
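For example:

```python
cursor.execute('SELECT * FROM my_table LIMIT 100')
for row in cursor:    # rows stream in, buffered batch by batch
    print(row)        # each row is a tuple of column values
```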


(The iterator interface still pulls data in a buffered fashion as described above.)

Streaming the result set over the network to your client may seem wasteful, but in practice it is not a problem: you would generally only do this for smaller result sets or for Impala-computed digests of your data set. If you need to store a large result, you can simply write it to HDFS with a CREATE TABLE AS SELECT statement instead.
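For example, a large result can be materialized on the cluster rather than streamed back (the table names and predicate are illustrative):

```python
cursor.execute('CREATE TABLE my_results AS '
               'SELECT * FROM my_table WHERE amount > 100')
```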

Getting Results as pandas DataFrames

In addition to iterating over the resulting tuples, you can easily convert the query results into a pandas DataFrame object (note that you need to have pandas installed for this to work):
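For example, using the as_pandas helper in impala.util:

```python
from impala.util import as_pandas

cursor.execute('SELECT * FROM my_table LIMIT 100')
df = as_pandas(cursor)   # fetches the result set into a pandas DataFrame
```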


Be careful with this method as well: it also fetches the entire result set into memory.

Preview of Functionality in Development

Two features that are currently in development are a scikit-learn API for in-database machine learning and a Python compiler to enable writing fast UDFs with Python. Note that this functionality is (pre)alpha.

With the introduction of UDFs in Impala, we have started to implement distributed machine learning algorithms to be executed by Impala. We have SGD-based implementations of SVMs, logistic regression, and linear regression implemented in C++ as Impala UDAs. These functions are wrapped, in turn, using the scikit-learn API. For example, here is how a session would look:
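A sketch of such a session (illustrative only: impala.sklearn is pre-alpha, and the names and signatures below are assumptions that may change):

```python
from impala.dbapi import connect
from impala.sklearn import LogisticRegression  # pre-alpha; name may change

conn = connect(host='my.impalad.example.com', port=21050)
cursor = conn.cursor()

# fit against data already stored in Impala; the SGD-based training
# runs on the cluster as UDFs/UDAs, not in the local Python process
model = LogisticRegression()
model.fit(cursor, table='training_data', label='y')  # hypothetical signature
```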


These functions are implemented in impala.sklearn, but are not supported yet.

One of the most exciting features in development is the Numba-based UDF compiler. Building UDFs for Impala currently requires writing C++ or Java code and registering them manually with the cluster. Writing C++/Java code is more difficult, time-consuming, and error-prone for many data analysts. We are working on making the Numba LLVM Python compiler compatible with Impala. This would allow you to work from Python, express your UDFs using Python’s simple syntax, and have them run as fast as if you wrote them in C++ to begin with. The module will also register your functions with Impala with a single function call. Here is an example:
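A sketch of what this looks like on the udf development branch (illustrative: the names, signatures, and paths here are assumptions and may change):

```python
from impala.dbapi import connect
from impala.udf import udf, ship_udf             # pre-alpha API
from impala.udf.types import FunctionContext, BigIntVal

# Numba compiles this Python function to LLVM IR for Impala to execute
@udf(BigIntVal(FunctionContext, BigIntVal))
def add_one(context, x):
    return x + 1

conn = connect(host='my.impalad.example.com', port=21050)
cursor = conn.cursor()

# compile, ship to HDFS, and register the function with Impala in one call
ship_udf(cursor, add_one, '/user/me/add_one.ll', 'my.impalad.example.com')

cursor.execute('SELECT add_one(my_col) FROM my_table')
```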


This functionality is being developed in the impala.udf submodule in the udf git branch. If this were done in C++, the function would be substantially more complex, would have to be compiled as a separate step, and would then need to be shipped and registered with Impala. The best part is that the performance of the function is as good as C++, because the Python interpreter is never involved on the execution side.

Stay tuned for more. And please provide feedback, in the form of use cases, bugs, or contributions!

Uri Laserson (@laserson) is a data scientist at Cloudera. Questions/comments are welcome (as are contributions to impyla).


13 responses on “A New Python Client for Impala”

  1. Randy Carnevale

    The pandas.DataFrame.to_sql() method will also let you push data into Impala if you use pandas’ legacy SQL support and the MySQL engine:

    import numpy as np
    import pandas as pd
    import impala.dbapi

    conn = impala.dbapi.connect(host='localhost')
    cursor = conn.cursor()

    cursor.execute("DROP TABLE IF EXISTS test_df")

    df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
    df.to_sql(name="test_df", con=conn, flavor="mysql")

    cursor.execute("SELECT * FROM test_df")

  2. Helge

    This is just awesome! Great work.

    There was some talk a while ago to make Pandas dataframe operations work out-of-core for big data. Not exactly sure what happened to that idea. Should work nicely with impala though. Any plans for work in that direction?

  3. Niels Stender

    Great work!

    I am running a Kerberos-enabled Hadoop cluster.

    Does impyla support Kerberos connections? I see a use_kerberos and a kerberos_service_name argument in impyla.dbapi.connect, but no guidelines as to how to use them.

    1. Uri Laserson (@laserson) Post author

      impyla does support Kerberos connections (I’ve heard). The connection machinery is the same as for the regular impala-shell, so use the docs for impala-shell as a guideline.

    1. Uri Laserson (@laserson) Post author

      Hi Tomas, the current database in use is part of the state of the connection. After you connect, and after you get a cursor, there will already be a default database in use. If you want to instead use the database my_db, you can run cur.execute("USE my_db") just like you would on the normal Impala shell. Alternatively, you can generally refer to tables in a fully-qualified fashion, like my_db.my_table.

    1. Uri Laserson (@laserson) Post author

      Hi Michael, so the impyla package will not bring any data back to python unless you explicitly request it. impyla sends the SQL queries to Impala on the Hadoop cluster, and all the computation then takes place on the cluster. The UDFs get executed by Impala on the cluster, as do the sklearn/MADlib functions (which are just UDFs/UDAs).

  4. Aleksandar

    I am using pandas 0.16.2, and the example

    df.to_sql(name="test_df", con=conn, index=False, if_exists='replace', flavor="mysql")

    will fail with:

    pandas.io.sql.DatabaseError: Execution failed on sql: SHOW TABLES LIKE %s
    AnalysisException: Syntax error in line 1:

    btw, the mysql flavor will be deprecated in favor of sqlalchemy …
    I was told that the backend on the Impala side is mysql.

    Does anyone have an idea if this is solvable…

  5. Murat


    I have VARCHAR2 fields in Impala tables, and I have string variables in my Python scripts; as you know, there is no varchar2 type in Python. So how can I insert a string variable into a varchar2 field? I get an error like the following:

    Possible loss of precision for target table 'test.tablo01'.
    Expression 'xxxrrrrttttttt' (type: STRING) would need to be cast to VARCHAR(128) for column 'varchar2column'


    1. Uri Laserson Post author

      Hi Murat,

      Would you mind redirecting your question to the impala-user mailing list? Also, could you supply the query that you ran and the schema of the table?

