A New Python Client for Impala
The new Python client for Impala will bring smiles to Pythonistas!
As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But things that take me out of my Python workflow are generally considered hassles; so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, and/or transfer query results by reading/writing local files to disk.
To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard protocol as the ODBC/JDBC drivers. This RPC library is then wrapped by the commonly used database API specified in PEP 249 ("DB API v2.0"). Below you'll find a quick tour of its functionality. Note that this is still a 0.x.y release: the PEP 249 client is beta, while the udf submodule is pre-alpha.
Impyla works with Python 2.6 and 2.7 (support for Python 3 is planned for the future). It depends minimally on the Thrift Python package. Install it with
$ [sudo] pip install impyla
or clone the repository.
Querying Impala from Python
Start by connecting to the Impala instance and getting a cursor. (Learn more about the API by checking out PEP 249.)
>>> from impala.dbapi import connect
>>> conn = connect(host='my.impala.host', port=21050)
>>> cur = conn.cursor()
Note: make sure to set the port to the HS2 service, rather than the Beeswax service. The default in Cloudera-managed clusters is port 21050 for HS2. (The Impala shell defaults to port 21000, which is for Beeswax.)
You can now issue SQL queries to the impalad using the cursor's .execute() method:
>>> cur.execute('SHOW TABLES')
>>> cur.fetchall()
[('answers',), ('cand',), ('diff',), ('merged',), ('merged2',), ('patients',),
 ('patients2',), ('procedures',), ('procedures2',), ('score2',), ('score3',),
 ('suspicious',)]
>>> cur.execute('DESCRIBE patients')
>>> cur.fetchall()
[('id', 'string', ''), ('gender', 'string', ''), ('age', 'string', ''), ('label', 'string', '')]
After calling .execute(), Impala stores the result set on the server until it is fetched. Use the .fetchall() method to pull the entire result set over the network (though this could be a bad idea if the result set is large). Note that impyla pulls data from the server in a buffered fashion; the buffer size (in number of rows) can be controlled by setting the .arraysize property (as detailed in PEP 249).
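The buffered-fetch pattern is just PEP 249's .fetchmany(), which pulls at most .arraysize rows per call. Here is a minimal sketch of streaming a result set batch by batch; the in-memory stub cursor and its data are illustrative stand-ins for impyla's real cursor, which would pull each batch over the network from the impalad:

```python
# Minimal stand-in for a PEP 249 cursor (illustrative, not impyla's).
class StubCursor(object):
    def __init__(self, rows):
        self._rows = list(rows)
        self.arraysize = 2  # rows pulled per fetchmany() call

    def fetchmany(self, size=None):
        # Return (and consume) at most `size` rows, per PEP 249.
        size = size or self.arraysize
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

cur = StubCursor([('a',), ('b',), ('c',), ('d',), ('e',)])

# Stream the result set batch by batch instead of calling fetchall():
batches = []
while True:
    batch = cur.fetchmany()
    if not batch:
        break
    batches.append(batch)

print(batches)  # three batches: two of size 2, one of size 1
```

This keeps the client's memory footprint bounded by .arraysize rows at a time, regardless of the total result size.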
For queries that have result sets, the resulting schema is available in the cursor's .description attribute:
>>> cur.execute('SELECT * FROM patients LIMIT 5')
>>> cur.description
[('id', 'STRING_TYPE', None, None, None, None, None),
 ('gender', 'STRING_TYPE', None, None, None, None, None),
 ('age', 'STRING_TYPE', None, None, None, None, None),
 ('label', 'STRING_TYPE', None, None, None, None, None)]
>>> cur.fetchall()
[('000000003', 'F', '45-64', '0t'), ('000000399', 'M', '<18', '0t'),
 ('000000531', 'M', '18-44', '0t'), ('000001194', 'F', '45-64', '0t'),
 ('000001884', 'F', '<18', '0t')]
>>> cur.fetchall()
[]
Notice that fetching the data removes it from the buffer on the server, so a second call to .fetchall() returns an empty list. The Cursor object also supports Python's iterator interface:
>>> cur.execute('SELECT * FROM patients LIMIT 5')
>>> for row in cur:
...     print row[1] == 'M'
...
True
True
False
False
False
(The iterator interface still pulls data in a buffered fashion as described above.)
While it may seem like a bad idea to simply stream the result set over the network to your client, this is not actually a problem. You would generally only do this for smaller result sets or for Impala-computed digests of your data set. If you need to store a large result set, you can simply write it to HDFS with a CREATE TABLE AS SELECT statement.
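For example, persisting a large result to HDFS instead of streaming it might look like the following. The table name and predicate are hypothetical, and the stub cursor below only records the SQL that impyla's real cursor would send to the impalad:

```python
# Stub that records the SQL it receives (stand-in for impyla's cursor).
class StubCursor(object):
    def __init__(self):
        self.last_query = None

    def execute(self, query):
        self.last_query = query

cur = StubCursor()

# CTAS: the result lands in HDFS; no rows cross the network to the client.
ctas = ("CREATE TABLE patients_over_65 AS "
        "SELECT * FROM patients WHERE age = '65+'")
cur.execute(ctas)
print(cur.last_query)
```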
Getting Results as a pandas DataFrame
In addition to iterating over the resulting tuples, you can easily convert the query results into a pandas DataFrame object (note that you need to have pandas installed for this to work):
>>> from impala.util import as_pandas
>>> cur.execute('SELECT * FROM patients LIMIT 5')
>>> df = as_pandas(cur)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> df
          id gender    age label
0  000001517      M  45-64    0t
1  000001814      F  45-64    0t
2  000002312      F  18-44    0t
3  000002381      F  18-44    0t
4  000002705      M    <18    0t
Be careful with this as well: it also pulls the entire result set into memory.
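The conversion itself is straightforward to do by hand from the cursor's rows and .description, roughly along these lines (a sketch; the description tuples and sample rows are copied from the session above):

```python
import pandas as pd

# PEP 249 description tuples (first element is the column name)
# and rows, as a cursor would return them.
description = [('id',), ('gender',), ('age',), ('label',)]
rows = [('000001517', 'M', '45-64', '0t'),
        ('000001814', 'F', '45-64', '0t')]

# Build the DataFrame with column names taken from the schema.
df = pd.DataFrame(rows, columns=[col[0] for col in description])
print(df.shape)           # (2, 4)
print(list(df.columns))   # ['id', 'gender', 'age', 'label']
```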
Preview of Functionality in Development
Two features currently in development are a scikit-learn API for in-database machine learning and a Python compiler to enable writing fast UDFs with Python. Note that this functionality is (pre)alpha.
With the introduction of UDFs in Impala, we have started to implement distributed machine learning algorithms to be executed by Impala. We have SGD-based implementations of SVMs, logistic regression, and linear regression, implemented in C++ as Impala UDAs. These functions are wrapped, in turn, with the scikit-learn API. For example, here is how a session would look:
from impala.sklearn import LogisticRegression

data_query = 'SELECT * FROM test_logr'
label_col = 'label'

impala_logr = LogisticRegression()
impala_logr.fit(cur, data_query, label_col)
process(impala_logr.coef_)
These functions are implemented in impala.sklearn, but are not supported yet.
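The wrapper pattern itself — a fit() method that issues SQL invoking the in-database UDA and then reads the fitted coefficients back through the cursor — can be sketched as follows. Everything here is illustrative: the class, the logr_fit function name, and the SQL shape are stand-ins, not impyla's actual implementation:

```python
class StubCursor(object):
    """Stand-in for an impyla cursor; pretends the in-database
    fit returned these coefficients (values are made up)."""
    def __init__(self, coefs):
        self._coefs = coefs
        self.queries = []

    def execute(self, query):
        self.queries.append(query)

    def fetchall(self):
        return [self._coefs]

class LogisticRegressionSketch(object):
    """Illustrative sketch of a scikit-learn-style wrapper."""
    def fit(self, cursor, data_query, label_col):
        # In reality, this would invoke the SGD-based UDA inside Impala;
        # 'logr_fit' is a hypothetical aggregate function name.
        cursor.execute(
            "SELECT logr_fit(%s) FROM (%s) t" % (label_col, data_query))
        self.coef_ = cursor.fetchall()[0]
        return self

cur = StubCursor((0.5, -1.2, 0.3))
model = LogisticRegressionSketch().fit(
    cur, 'SELECT * FROM test_logr', 'label')
print(model.coef_)  # (0.5, -1.2, 0.3)
```

The appeal of the design is that the data never leaves the cluster: only the query and the fitted coefficients cross the network.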
One of the most exciting features in development is the Numba-based UDF compiler. Building UDFs for Impala currently requires writing C++ or Java code and registering them manually with the cluster. Writing C++/Java code is more difficult, time-consuming, and error-prone for many data analysts. We are working on making the Numba LLVM Python compiler compatible with Impala. This would allow you to work from Python, express your UDFs using Python’s simple syntax, and have them run as fast as if you wrote them in C++ to begin with. The module will also register your functions with Impala with a single function call. Here is an example:
from impala.udf import udf, ship_udf
from string import split

@udf(StringVal(FunctionContext, StringVal))
def hour_from_weird_data_format(context, date):
    return split(date, '-')

ship_udf(cur, hour_from_weird_data_format, '/path/to/store/udf.ll', 'my.impala.host')
cur.execute('SELECT hour_from_weird_data_format(date) AS hour FROM mytable LIMIT 100')
cur.fetchall()
This functionality is being developed in the impala.udf submodule on the udf git branch. If this were done in C++, the function would be substantially more complex, would have to be compiled as a separate step, and then shipped and registered with Impala. The best part is that the performance of the function is as good as C++, because the Python interpreter is never involved on the execution side.
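A nice side effect of compiling ordinary Python is that the body of such a UDF can be tested as plain Python before shipping it to the cluster. Assuming a hypothetical date format whose final '-'-separated field is the hour (the format and sample value below are made up for illustration), the core logic is just:

```python
def hour_from_weird_data_format(date):
    # Plain-Python version of the UDF body; the assumed
    # 'YYYY-MM-DD-HH' layout is hypothetical.
    return date.split('-')[-1]

print(hour_from_weird_data_format('2014-03-11-14'))  # '14'
```

Once the logic checks out locally, the decorated version can be compiled and registered with the cluster.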
Stay tuned for more. And please provide feedback, in the form of use cases, bugs, or contributions!
Uri Laserson (@laserson) is a data scientist at Cloudera. Questions/comments are welcome (as are contributions to impyla).