A New Python Client for Impala

Categories: Data Science Impala

The new Python client for Impala will bring smiles to Pythonistas!

As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But things that take me out of my Python workflow are generally considered hassles; so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, and/or transfer query results by reading/writing local files to disk.

To remedy this, we have written the (unpredictably named) impyla Python package (not officially supported). Impyla communicates with Impala using the same standard Impala protocol as the ODBC/JDBC drivers. This RPC library is then wrapped by the commonly-used database API specified in PEP 249 (“DB API v2.0”). Below you’ll find a quick tour of its functionality. Note this is still a 0.x.y release: the PEP 249 client is beta, while the sklearn and udf submodules are pre-alpha.

Installation

Impyla works with Python 2.6 and 2.7 (support for Python 3 is planned for the future). It depends minimally on the Thrift Python package. Install it with pip:

 

or clone the repository.

Querying Impala from Python

Start by connecting to the Impala instance and getting a cursor. (Learn more about the API by checking out PEP 249.

 

Note: make sure to set the port to the HS2 service, rather than the Beeswax service. The default in Cloudera-managed clusters is port 21050 for HS2. (The Impala shell defaults to port 21000, which is for Beeswax.)

You can now issue SQL queries to the impalad using the Cursor object.

 

After running .execute(), Impala will store the result sets on the server until it is fetched. Use the method .fetchall() to pull the entire result set over the network (though this could be a bad idea if the result set is large). Note that impyla pulls data from the server in a buffered fashion; the buffer size (in terms of number of rows) can be controlled by setting the .arraysize property (as detailed in PEP 249).

For queries that have result sets, the resulting schema is available in the .description property.

 

Notice that fetching the data removes it from the buffer on the server, so a second call to .fetchall() returns an empty list.

Finally, the Cursor object supports Python’s iterator interface:

 

(The iterator interface still pulls data in a buffered fashion as described above.)

While it may seem like a bad idea to simply stream the result set over the network to your client, this is not actually a problem. You would generally only do this for smaller result sets or for Impala-computed digests of your data set. If you need to store a large data set, you could simply write it to HDFS with a CREATE TABLE AS SELECT statement.

Getting Results as pandas DataFrames

In addition to iterating over the resulting tuples, you can easily convert the query results into a pandas DataFrame object (note that you need to have pandas installed for this to work):

 

Be careful with this as well as it also tries to fetch the entire data set.

Preview of Functionality in Development

Two features that are currently in development are a scikit-learn API for in- database machine learning and a Python compiler to enable writing fast UDFs with Python. Note that this functionality is (pre)alpha.

With the introduction of UDFs in Impala, we have started to implement distributed machine learning algorithms to be executed by Impala. We have SGD- based implementations of SVMs, logistic regression, and linear regression implemented in C++ as Impala UDAs. These functions are wrapped, in turn, using the scikit-learn API. For example, here is how a session would look:

 

These functions are implemented in impala.sklearn, but are not supported yet.

One of the most exciting features in development is the Numba-based UDF compiler. Building UDFs for Impala currently requires writing C++ or Java code and registering them manually with the cluster. Writing C++/Java code is more difficult, time-consuming, and error-prone for many data analysts. We are working on making the Numba LLVM Python compiler compatible with Impala. This would allow you to work from Python, express your UDFs using Python’s simple syntax, and have them run as fast as if you wrote them in C++ to begin with. The module will also register your functions with Impala with a single function call. Here is an example:

 

This functionality is being developed in the impala.udf submodule in the udf git branch. If this was done in C++, the function would be substantially more complex, would have to be compiled as a separate step, and shipped/registered with Impala. The best part is that the performance of the function is as good as C++, because the Python interpreter is never involved on the execution side.

Stay tuned for more. And please provide feedback, in the form of use cases, bugs, or contributions!

Uri Laserson (@laserson) is a data scientist at Cloudera. Questions/comments are welcome (as are contributions to impyla).

facebooktwittergoogle_pluslinkedinmailfacebooktwittergoogle_pluslinkedinmail

18 responses on “A New Python Client for Impala

  1. Randy Carnevale

    The pandas.DataFrame.to_sql() method will also let you push data into Impala if you use pandas’ legacy SQL support and the MySQL engine:

    conn = impala.dbapi.connect(host=’localhost’)
    cursor = conn.cursor()

    cursor.execute(“DROP TABLE IF EXISTS test_df”)

    df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=[‘a’, ‘b’, ‘c’, ‘d’])
    df.to_sql(name=”test_df”, con=conn, flavor=”mysql”)

    cursor.execute(“SELECT * FROM test_df”)
    impala.util.as_pandas(cursor)

  2. Helge

    This is just awesome! Great work.

    There was some talk a while ago to make Pandas dataframe operations work out-of-core for big data. Not exactly sure what happened to that idea. Should work nicely with impala though. Any plans for work in that direction?

  3. Niels Stender

    Great work!

    I am running a Kerberos-enabled Hadoop cluster.

    Does impyla support Kerberos connections? I see a use_kerberos and kerberos_service_name argument in impyla.dbapi.connect, but no guidelines as how to use it.

    1. Uri Laserson (@laserson) Post author

      impyla does support Kerberos connections (I’ve heard). The connection machinery is the same as for the regular impala-shell, so use the docs for impala-shell as a guideline.

    1. Uri Laserson (@laserson) Post author

      Hi Tomas, the current database in use is part of the state of the connection. After you connect, and after you get a cursor, there will already be a default database in use. If you want to instead use the database my_db, you can run cur.execute("USE my_db") just like you would on the normal Impala shell. Alternatively, you can generally refer to tables in a fully-qualified fashion, like my_db.my_table.

    1. Uri Laserson (@laserson) Post author

      Hi Michael, so the impyla package will not bring any data back to python unless you explicitly request it. impyla sends the SQL queries to Impala on the Hadoop cluster, and all the computation then takes place on the cluster. The UDFs get executed by Impala on the cluster, as do the sklearn/MADlib functions (which are just UDFs/UDAs).

  4. Aleksandar

    Hi,
    I am using pandas 0.16.2 and example for
    df.to_sql(name=”test_df”, con=conn, index=False, if_exists=’replace’, flavor=”mysql”)

    will fail with:
    pandas.io.sql.DatabaseError: Execution failed on sql: SHOW TABLES LIKE %s
    AnalysisException: Syntax error in line 1:
    SHOW TABLES LIKE %s

    btw, mysql flavor will be depreciated in favor of sqlalchemy …
    I was told that backed on the impala side is mysql.

    Anyone has idea if this is solvable…

  5. Murat

    Hello

    I have VARCHAR2 fileds in impala tables and I have string variables in python scripts as you know there is no varchar2 variable in python. So how can I insert string varible into varchar2 field. I got an error like in following line

    Possible loss of precision for target table ‘test.tablo01’.
    Expression ”xxxrrrrttttttt” (type: STRING) would need to be cast to VARCHAR(128) for column ‘varchar2column’

    Thanks

    1. Uri Laserson Post author

      Hi Murat,

      Would you mind redirecting your question to the impala-user mailing list? Also, could you supply the query that you ran and the schema of the table?

      Thanks!
      Uri

      1. Murat

        Hello again

        I found a solution, following type casting in sql statement works
        values (%f, %f, %f, %f, %f, %f, %f, %f, %f, %f, %f, cast(‘%s’ as VARCHAR(128)), ‘%s’)
        According to Impala documentation casting string to varchar is not supported but it works :)

        Thanks

  6. Murat

    Hello

    I have another issue :)

    I am using impyla 0.13.0.dev0 with python 2.6.6 on RHEL 6.6

    I am trying to insert into impala multiple records with executemany function
    My code snippet is in following lines

    listRow.append(listRowC)
    whlCount = whlCount+1 #end of while loop
    sqlNA=”insert into devices values (%d, ‘%s’, %d, %d, ‘%s’, ‘%s’, ‘%s’, %d, %d, ‘%s’, %d, %d, %d, ‘%s’, ‘%s’, ‘%s’, %d, ‘%s’, %d, %d, %d, %d, ‘%s’, ‘%s’, ‘%s’)”
    curNA.executemany (sqlNA, listRow)

    I am formatting listRowC variable as list before listRow.append function.

    When I run the code I get following error

    impala.error.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 13 or more, and there are 25 supplied.

    I also tried to print out *len(listRowC)* output if it’s not 25. So all the outputs 25 so that means I am passing 25 values in sqlNA string.

    How can I fix this problem ? Any suggestion ?

    Thanks

  7. Joe K

    Has anybody run into a ‘No protocol version header’ exception (which is TProtocolException(type=4))?

    I’m able to connect and query Hive with pyhs2 fine, but cannot get beyond the error above with Impyla. Full traceback below, any help appreciated!

    …………………………..
    (faro3) ~/tmp python impyla_test.py
    Traceback (most recent call last):
    File “impyla_test.py”, line 3, in
    cursor = conn.cursor()
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/impala/hiveserver2.py”, line 88, in cursor
    session = self.service.open_session(user, configuration)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/impala/hiveserver2.py”, line 798, in open_session
    resp = self._rpc(‘OpenSession’, req)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/impala/hiveserver2.py”, line 724, in _rpc
    response = self._execute(func_name, request)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/impala/hiveserver2.py”, line 741, in _execute
    return func(request)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/thriftpy/thrift.py”, line 159, in _req
    return self._recv(_api)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/thriftpy/thrift.py”, line 171, in _recv
    fname, mtype, rseqid = self._iprot.read_message_begin()
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/thriftpy/protocol/binary.py”, line 364, in read_message_begin
    self.trans, strict=self.strict_read)
    File “/Users/JoeK/dev/virtual_envs/faro3/lib/python3.4/site-packages/thriftpy/protocol/binary.py”, line 178, in read_message_begin
    message=’No protocol version header’)
    thriftpy.protocol.exc.TProtocolException: TProtocolException(type=4)

  8. sri

    Hi All,
    Question:- is the python code executing out of Hadoop cluster or is it executing inside Hadoop cluster, I am looking for a solution which lets me connect impala out of Hadoop cluster (Client ENV) and run quires and dump data out to client file system . I am having trouble connecting impala out of Hadoop cluster using JDBC drivers and kerberos authentication.
    Thanks
    Sri

  9. Chrys

    Hello,
    Is it possible to have a timeout for a query execution?
    To have the possibility to stop the execution if an x amount of seconds have passed and the query is not finished yet.
    Something like…
    conn = connect(host=host_name, port=host_port, database=database_name)
    cursor = conn.cursor()
    cursor.execute(‘SELECT * FROM database.table’,, timeout=3600 )

    Thanks
    Chrys

Leave a Reply

Your email address will not be published. Required fields are marked *