How-to: Configure SAP HANA with Apache Impala (incubating)

Categories: How-to Impala

Combining HANA and Impala can unlock a variety of new use cases that span the full range of enterprise data. Here’s how to do it.

Information is growing at an exponential rate driven by enterprise applications and databases, and often takes the form of new types of data from sources such as social media, sensors, and mobile devices. Because it is not cost-effective to store and process all this information in an in-memory database, a hybrid architecture that includes Apache Hadoop can alleviate capacity challenges and resource utilization. Thus, a common pattern across SAP and Cloudera customers is the joining of "business data" (CRM, ERP, operational) in SAP HANA with "big data" (social data, historical data, images, text, and so on) in a Cloudera-powered enterprise data hub.

Working together, HANA and an EDH provide an architecture to facilitate use cases such as:

  • Simple IoT: Combining ERP data in HANA with pre-processed sensor data in Hadoop to perform basic IoT analysis (i.e. that can be expressed via SQL).
  • Customer insights: Combining customer interactions in your ERP or CRM, with larger data sets in Hadoop.
  • Advanced analytics: Pre-processing data in Hadoop via frameworks like Apache Spark and MLLib for consumption or combination with HANA data.
  • Active archive: Offloading larger, older, or colder datasets in HANA to Hadoop and make the data available for ad-hoc analysis.

Although HANA can integrate with either Apache Hive or Impala, the analytic database for Hadoop, Cloudera recommends using Impala in this situation (see Figure 1) because Impala’s MPP architecture meets the low latency/high concurrency requirements for doing BI/analytic queries. (See this post for recent information about Impala performance.) The integration is achieved via HANA’s Smart Data Access (SDA) feature, which brings in-memory computing to data virtualization in a manner similar to Cisco Data Virtualization (Composite).

hana-f1

Figure 1: HANA SDA connectivity to Impala

Furthermore, Cloudera, which offers comprehensive compliance-ready security across its platform, completed the SAP Technical Suitability Review (TSR) process, which certifies that Cloudera Enterprise is an SAP-supported integration for Impala and HANA SP9, in December 2015. Cloudera also completed the certification process for Impala and HANA SP9 in January 2016 on a cluster secured with MIT Kerberos authentication and Apache Sentry authorization.

In this remainder of this post, I will explain step-by-step how to integrate HANA with Impala to work together as a combined solution.

Setup and Configuration

Assuming an existing Cloudera Enterprise cluster with Impala services and HANA instances are running and that the HANA host has access to Impala daemons, configuring the integration is fairly straightforward

  1. Install the Impala ODBC driver on the HANA host.
  2. Configure the Impala data source.
  3. Create remote source and virtual tables using SAP HANA Studio; then test.

Next, let’s review these steps in detail.

Install and Configure the Impala ODBC Driver

The steps related to the installation and configuration should be performed on the HANA server. These instructions are targeted at SLES 11, but should be applicable with minor changes to other Linux distributions.

Install the Driver

  1. Install dependencies:
    1. sudo zypper install automake libtool gcc
  2. Download unixODBC:
    1. wget ftp ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz
  3. Untar and build:
    1. tar -xvf unixODBC-2.3.4.tar.gz
    2. cd unixODBC-2.3.4/
    3. ./configure
    4. make
    5. sudo make install
  4. Download and install Cloudera driver:
    1. cd ~/
    2. wget https://downloads.cloudera.com/connectors/impala_odbc_2.5.30.1011/Linux/SLES11/ClouderaImpalaODBC-2.5.30.1011-1.x86_64.rpm
    3. sudo zypper install ClouderaImpalaODBC-2.5.30.1011-1.x86_64.rpm

Configure the ODBC Driver

Once the ODBC driver is installed, the following provides an example of configuring and testing an ODBC connection to Impala. Note that this assumes use of UnixODBC as the driver manager, as detailed in Steps 2 and 3 above.

Create the .odbcinst.ini file

A .odbcinst.ini file should be created in the home directory of the hdbadm user. hdbadm is the super user that starts the HANA database/services. A template can be copied from the driver installation to use as a starting point:

$ cp /opt/cloudera/impalaodbc/Setup/odbcinst.ini ~/.odbcinst.ini

An example of the .odbcinst.ini is below:

Create the .odbc.ini File

A .odbc.ini file should also be created in the home directory of the hdbadm user. An example is below. Note that this is for connecting to a non-secure CDH cluster:

The following are changes for a cluster secured with Kerberos:

Update the cloudera.impalaodbc.ini file

Modify the /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini to point to the right drivers, for example:

Update .customer.sh with paths

Add LD_LIBRARY_PATH with the UnixODBC library path if not already done to the .customer.sh file in the hdbadm home directory, as well as the Impala library location, .odbc.ini, nad .odbcinst.ini file paths:

Test the Connection

To test the connection, the isql utility, which is part of UnixODBC, can be used:

Configure HANA Studio

Once installed, HANA Studio needs to be configured to connect to the CDH cluster and use Impala tables in queries. The following provides an example of this using the ODBC configuration detailed above.

  1. Login to HANA Studio.
  2. Select provisioning -> remote sources ->
  3. Choose adapter HADOOP ODBC ; DSN – CLIMPALA; credential – Technical user, admin/admin

SAP supports the ability to create technical user credential (1 credential attached to 1 data source for ALL HANA Users) and secondary credential (1 credential attached to 1 HANA user and 1 data source). Please refer to SAP documentation for additional details on the security model and syntax to create credentials.

hana-f2

Create virtual tables using HANA Studio:

hana-f3

Choose the schema name, change the table name, and so on:

hana-f4

hana-f5

Alternatively, create virtual tables using the command prompt; if the schema is not specified, it will create them in the default one:

CREATE VIRTUAL TABLE customer AT "CLIMPALA"."IMPALA"."tpcds_parquet"."customer”;   

This following command will create a table inthe IMPALA HANA schema.

CREATE VIRTUAL TABLE IMPALA.customer AT "CLIMPALA"."IMPALA"."tpcds_parquet"."customer";

Test Cases and Queries

The following provides an overview of test cases used to validate the integration of HANA with Impala. Note that a number of queries which are part of the Impala TPC-DS test kit were used in testing; the full set of queries used are omitted here as they replicate the functionality being tested. 

All of these tests were run from HANA Studio.

Basic Integration

A simple query to validate the integration of HANA with Impala:

SELECT COUNT(*) FROM "IMPALA"."CUSTOMER"

hana-f6

You can notice from the execution time how fast the results are returned

Multi-Way Impala Join

A more complex multi-way join in Impala to demonstrate running a complex Impala query with HANA.

hana-f7

Join of Impala Data with HANA Data

This multi-way join between data in Impala and data in HANA demonstrates support for joining data between the two systems. Note that this join requires creating a copy of the ITEM table in HANA, which you can do via the HANA CREATE TABLE LIKE syntax. For example:

hana-f8

Congratulations, you have now combined HANA and a Cloudera-powered EDH into a single analytic solution that spans structured and unstructured data! If you happen to be attending SAPPHIRENOW 2016 in Orlando this week, stop by the Cloudera exhibit to learn more.

Sreedhar Bolneni is a Partner Sales Engineer at Cloudera.

Facebooktwittergoogle_pluslinkedinmailFacebooktwittergoogle_pluslinkedinmail

4 responses on “How-to: Configure SAP HANA with Apache Impala (incubating)

  1. Srinivas Totapally

    Great Blog Sreedhar. How do you compare the performance of HANA integration with Hadoop on Hive vs Impala?

    Thanks,
    Srinivas

  2. Wenjie Wu

    Thanks for sharing the block, it helps a lot. I followed the steps and it works with isql from command, but when I configure from HANA Studio, received the below error;
    com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [403]: internal error: Cannot get remote source objects: [unixODBC][

    Do you know what could be the possible cause of the issue?

    thanks
    Wenjie

  3. chandrakala

    Hi Can you help me that what is data base name do i need to mention in odbc.ini for HANA sda with cloudera.

    Also can you help me with steps for Cloudera hive to connect to HANA SDA.

    Thanks
    Chandrakala

  4. Srinivas Totapally

    We have successfully connected and validated the connection from HANA to a Hadoop cluster using Impala, Hive and the Spark Controller. Here is what I suggest you all:
    Please ensure the port 21050 is open from your HANA system to the Hadoop cluster for Impala connectivity.
    Please ensure the port 10000 is open from your HANA system to the Hadoop cluster for Hive connectivity.
    Please ensure the ports 7860, 7861, 56000-58000 are open from your HANA system to the Hadoop cluster to connect to the Spark Controller.
    You can do a basic telnet connectivity test from your HANA system to the hadoop nodes to ensure the respective services/daemons
    are running and listening on.
    Thanks,
    Srinivas Totapally