How-to: Do Statistical Analysis with Impala and R

The new RImpala package brings the speed and interactivity of Impala to queries from R.

Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.

As is well known, Apache Hadoop traditionally relies on the MapReduce paradigm for parallel processing, which is an excellent programming model for batch-oriented workloads. But when ad hoc, interactive querying is required, the batch model fails to meet performance expectations due to its inherent latency.

To overcome this drawback, Cloudera introduced Cloudera Impala, the open source distributed SQL query engine for Hadoop data. Impala brings the necessary speed to queries that were otherwise not interactive when executed by the batch Apache Hive engine; Hive queries that used to take minutes can be executed in a matter of seconds using Impala.

Impala is quite exciting for us at Mu Sigma because existing Hive queries can run interactively with few or no changes. Furthermore, because we do a lot of our statistical computing on R, the popular open source statistical computing language, we considered it worthwhile to bring the speed of Impala to R.

To meet that goal, we have created a new R package, RImpala, which connects Impala to R. RImpala enables querying the data residing in HDFS and Apache HBase from R, which can be further processed as an R object using R functions. RImpala is now available for download from the Comprehensive R Archive Network (CRAN) under GNU General Public License (GPL3).

The RImpala architecture is simple: we used the existing Impala JDBC drivers and wrote a Java program to connect and query Impala, which we then called from R using the rJava package. We put them all together in an R package that you can use to easily query Impala from R.

Steps for Installing RImpala

Assuming that you have R and Impala already installed, installing the RImpala package is straightforward and is done in a manner similar to any other R package. There are two steps to installing RImpala and getting it working:

Step 1: Install the package from CRAN

You can install RImpala directly using the install.packages() command in R.

 

Alternatively, if you need to do offline installation of the package, you can download it from here and install using the R CMD INSTALL command:

 

Step 2: Install the Impala JDBC drivers

You need to install Cloudera’s JDBC drivers before you can use the RImpala package that we installed earlier. Cloudera provides JBDC jars on its website that you can download directly. As of this writing, this is the link to zip file containing the JDBC jars.

There are two ways to do this:

  1. If you have Impala installed on the machine running R, then you will have the necessary JDBC jars already (probably in /usr/lib/impala/lib) and you can use them to initiate the connection to Impala.
  2. If the machine running R is a different server than the Impala server, then you need to download the JDBC jars from the above link and extract it to a location that can be accessed by the R user.

After you have installed the JDBC drivers you can start using the RImpala package:

  1. Load the library.

     

  2. Initialize the JDBC jars.

     

  3. Connect to Impala.

     

    The following is an Rscript showing how to connect to Impala:

     

    Location of JDBC jars = /tmp/impala/jars

    IP of the server running impalad service = 192.168.10.1

    Port where the impalad service is listening = 21050

The default parameter for the rimpala.init() function is “/usr/lib/impala/lib” and the default parameters for rimpala.connect() function are “localhost” and “21050” respectively.

To run a query on the impalad instance that the client has connected, you can use the rimpala.query() function. Example:

 

All the contents of the sample_table will be stored in the result object as a data frame. This data frame can now be used for further analytical processing in R.

You can also install the RImpala package on a client machine running Microsoft Windows. Since the JDBC jars are platform independent, you can extract them into a folder on a Windows machine (such as “C:\Program Files\impala”) and then this location can be passed as a parameter to the rimpala.init() function.

The following a simple example that shows you how to use RImpala:

 

Conclusion

Impala is an exciting new technology that is gaining popularity and will probably grow to be an enterprise asset in the Hadoop world. We hope that RImpala will be a fruitful package for all Big Data analysts to leverage the power of Impala from R.

Impala is an ongoing and thriving effort at Cloudera and will continue to evolve with richer functionality and improved performance – and so will RImpala. We will continue to improve the package over time and incorporate new features into RImpala as and when they are made available in Impala.  

Austin Chungath is a Senior Research Analyst with Mu Sigma’s Innovation & Development Team and maintainer of the RImpala project. He does research on various tools in the Hadoop ecosystem and the possibilities that they bring for analytics. He spends his free time contributing to Open Source projects like Apache Tez or building small robots.

Sachin Sudarshana is a Research Analyst with Mu Sigma’s Innovation & Development Team. His responsibilities include researching emerging tools in the Hadoop ecosystem and how they can be leveraged in an analytics context.

Vikas Raguttahalli is a Research Lead with Mu Sigma’s Innovation & Development Team. He is responsible for working with client delivery teams and helping clients institutionalize Big Data within their organizations, as well as researching new and upcoming Big Data tools. His expertise includes R, MapReduce, Hive, Pig, Mahout and the wider Hadoop ecosystem.

Filed under:

27 Responses
  • Kostiantyn / December 16, 2013 / 10:53 AM

    As I see rimpala.connect doesn’t have possibility to pass login/password. How can this Impala connection be secured?

  • svkerr / December 16, 2013 / 10:14 PM

    This does not solve the R memory bound problem?

  • Austin Chungath / December 17, 2013 / 6:58 AM

    Hi Kostiantyn,
    We are working on creating a secure connection from RImpala. We will let you know when it is done. Please feel free to suggest any more features that you might need.
    Thanks.

  • Kamlesh Kshirsagar / December 19, 2013 / 4:30 AM

    Hello,

    I tried the steps above. I get following error:

    > rimpala.connect(“localhost”,”21050″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.lang.NoClassDefFoundError: org.apache.hive.jdbc.HiveConnection

    Please can you help me.

    Regards,
    Kamlesh

  • Austin Chungath / December 19, 2013 / 6:14 PM

    Kamlesh,

    Extract the jars from https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip

    to say /tmp/impala-jdbc

    then start R, assuming that impala is running on 127.0.0.1 at 21050 the following should work:

    library(RImpala)
    rimpala.init(libs=”/tmp/impala-jdbc/”)
    rimpala.connect(IP=”127.0.0.1″,port=”21050″)

  • Austin Chungath / December 19, 2013 / 11:09 PM

    Svkerr,
    The results of your query from Impala that you load into R through the RImpala package will still be bound by how much memory R has at its disposal.

  • Kamlesh Kshirsagar / January 10, 2014 / 6:45 AM

    Hello Austin,

    I followed the steps you mentioned. I am still getting an error.

    > library(RImpala)
    > rimpala.init(libs=”/tmp/impala-jdbc/”)
    [1] “Classpath added successfully”
    > rimpala.connect(IP=”127.0.0.1″,port=”21050″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.lang.NoClassDefFoundError: org.apache.hive.jdbc.HiveConnection

    Please could you help me on this.

    Regards,

    Kamlesh

  • Austin Chungath / January 14, 2014 / 12:46 AM

    As discussed in our conversation over mail. Connecting to your VM from an R session outside the VM solved the problem. Not sure why the NoClassDefFoundError happens from the R session inside the VM – must be some Java environment settings gone wrong.

    To install RImpala on Windows:
    Download R and install : http://cran.r-project.org/bin/windows/base/
    Download RStudio and install: http://www.rstudio.com/ide/download/

    Copy and extract this https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip to a location on your file system say C:\impala-jar

    > library(RImpala)
    Loading required package: rJava
    > rimpala.init(libs=”C:\\impala-jar”)
    [1] “Classpath added successfully”
    > rimpala.connect(IP=”192.168.204.164″,port=”21050″)
    [1] TRUE
    > rimpala.invalidate()
    > rimpala.showdatabases()
    name
    1 default

  • Kiran Kaipa / January 15, 2014 / 11:03 PM

    Hi,
    Firstly, I am excited to use this library and trying to get started.
    I am setting up Impala connection using my windows pc version R and RStudio.
    When I had to initialize I had to point to individual jars in the extracted folder for class path. e.g. rimpala.init(libs=”./impala-jar//slf4j-log4j12-1.6.1.jar”)
    for all the jars.
    I get [1] “Classpath added successfully” for all jars.
    Next when I am trying to use rimpala.connect() to connect to my impala server but it returns a NULL error and FALSE message. Can you please help?
    Here is the o/p:
    rimpala.connect(“BISOLHYD3″,”21050″)
    Error: null
    [1] FALSE
    rimpala.connect(“10.136.225.34″,”21050″)
    Error: null
    [1] FALSE

  • Slaton Lipscomb / January 28, 2014 / 7:51 PM

    No problem connecting to Impala, or running commands like showtables, describe, etc. I can run some queries.

    However I am seeing “java.sql.SQLException: Method not supported” for any query I attempt that changes a column name using SELECT .. AS ..

    Example:
    rimpala.query(“SELECT foo AS bar FROM my_table LIMIT 20″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.sql.SQLException: Method not supported

    Is this a limit in Cloudera’s current JDBC driver?

  • Austin Chungath / January 29, 2014 / 2:32 AM

    Hi Kiran,
    It’s nice to hear from you.
    The RImpala code requires all the jdbc jars to be in a directory and you need to give the path to the directory in the init function.
    Adding one jar at a time will not work as the init function expects a directory. But init function still says “Classpath added successfully” even when you are adding one jar at a time. Thanks for bringing this to my notice.

  • kamal / February 13, 2014 / 11:18 PM

    No problem connecting to Impala, or running commands like showtables, describe, etc. I can run some queries.

    However I am seeing “java.sql.SQLException: Method not supported” for any query I attempt that changes a column name using SELECT .. AS ..

    Example:
    rimpala.query(“SELECT foo AS bar FROM my_table LIMIT 20″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.sql.SQLException: Method not supported

    I have added all the Required jar file in directory then also getting the same error

  • Austin Chungath / February 25, 2014 / 3:07 AM

    For those who are interested – the source is available at https://github.com/austincv/RImpala

  • Vinodh / March 06, 2014 / 10:51 PM

    Hi,
    I am able to connect. But when I execute a query I am getting following error
    df=rimpala.query(“select id,name from test1″)
    Error in colnames<-(*tmp*, value = c("id", "name")) :
    'names' attribute [2] must be the same length as the vector [1]
    If I put only one field in select query it works.
    Appreciate help

  • Sachin Sudarshana / March 18, 2014 / 11:16 PM

    HI Vinodh,

    I’m not able to replicate the error that you’ve mentioned.
    Could you please check if the table does contain 2 columns?
    If possible, share a sample of your dataset and I will try to resolve the issue.

    Thank you,
    Sachin

  • Vinodh S / March 20, 2014 / 11:58 PM

    Hi Sachin,
    Thanks for looking into this.
    The table Test in Impala is having the structure
    id name
    1 vinodh
    I am using query select * from test.
    Error is same with any query done on any table which fetches more than one column.
    I have checked the source file RImpala.R. The error is in rimpala.query in line colnames(onlyData)=colNames.
    If I select only one column it works fine.
    Hope this helps
    Thanks,
    Vinodh S

  • Eugene Shen / May 07, 2014 / 1:25 AM

    Hi everyone,

    We got a “java.sql.SQLException: Method not supported” error when a query returned a NULL value for a bigint field. I traced the problem to RImpala.java Line 339 (rs.getObject(i) returns null, which causes the toString() to fail).

    Now, the easiest fix for us is to simply add a “where … is not null” to the query, but that might not work for everyone. A simple fix to RImpala.java would be to change this line:

    dynamicRow[i-1]= rs.getObject(i).toString();

    to:

    Object o = rs.getObject(i);
    dynamicRow[i-1] = o == null ? null : o.toString();

    … which I’ve verified works in Java. I’ll leave it up to you guys to decide if you want to roll this “fix” into the next release (since there may be other considerations), but I figure I’d leave this here in case others run into this problem in the future.

    Thanks for the excellent library!
    Eugene.

  • Eugene / May 19, 2014 / 2:22 PM

    Hi,

    I have the same problem as Vinodh. If I select 1 column from the table it works, but does not work for more that 1 column. res <- rimpala.query("select * from hbase_tool where id = 'ABC'"). Error in colnames<-(*tmp*, value = c("id", "tool")) :
    'names' attribute [2] must be the same length as the vector [1].

    Thanks,
    Eugene

  • Austin Chungath / May 28, 2014 / 1:24 AM

    Hi Eugene,

    I believe we have fixed this issue in the current release. I have just published the 0.1.3 release on CRAN. (Thanks to Takekatsu Hiramura for the patch). Can you please verify if your problem is fixed in the current release?
    It was a simple error in the code when the result set had only one row of data and it is not based on the number of columns.

  • Austin Chungath / May 28, 2014 / 1:39 AM

    Hi everyone,

    I don’t check this thread that often. If you also mail the developers from the package DESCRIPTION directly, you can expect a faster response.

    Also, Eugene – Thanks for the suggestion on the NULL value exception. In the next release, I will try to accommodate a parameter in the rimpala.query function that will let the code substitute NULL values with something else, if the user wants that behavior.
    Thanks,
    Austin

  • James Sharrett / June 14, 2014 / 8:17 AM

    I’m seeing an issue present in 0.1.3 when connecting to the default database in impala and querying the sample_07 table. I can return all columns of data except for the ‘salary’ column. It’s an INT like the ‘total_emp’ column which comes back ok.

    This works:
    rimpala.query(“select code,description,total_emp from sample_07″)

    This doesn’t
    rimpala.query(“select code,description,total_emp,salary from sample_07″)

    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.sql.SQLException: Method not supported

    I get the same error on any query that includes ‘salary’

  • Abhay Godbole / June 16, 2014 / 10:11 PM

    Hi,
    I am interacting with Vikas, but thought of putting the query to this forum as well.

    I have extracted the JDBC zip file in /usr/lib/impala/lib and I also got the message after init function call
    [1] “Classpath added successfully”

    Further I have also checked the org.apache.hive.jdbc.HiveConnection.class file in the jar file, it is there.

    From the error message it is clear that it is related to classpath related issue, I also checked if my environment variable CLASSPATH is pointing to any other path, but it is not set, Where the rimpala.init method sets the classpath?

    I have followed all the steps mentioned in the link that you have provided, but still it is not working and getting following error

    > library(RImpala)
    Loading required package: rJava
    > rimpala.init(libs=”/usr/lib/impala/lib”)
    [1] “Classpath added successfully”
    > rimpala.connect(IP=”127.0.0.1″,port=”21050″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.lang.NoClassDefFoundError: org.apache.hive.jdbc.HiveConnection
    >

    Please help

  • Benjamin Kim / July 17, 2014 / 10:35 AM

    I installed R onto one of the client gateway nodes of our cluster. The node has CentOS 6.5 as its OS and has Java 7, Cloudera Manager 4.8.0 installed and has CDH 4.6.0 and Impala 1.3.1 parcel distributions there. I then went on to install RImpala per the instructions above and ran the commands as follows:

    > library(RImpala)
    Loading required package: rJava
    > rimpala.init(libs=”/opt/cloudera/parcels/IMPALA/lib/impala/lib”)
    [1] “Classpath added successfully”
    > rimpala.connect(“mint-dn6.lax.adconion.com”,”25010″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.lang.NoClassDefFoundError: org.apache.hive.jdbc.HiveDriver

    Am I doing something wrong here?

    Please let me know how to get this resolved.

    By the way, this will be a wonderful win for our data scientists. They would no longer have to go against the files directly underlying the hive tables.

    Thanks,
    Ben

  • Jason Donnald / August 21, 2014 / 1:18 PM

    Does the rimpala.query() not work with sql queries like insert into? I tried inserting but always got error as the sql query does not return any resultset

  • Sairaam Varadarajan / October 13, 2014 / 6:55 PM

    > rimpala.connect(“localhost”,”25010″)
    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    java.lang.NoClassDefFoundError: org.apache.hive.jdbc.HiveDriver

    I am getting an error as this one.

    Please let me know how to get this resolved

  • Yogesh / October 20, 2014 / 3:27 AM

    Hello,
    I tried posting this query but did not seem to get through so trying again. Apologies if it’s a repeat.

    On a Hadoop cluster, we have written a R program which uses R-Impala, gets data from Impala, does some processing and writes data to Hbase. The program runs fine when run as a standalone on nodes.
    We are now trying to run the same program in parallel on multiple nodes i.e. the same R program will run at the same time on multiple nodes. The data being fetched and processed by each node is different. However, all R programs fail and R-Impala stops working.
    Is there any configuration change which is required to make this work i.e. allow the R code to access Impala from multiple nodes? Perhaps use a different port from each node? Any help will be appreciated.

    Thanks in advance!

  • Yogesh / October 20, 2014 / 3:46 AM

    In continuation to my query above, I’m also posting the Error which we get.
    Note – a) We are using the same port 21050 on all nodes to access Impala.
    b) The Impala query from R usually gets over in a minute so we’ve tried increasing the timeout limit to 4 mins but that has not worked either.

    Error – “Error: ScannerTimeoutException: 65206ms passed since the last invocation, timeout is currently set to 60000
    CAUSED BY: UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: No lease for -7349509338182188705: lease ‘-7349509338182188705′ does not exist
    at org.apache.hadoop.hbase.regionserver.HRegionServer.internalNext(HRegionServer.java:2640)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2609)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:323)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)

    Error: RetriesExhaustedException: Failed after attempts=3, exceptions:
    Mon Oct 20 13:12:19 IST 2014, org.apache.hadoop.hbase.client.ScannerCallable@7f301db0, java.io.IOException: java.net.SocketTimeoutException: Call to DP-DL0214-066/10.10.28.196:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 3000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.10.28.196:33359 remote=DP-DL0214-066/10.10.28.196:60020]
    Mon Oct 20 13:12:20 IST 2014, org.apache.hadoop.hbase.client.ScannerCallable@7f301db0, java.io.IOException: java.io.IOException: Call to DP-DL0214-066/10.10.28.196:60020 failed on local exception: java.io.IOException: Unexpected exception receiving call responses
    Mon Oct 20 13:12:24 IST 2014, org.apache.hadoop.hbase.client.ScannerCallable@7f301db0, java.io.IOException: java.io.IOException: Call to DP-DL0214-066/10.10.28.196:60020 failed on local exception: org.apache.hadoop.hbase.ipc.HBaseClient$CallTimeoutException: Call id=144708, waitTime=3030, rpcTimetout=3000″

Leave a comment


three − 2 =