Cloudera Engineering Blog · Impala Posts
Impala continues to demonstrate performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.
In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data, not to produce a “faster Apache Hive.” An enterprise-quality BI experience requires low latency and high concurrency (among other things), so surpassing a well-known proprietary MPP DBMS in these areas was important evidence of progress.
In the past nine months, we’ve also all seen additional public validation that the original technical design for Hive, while effective for batch processing, was a dead-end for BI workloads. Recent examples have included the launch of Facebook’s Presto engine (Facebook was the inventor and world’s largest user of Hive), the emergence of Shark (Hive running on the Apache Spark DAG), and the “Stinger” initiative (Hive running on the Apache Tez [incubating] DAG).
Given the introduction of a number of new SQL-on-Hadoop implementations it seemed like a good time to do a roundup of the latest versions of each engine to see how they differ. We find that Impala maintains a significant performance advantage over the various other open source alternatives — ranging from 5x to 23x depending on the workload and the implementations that are compared. This advantage is due to some inherent design differences among the various systems, which we’ll explain below. Impala’s advantage is strongest for multi-user workloads, which arguably is the most relevant measure for users evaluating their options for BI use cases.
In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for columnar storage of Apache Hadoop data — with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, and Cascading (and forthcoming in Presto and Shark). Parquet adoption is also broad-based, with employees of the following companies (partial list) actively contributing:
Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.
In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:
Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.
It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.
The new Python client for Impala will bring smiles to Pythonistas!
As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But things that take me out of my Python workflow are generally considered hassles; so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, and/or transfer query results by reading/writing local files to disk.
Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!
Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data to Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.
The following post, by Sarah Cannon of Digital Reasoning, was originally published in that company’s blog. Digital Reasoning has graciously permitted us to re-publish here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8 one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
The CDH software stack lets you use your tool of choice with the Parquet file format – - offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
This quick demo illustrates how easy it is to implement role-based access and control in Impala using Sentry.
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.
Cloudera’s own enterprise data hub is yielding great results for providing world-class customer support.
Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.