Cloudera Engineering Blog · Impala Posts
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced to store 1 billion XML documents and make them accessible as well as searchable. I had few choices on a given shoestring budget: build something one my own (it was the rage back then—and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.
Impala authentication can now be handled by a combination of LDAP and Kerberos. Here’s why, and how.
Impala, the open source analytic database for Apache Hadoop, supports authentication—the act of proving you are who you say you are—using both Kerberos and LDAP. Kerberos has been supported since release 1.0, LDAP support was added more recently, and with CDH 5.2, you can use both at the same time.
This new feature, jointly developed by Cloudera and Intel engineers, makes management of role-based security much easier in Apache Hive, Impala, and Hue.
Apache Sentry (incubating) provides centralized authorization for services and applications in the Apache Hadoop ecosystem, allowing administrators to set up granular, role-based protection on resources, and to review them in one place. Previously, Sentry only designated administrators to
REVOKE privileges on an authorizable object. In Apache Sentry 1.5.0 (shipping inside CDH 5.2), we have implemented a new feature (SENTRY-327) that allows admin users to delegate the
GRANT privilege to other users using
WITH GRANT OPTION. If a user has the
GRANT OPTION privilege on a specific resource, the user can now grant the
GRANT privilege to other users on the same resource. Apache Hive, Impala, and Hue have all been updated to take advantage of this new Sentry functionality.
Impala 2.0 is the most SQL-complete/SQL-compatible release yet.
As we reported in the most recent roadmap update (“What’s Next for Impala: Focus on Advanced SQL Functionality”), more complete SQL functionality (and better SQL compatibility with other vendor extensions) is a major theme in Impala 2.0.
Getting Started with Impala (now in early release)—another book in the Hadoop ecosystem books canon—is indispensable for people who want to get familiar with Impala, the open source MPP query engine for Apache Hadoop. We spoke with its author, Impala docs writer John Russell, about the book’s origin and mission.
Why did you decide to write this book?
With 1.4, Impala’s performance lead over the SQL-on-Hadoop ecosystem gets wider, especially under multi-user load.
As noted in our recent post about the Impala 2.x roadmap (“What’s Next for Impala: Focus on Advanced SQL Functionality”), Impala’s ecosystem momentum continues to accelerate, with nearly 1 million downloads since the GA of 1.0, deployment by most of Cloudera’s enterprise data hub customers, and adoption by MapR, Amazon, and Oracle as a shipping product. Furthermore, in the past few months, independent sources such as IBM Research have confirmed that “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce- or Tez-based runtime.”
Our thanks to Melanie Imhof, Jonas Looser, Thierry Musy, and Kurt Stockinger of the Zurich University of Applied Science in Switzerland for the post below about their research into the query performance of Impala for mixed workloads.
Recently, we were approached by an industry partner to research and create a blueprint for a new Big Data, near real-time, query processing architecture that would replace its current architecture based on a popular open source database system.
Impala 2.0 will add much more complete SQL functionality to what is already the fastest SQL-on-Hadoop solution available.
In September 2013, we provided a roadmap for Impala — the open source MPP SQL query engine for Apache Hadoop, which was on release 1.1 at the time — that documented planned functionality through release 2.0 and beyond.
Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.
Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.
Impala continues to demonstrate performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.
In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data, not to produce a “faster Apache Hive.” An enterprise-quality BI experience requires low latency and high concurrency (among other things), so surpassing a well-known proprietary MPP DBMS in these areas was important evidence of progress.
In the past nine months, we’ve also all seen additional public validation that the original technical design for Hive, while effective for batch processing, was a dead-end for BI workloads. Recent examples have included the launch of Facebook’s Presto engine (Facebook was the inventor and world’s largest user of Hive), the emergence of Shark (Hive running on the Apache Spark DAG), and the “Stinger” initiative (Hive running on the Apache Tez [incubating] DAG).
Given the introduction of a number of new SQL-on-Hadoop implementations it seemed like a good time to do a roundup of the latest versions of each engine to see how they differ. We find that Impala maintains a significant performance advantage over the various other open source alternatives — ranging from 5x to 23x depending on the workload and the implementations that are compared. This advantage is due to some inherent design differences among the various systems, which we’ll explain below. Impala’s advantage is strongest for multi-user workloads, which arguably is the most relevant measure for users evaluating their options for BI use cases.