Cloudera Engineering Blog · Impala Posts
Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data-driven organizations overall, security administrators and compliance officers still have lingering questions about how to enable end-users on existing Hadoop infrastructure without compromising security or compliance requirements.
While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
Editor’s note (added Feb. 2, 2014): You can review the latest (and exciting) Impala performance benchmark results by Cloudera here.
In the presentation below, Scott Leberknight of Near Infinity has done such a good and thorough job of dissecting Cloudera Impala that we want to share it with you here.
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and in the latest release there are two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
Defining Realistic Test Scenarios
Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below:
At Alteryx we are excited about the release of Cloudera Impala. The impact on Big Data Analytics is that the ability to perform real-time queries on Apache Hadoop will provide faster access and results. This is applicable to our customers, the business users who are running analytics to get access to data, perform analytics, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.
“Are data warehouses becoming victims of their own success?”, Tony Baer asks in a recent blog post:
Our thanks to Ted Wasserman, product manager for Tableau, for the guest post below:
Many of our customers are turning to Apache Hadoop as they grapple with their big data challenges. Hadoop offers many benefits such as its scalability, economics, and versatility. Even so, adoption to date has largely centered on applications with “batch”-oriented workloads because of the latency imposed by the MapReduce framework. To increase Hadoop’s usefulness and adoption in the business intelligence space, where users need fast, interactive response times when they ask a question, a new approach was needed.
Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope; its name comes from the Zulu language meaning “gazelle”. Like elephants, it is found in savannas, and this may be the link with Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache Hadoop project, launched in beta at Strata last October and just released in version 1.0.
Our thanks to Kevin Spurway, Senior Vice President of Marketing for MicroStrategy Inc., for the guest post below:
Squeezing insight from Big Data isn’t easy. It’s a delicate balance between scalability, performance, and cost effectiveness across an entire architecture, spanning everything from data storage to mobile app consumption. That’s why MicroStrategy and Cloudera have been working closely together from a technology standpoint. And, that’s why we’re proud to stand as a launch partner, certifying the integration between Cloudera’s new Impala project and our core MicroStrategy enterprise analytics platform.
Impala is a giant step toward an era of highly cost-effective interactive analytics for Hadoop-based Big Data.
This week represents quite a milestone for Cloudera and, at least we’d like to believe, the Hadoop ecosystem at large: the general availability release of Cloudera Impala. Since we launched the Impala beta program last fall, I’ve been fortunate enough to work with many of the 40+ early adopters who’ve been testing this near-real-time SQL-on-Hadoop engine in an effort to learn about their use cases and keep tabs on early experiences with the tool.
Customers running Impala today span a variety of industries, from large biotech company to online travel provider to digital advertiser to major financial institution, and each one has a unique use case for Impala. Stay tuned to learn more about their various use cases.
On Monday April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.
Over the past few months, Cloudera has been hard at work along with the SAS team to integrate a number of SAS products with Apache Hadoop, delivering the ability for our customers to use these tools in their interaction with data on the Cloudera platform. In this post, we will delve into the major mechanisms that are available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.
SAS/ACCESS to Hadoop
In October 2012, we introduced the Impala project, at that time the first known effort to bring a modern, open source, distributed SQL query engine to Apache Hadoop. Our release of source code and a beta implementation were met with widespread acclaim — and later inspired similar efforts across the industry that now measure themselves against the Impala standard.
Today, we are proud to announce the first production drop of Impala (download here), which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.
It has been an exciting couple of days for new product announcements at Cloudera — exciting especially for me as the edges of the new platform for big data we have been talking about since Strata + Hadoop World 2012 come into focus.
Yesterday, Cloudera announced a strategic alliance with SAS. SAS is the industry leader in business analytics software, especially predictive analytics. Ninety percent of the Fortune 100 run SAS today. We have been working with SAS to make a number of its products work well with Cloudera including SAS Access, SAS Visual Analytics, and SAS High Performance Analytics (HPA). SAS HPA is an excellent case example of the future direction of Apache Hadoop as a data management platform:
It’s time for me to give you a quarterly update (here’s the one for Q1) about where to find tech talks by Cloudera employees in 2013. Committers, contributors, and other engineers will travel to meetups and conferences near and far to do their part in the community to make Apache Hadoop a household word!
(Remember, we’re always ready to assist your meetup by providing speakers, sponsorships, and schwag.)
As a follow-up to a previous post about the Impala demo he built during Data Hacking Day, Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post below.
A little while ago I blogged about (and open sourced) a Cloudera Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. [Note: instance live only for one week!] You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
Deploying Impala on EC2
Editor’s Note (added Feb. 25, 2015): For releases beyond 4.5, Cloudera recommends the use of Cloudera Director for deploying CDH in cloud environments.
Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible (for testing and development purposes only; not supported for production workloads). It is thus currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.
Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.
Below you’ll find the official announcement from Cloudera and Twitter about Parquet, an efficient general-purpose columnar file format for Apache Hadoop.
Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process Cloudera has now sponsored the development of seven new open source projects including Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache ZooKeeper, and Apache HBase to come join Cloudera.
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code generation eliminates and go over in more detail one of the queries in the TPC-H workload where code generation improves overall query speeds by close to 3x.
Why Code Generation?
The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:
This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie, on his personal blog. Matt has graciously permitted us to re-publish it here for your convenience.
Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
At Stripe, as at most other companies, it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
I am pleased to announce the release of Cloudera Impala Beta (version 0.4) and Cloudera Manager 4.1.3. Key enhancements in each release are:
Cloudera Impala Beta (version 0.4)
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
(Update 2/6/2013 – Sorry, this event is sold out!)
With Strata Conference 2013 coming to town (Feb. 26-28, in Santa Clara, Calif.), we thought it would be a great opportunity to open our Palo Alto office’s doors for a pre-conference “Data Hacking Day” on Monday, Feb. 25!
In this installment of “Meet the Engineer”, meet Marcel Kornacker, the architect of the Cloudera Impala open-source real-time query engine for Apache Hadoop.
In this installment of “Meet the Engineer”, meet Nong Li, a software engineer working on the open-source Cloudera Impala real-time query engine.
What do you do at Cloudera?
It’s been an exciting month and a half since the launch of the Cloudera Impala (the new open source distributed query engine for Apache Hadoop) beta, and we thought it’d be a great time to provide an update about what’s next for the project – including our product roadmap, release schedule and open-source plan.
First of all, we’d like to thank you for your enthusiasm and valuable beta feedback. We’re actively listening and have already fixed many of the bugs reported, captured feature requests for the roadmap, and updated the Cloudera Impala FAQ based on user input.
At Cloudera, we take great pride in drinking our own champagne. That pride extends to our support team, in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
I am pleased to announce the release of Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2. Key enhancements in each release are:
Cloudera Impala Beta (version 0.3)
The beta release of Cloudera Impala, the first (and open source) real-time query engine for Apache Hadoop, has been out in the wild (in binary as well as VM forms) for over a month now, and users have had time to get up-close and hands-on. Consequently, we’re beginning to see some fascinating self-published observations and guides.
Since the Cloudera Impala announcement of a few weeks ago, we’ve been busy partnering-up with Hadoop meetups around the country (and beyond) to bring Impala tech talks directly to the community. Here’s the list for the remainder of 2012, thus far:
I am pleased to announce the release of Cloudera Impala Beta (version 0.2) and Cloudera Manager 4.1.1. These are both enhancement releases to make bug fixes available quickly. Key enhancements in each release are:
Cloudera Impala Beta (version 0.2)
[Updated Nov. 26, 2012: Sorry, this event has reached capacity and is now closed.]
Please join us in New York on Nov. 29, 2012, for a unique opportunity to hear from industry icons Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah) and Josh Wills (@josh_wills) as they discuss their approach to Data Science and how it transformed business for companies like Facebook, Yahoo! and Google. You will also hear more about Cloudera Enterprise: The Platform for Big Data powered by Cloudera Impala, which takes Hadoop “beyond batch” and into the world of real-time interactivity.
I am very pleased to announce the availability of Cloudera Manager 4.1. This release adds support for the Cloudera Impala beta release, and management and monitoring of key CDH features.
Here are the highlights of Cloudera Manager 4.1:
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness on GitHub right now.
Today we’re proud to announce a new addition to the Apache Hadoop ecosystem: Cloudera Impala, a parallel SQL engine that runs natively on Hadoop storage. The salient points are: