Cloudera Developer Blog · Impala Posts
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process Cloudera has now sponsored the development of seven new open source projects: Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache ZooKeeper, and Apache HBase to come join Cloudera.
Our investment in open source is not altruistic — we think it is good business. Today, Cloudera employees contribute more patches to the Apache Hadoop ecosystem than every other software vendor combined. Meanwhile more enterprises have adopted our open source platform than every other Hadoop distribution combined. We do not think it is a coincidence that these two things are simultaneously true.
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
We’re now supporting several hundred production Hadoop clusters. In doing so we’ve had to make a lot of advances in the functionality, reliability and manageability of the Hadoop platform. Even with these improvements, customers have traditionally been reluctant to run certain data and applications on the Apache Hadoop platform. The new products we are announcing today were designed to remove these obstacles to adoption.
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
In the late 1980s, the first BI data stacks started to materialize, and they typically looked like Figure 1.
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code generation eliminates and look in more detail at one of the queries in the TPC-H workload where code generation improves overall query speed by close to 3x.
Why Code Generation?
The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:
select count(*) from tbl where col like '%XYZ%'
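To make the contrast concrete, here is a minimal sketch of the two approaches in Python. It is not Impala's implementation (Impala generates native code at runtime); it only illustrates the kind of per-row interpretation overhead a query-specific program avoids. The function names, the operator strings, and the assumption that `col` is the first field of each row are all hypothetical.

```python
from typing import Callable, Iterable, Sequence

Row = Sequence[str]


def make_generic_filter(op: str, col_index: int, pattern: str) -> Callable[[Row], bool]:
    """General-purpose path: the predicate is interpreted for every row."""
    def predicate(row: Row) -> bool:
        value = row[col_index]
        # Dispatch on the operator happens again for each row, even though
        # the operator never changes during the query.
        if op == "like_contains":
            return pattern in value
        if op == "equals":
            return value == pattern
        raise ValueError(f"unsupported operator: {op}")
    return predicate


def count_generic(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> int:
    """Count rows that satisfy an arbitrary predicate."""
    return sum(1 for row in rows if predicate(row))


def count_xyz_specialized(rows: Iterable[Row]) -> int:
    """Written only for: select count(*) from tbl where col like '%XYZ%'.

    Assumes (hypothetically) that col is the first field of each row.
    The column position, the operator, and the pattern are all fixed.
    """
    count = 0
    for row in rows:
        if "XYZ" in row[0]:
            count += 1
    return count


# Tiny made-up table with a single string column.
rows = [("abcXYZdef",), ("xyz",), ("XYZ",), ("nothing",)]
generic = count_generic(rows, make_generic_filter("like_contains", 0, "XYZ"))
specialized = count_xyz_specialized(rows)
```

Both paths return the same count; the specialized one simply has no operator dispatch or indirection left to pay for on each row.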
This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to re-publish it here for your convenience.
Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.
Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or Apache HBase.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
At Stripe, as at most other companies, it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
Up to now, the answer has often been Apache Hive, which at least made it easy to express many of these queries. Unfortunately, Hive queries are typically very slow. Cloudera Impala provides a similar front-end while being orders of magnitude faster, and we’ve found it immensely useful in many different situations at Stripe. With the near real-time results, the notion of performing programmatic (and not just ad-hoc) queries has now become more attractive.
Programmatic Access with Ruby
I am pleased to announce the release of Cloudera Impala Beta (version 0.4) and Cloudera Manager 4.1.3. Key enhancements in each release are:
Cloudera Impala Beta (version 0.4)
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
(Update 2/6/2013 – Sorry, this event is sold out!)
With Strata Conference 2013 coming to town (Feb. 26-28, in Santa Clara, Calif.), we thought it would be a great opportunity to open our Palo Alto office’s doors for a pre-conference “Data Hacking Day” on Monday, Feb. 25!
Participants will use Cloudera Impala, the open-source, real-time query engine for Apache Hadoop, to hack on a rich public data set. After forming teams, you’ll compete to see whose project will earn enough votes to win the data-hacking trophy for the day. All members of the winning team will get free hard copies of Eric Sammer’s coveted O’Reilly book, Hadoop Operations.
In this installment of “Meet the Engineer”, meet Marcel Kornacker, the architect of the Cloudera Impala open-source real-time query engine for Apache Hadoop.
What do you do at Cloudera?