Cloudera Blog · Impala Posts
As a follow-up to a previous post about the Impala demo he built during Data Hacking Day, Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post below.
A little while ago I blogged about (and open sourced) a Cloudera Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. [Note: instance live only for one week!] You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
Deploying Impala on EC2
While there are many tools to deploy a Hadoop cluster on EC2 – like Apache Whirr, or even Cloudera Manager – I only wanted to use a single instance for the entire cluster. Starting from the base Ubuntu (Precise) image, I added Cloudera’s apt repos, and installed the single node configuration. Impala doesn’t support using Derby for the Hive metastore, so I installed MySQL and configured Hive to use it instead. Then I installed Impala using Cloudera’s instructions. Impala, and all of the Hadoop daemons, are running comfortably on one M3 2XLarge EC2 instance. Given our modest demands, this may actually be overkill; I over-spec’ed the server trying to find a (now-obvious) performance problem involving short-circuit reads.
Cloudera Manager 4.5 includes a new express installation wizard for Amazon Web Services (AWS) EC2. (This feature is also available in Cloudera Manager Free Edition.) Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the new open source distributed query engine for Apache Hadoop) on EC2 as easily as possible - and thus is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
The new distinguishing feature is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to worry about launching the instances, authorizing SSH keys, and configuring a firewall. All this can now be done from within Cloudera Manager!
Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.
The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.
Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.
Our hosts helpfully suggested some datasets, including the DEBS 2013 Grand Challenge data. This dataset contains the position of all the players and ball during a football match; our project was to map the data for a given span of time and player onto a map of the field, to create a heatmap of how much time that player spent at different positions.
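To make the heatmap idea concrete, here is a minimal sketch of the binning step in Python. It is not the code from the demo: the function name `build_heatmap`, the field bounds, and the grid size are all hypothetical, and it assumes position samples arrive at a fixed rate, so the number of samples in a cell is proportional to the time spent there.

```python
from typing import Iterable, List, Tuple


def build_heatmap(
    positions: Iterable[Tuple[float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cols: int,
    rows: int,
) -> List[List[int]]:
    """Count position samples per grid cell (hypothetical helper).

    Assumes samples are taken at a fixed rate, so each count is
    proportional to the time spent in that cell.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * cols for _ in range(rows)]
    for x, y in positions:
        # Ignore samples that fall outside the field boundaries.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Scale the coordinate to a cell index; clamp the far edge
        # so that x == x_max lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * cols), cols - 1)
        row = min(int((y - y_min) / (y_max - y_min) * rows), rows - 1)
        grid[row][col] += 1
    return grid
```

In a setup like the one described, the samples for a given player and time span would come back from an Impala query, and the resulting grid would be drawn over an image of the field.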
Below you’ll find the official announcement from Cloudera and Twitter about Parquet, an efficient general-purpose columnar file format for Apache Hadoop.
Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process, Cloudera has sponsored the development of seven new open source projects: Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache Zookeeper, and Apache HBase to come join Cloudera.
Our investment in open source is not altruistic — we think it is good business. Today, Cloudera employees contribute more patches to the Apache Hadoop ecosystem than every other software vendor combined. Meanwhile more enterprises have adopted our open source platform than every other Hadoop distribution combined. We do not think it is a coincidence that these two things are simultaneously true.
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
We’re now supporting several hundred production Hadoop clusters. In doing so we’ve had to make a lot of advances in the functionality, reliability and manageability of the Hadoop platform. Even with these improvements, customers have been traditionally reluctant to run certain data and applications on the Apache Hadoop platform. The new products we are announcing today were designed to remove these obstacles to adoption.
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
In the late 1980s, the first BI data stacks started to materialize, and they typically looked like Figure 1.
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.
Why Code Generation?
The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:
select count(*) from tbl where col like '%XYZ%'
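As a rough illustration of the gap between a general-purpose engine and a program written only for this query, the sketch below contrasts the two in Python. It is not Impala's implementation: the tuple-based expression tree and the function names are hypothetical, chosen only to show the per-row interpretation overhead that a query-specific program avoids.

```python
def eval_expr(node, row):
    """Generic path: walk an expression tree for every row."""
    op = node[0]
    if op == "col":        # ("col", column_name)
        return row[node[1]]
    if op == "const":      # ("const", value)
        return node[1]
    if op == "contains":   # ("contains", haystack_expr, needle_expr)
        return eval_expr(node[2], row) in eval_expr(node[1], row)
    raise ValueError(f"unknown operator: {op}")


def generic_count(rows, predicate):
    """Count matching rows by interpreting the predicate tree per row."""
    return sum(1 for row in rows if eval_expr(predicate, row))


def specialized_count(rows):
    """Query-specific path: the substring test on XYZ is hard-coded,
    so there is no operator dispatch or tree walk per row."""
    return sum(1 for row in rows if "XYZ" in row["col"])


rows = [{"col": "aXYZb"}, {"col": "abc"}, {"col": "XYZ"}]
predicate = ("contains", ("col", "col"), ("const", "XYZ"))
print(generic_count(rows, predicate), specialized_count(rows))
```

Both functions return the same count; the difference is how much work is repeated for each row. Producing something like the specialized version at runtime, once the query is known, is the motivation for code generation.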
This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to republish it here for your convenience.
Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.
Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or Apache HBase.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
At Stripe, as at most other companies, it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
Up to now, the answer has often been Apache Hive, which at least made it easy to express many of these queries. Unfortunately, Hive queries are typically very slow. Cloudera Impala provides a similar front-end while being orders of magnitude faster, and we’ve found it immensely useful in many different situations at Stripe. With the near real-time results, the notion of performing programmatic (and not just ad-hoc) queries has now become more attractive.