Cloudera Developer Blog · Impala Posts
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Apache Sentry (incubating) is a highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It currently works out of the box with Apache Hive and Cloudera Impala. In this blog post, you will learn how to use Sentry with Hive.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared-nothing parallelization of analytics with the new user-defined aggregations (UDA) framework in Impala 1.2 to achieve big data scalability. This initial release supports logistic regression, support vector machines (SVMs), and linear regression.
Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.
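To make the idea of a shared-nothing aggregate concrete, here is a minimal Python sketch of the init/update/merge/finalize pattern that aggregation frameworks of this kind typically follow. It is an illustration only, not the code from the release described above: it fits a one-variable linear regression from closed-form sufficient statistics, a technique chosen here for brevity, and every name in it (RegressionState, init, update, merge, finalize) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegressionState:
    """Running sums that are enough to fit y = slope * x + intercept."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0


def init():
    """Create an empty state; each worker starts with its own copy."""
    return RegressionState()


def update(state, x, y):
    """Fold one (x, y) row into a worker's local state."""
    state.n += 1
    state.sum_x += x
    state.sum_y += y
    state.sum_xx += x * x
    state.sum_xy += x * y
    return state


def merge(a, b):
    """Combine two partial states; no raw rows are exchanged."""
    return RegressionState(
        a.n + b.n,
        a.sum_x + b.sum_x,
        a.sum_y + b.sum_y,
        a.sum_xx + b.sum_xx,
        a.sum_xy + b.sum_xy,
    )


def finalize(state):
    """Turn the merged sums into (slope, intercept) by least squares."""
    denominator = state.n * state.sum_xx - state.sum_x ** 2
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denominator
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return slope, intercept


# Invented data: y = 2x + 1, split across two hypothetical workers.
partition_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
partition_b = [(3.0, 7.0), (4.0, 9.0), (5.0, 11.0)]

state_a = init()
for x, y in partition_a:
    update(state_a, x, y)

state_b = init()
for x, y in partition_b:
    update(state_b, x, y)

slope, intercept = finalize(merge(state_a, state_b))
```

Because each worker only ever touches its own partition and ships a small fixed-size state to be merged, the same structure scales to any number of partitions.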
As a delicious appetizer for the Strata Conference + Hadoop World next week (sold out!), O’Reilly Media has partnered with us to create and publish a new e-book specifically intended for technical end-users of Cloudera Impala, the open source distributed query engine for Apache Hadoop.
Authored by Cloudera’s own John Russell, the e-book provides a 30-page tour of Impala’s internals and architecture, as well as common usage patterns intended for mainstream (SQL) users.
As John explains in his introductory post on O’Reilly’s Strata blog:
The following Parquet blog post was originally published by Salesforce.com Lead Engineer and Apache Pig Committer Prashant Kommireddi (@pRaShAnT1784). Prashant has kindly given us permission to re-publish below. Parquet is an open source columnar storage format co-founded by Twitter and Cloudera.
Parquet is a columnar storage format for Apache Hadoop that uses the concept of repetition/definition levels borrowed from Google Dremel. It provides efficient encoding and compression schemes, and these are more effective because they are applied on a per-column basis: compression is better because the values in a column all share the same type, and encoding is better because values within a column are often identical and repeated. Here is a nice blog post from Julien Le Dem of Twitter describing Parquet internals.
Parquet can be used by any project in the Hadoop ecosystem; integrations are provided for MapReduce (MR), Pig, Hive, Cascading, and Cloudera Impala.
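The per-column benefit described above can be illustrated with a small Python sketch. It applies plain run-length encoding to the same records laid out row by row and column by column. This is not Parquet's actual on-disk encoding, and the records and the run_length_encode helper are invented for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical records of (country, status), invented for illustration.
rows = [
    ("US", "active"), ("US", "active"), ("US", "inactive"),
    ("CA", "inactive"), ("CA", "inactive"), ("CA", "inactive"),
]

# Row-oriented layout: values from different columns are interleaved,
# so neighbouring values rarely repeat.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: each column's values sit next to each other,
# so repeated values form long runs.
country_column = [country for country, _ in rows]
status_column = [status for _, status in rows]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(country_column) + run_length_encode(status_column)

print(len(row_runs), "runs in row layout")
print(len(column_runs), "runs in column layout")
```

On this toy input the interleaved row layout never produces a run longer than one value, while each column collapses to two runs, which is the effect the paragraph above attributes to storing values column by column.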
The following post was originally published by the Hue Team at the Hue blog in a slightly different form.
Hue, the open source web GUI that makes Apache Hadoop easy to use, has supported Cloudera Impala since its inception to enable fast, interactive SQL queries from within your browser. In this post, you’ll see a demo of Hue’s Impala app in action and explore its impressive query speed for yourself.
Impala App Demo
The demo below compares some queries across Hue’s Apache Hive and Impala applications. (Impala supports a broad range of SQL and HiveQL commands.) Although this comparison is not scientific, it does reflect general user experience across common cases.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.
But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of feedback and validation about Impala, impressive in quality as well as quantity. To date, at least one person in each of approximately 4,500 unique organizations around the world has downloaded the Impala binary. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription, including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.
Furthermore, based on the reaction from other vendors in the data management space, few observers would dispute the notion that Impala has made low-latency, interactive SQL queries for Hadoop as important a customer requirement as the high-latency, batch-oriented SQL queries enabled by Apache Hive. That’s a great development for Hadoop users everywhere!
What Was Delivered in Impala 1.0/1.1
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.
Cloudera Impala has made huge progress since its initial announcement – and there’s even more good news on the roadmap. To learn more, plan to attend an Impala meetup hosted by Cloudera in its San Francisco offices on the evening of Aug. 20:
We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.