Cloudera Engineering Blog · Impala Posts
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
Developers, rejoice: Impala is now available on EMR for testing and evaluation.
Very recently, Amazon Web Services announced support for running Cloudera Impala queries on its Elastic MapReduce (EMR) service. This is very good news for EMR users — as well as for users of other platforms interested in kicking Impala’s tires in a friction-free way. It’s also yet another sign that Impala is rapidly being adopted across the ecosystem as the gold standard for interactive SQL and BI queries on Apache Hadoop.
The new RImpala package brings the speed and interactivity of Impala to queries from R.
Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.
As a delicious appetizer for the Strata Conference + Hadoop World next week (sold out!), O’Reilly Media has partnered with us to create and publish a new e-book specifically intended for technical end-users of Cloudera Impala, the open source distributed query engine for Apache Hadoop.
Authored by Cloudera’s own John Russell, the e-book provides a 30-page tour of Impala’s internals and architecture, as well as common usage patterns intended for mainstream (SQL) users.
The following Parquet blog post was originally published by Salesforce.com Lead Engineer and Apache Pig Committer Prashant Kommireddi (@pRaShAnT1784). Prashant has kindly given us permission to re-publish below. Parquet is an open source columnar storage format co-founded by Twitter and Cloudera.
Parquet is a columnar storage format for Apache Hadoop that uses the concept of repetition/definition levels borrowed from Google Dremel. It provides efficient encoding and compression schemes, the efficiency being improved due to application of aforementioned on a per-column basis (compression is better as column values would all be the same type, encoding is better as values within a column could often be the same and repeated). Here is a nice blog post from Julien Le Dem of Twitter describing Parquet internals.
The following post was originally published by the Hue Team at the Hue blog in a slightly different form.
Hue, the open source web GUI that makes Apache Hadoop easy to use, has supported Cloudera Impala since its inception to enable fast, interactive SQL queries from within your browser. In this post, you’ll see a demo of Hue’s Impala app in action and explore its impressive query speed for yourself.
Impala App Demo
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.
But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of feedback and validation about Impala — copious in its quality as well as quantity. At least one person in approximately 4,500 unique organizations around the world have downloaded the Impala binary, to date. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription — including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.