Bringing Parquet support to Hive was a community effort that deserves congratulations!
Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently and as described in Google’s original Dremel paper. (For more technical details on the Parquet format read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)
Before discussing the Parquet Hive integration,
Migrating from the Hive CLI to Beeline isn’t as simple as changing the executable name, but this post makes it easy nonetheless.
In its original form, Apache Hive was a heavyweight command-line tool that accepted queries and executed them utilizing MapReduce. Later, the tool split into a client-server model, in which HiveServer1 is the server (responsible for compiling and monitoring MapReduce jobs) and Hive CLI is the command-line interface (sends SQL to the server).
Cloudera provides docs and a sample build environment to help you get easily started writing your own Impala UDFs.
User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!
Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:
- Impala is being widely adopted.
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Apache Sentry (incubating) is a highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster.