Native Parquet Support Comes to Apache Hive

by Brock Noland

Posted in Technical | February 20, 2014 | 2 min read

Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient, ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that post, Parquet encodes data extremely efficiently, using the approach described in Google’s original Dremel paper. (For more technical details on the Parquet format, read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)

Before discussing the Parquet Hive integration, it’s worth noting how widely Parquet has been adopted across the Hadoop ecosystem. Parquet integrates with the following engines:

Cloudera Impala
Apache Crunch
Apache Drill
Apache Hadoop MapReduce
Apache Hive (0.10, 0.11, 0.12, and 0.13)
Apache Pig
Apache Spark
Apache Tajo (planned)
Cascading
Pivotal HAWQ

and the following data description software:

Apache Avro
Apache Thrift
Google Protocol Buffers (in code review)

When Parquet was announced, Criteo stepped up to create the Parquet Hive integration. Initially this integration was hosted within the Parquet project and shipped with CDH 4.5. However, as the momentum behind Parquet grew, users wanted to use Parquet with a variety of Hive versions. Because Hive does not have well-defined public/private APIs, supporting several Hive versions from outside the project is difficult, so the Parquet team determined that native integration with the Hive project would be easier to maintain. Furthermore, as can be seen below, native integration greatly simplifies the CREATE TABLE command.

As such, the Parquet team decided to move the Parquet Hive integration into the Hive project via HIVE-5783. A diverse set of Parquet and Hive contributors came together to commit native Parquet support to Hive 0.13. Most notably, Criteo engineers Justin Coffey, Mickaël Lacour, and Remy Pecqueur donated the Hive Parquet integration to the Hive project.

The end result of this work is that users of Hive 0.13 and CDH 5 can easily create Parquet tables in Hive:

CREATE TABLE parquet_test (
 id int,
 str string,
 mp MAP<STRING,STRING>,
 lst ARRAY<STRING>,
 strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;
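
Once such a table exists, it can be loaded and queried like any other Hive table. The statements below are a minimal sketch, not a recommended workflow: the staging table parquet_test_staging, its column names, and the partition value 'demo' are hypothetical names introduced only for illustration, and the sketch assumes the staging table's columns have the same types, in the same order, as the Parquet table's.

```sql
-- Assumption: parquet_test_staging is a hypothetical, already-populated table
-- whose five columns match parquet_test in order and type.
-- Hive rewrites the selected rows as Parquet files under the 'demo' partition.
INSERT OVERWRITE TABLE parquet_test PARTITION (part = 'demo')
SELECT id, str, mp, lst, strct
FROM parquet_test_staging;

-- Read the data back; filtering on the partition column limits the scan
-- to that partition. 'color' is an illustrative map key.
SELECT id, str, mp['color']
FROM parquet_test
WHERE part = 'demo';
```

Because Parquet is columnar, a query that names only a few columns, as the SELECT above does, is the kind of access pattern the format is designed to serve.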

Users of CDH 4.5 and Hive 0.10, 0.11, and 0.12 can continue to use the Parquet Hive integration from the Parquet project itself, with the older, more verbose CREATE TABLE syntax. To create a table in Hive 0.10, 0.11, or 0.12, use the syntax below:

CREATE TABLE parquet_test (
 id int,
 str string,
 mp MAP<STRING,STRING>,
 lst ARRAY<STRING>,
 strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

Thanks to everyone who contributed to this work!

Brock Noland is a Software Engineer at Cloudera and a Hive Committer.
