Native Parquet Support Comes to Apache Hive
Bringing Parquet support to Hive was a community effort that deserves congratulations!
Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that post, Parquet encodes data extremely efficiently, following the approach described in Google’s original Dremel paper. (For more technical details on the Parquet format, read "Dremel made simple with Parquet", or go directly to the open and community-driven Parquet Format specification.)
Before discussing the Parquet Hive integration, it’s worth noting how widely Parquet has been adopted across the Hadoop ecosystem. Parquet integrates with the following engines:
- Cloudera Impala
- Apache Crunch
- Apache Drill
- Apache Hadoop MapReduce
- Apache Hive (0.10, 0.11, 0.12, and 0.13)
- Apache Pig
- Apache Spark
- Apache Tajo (planned)
and the following data description software:
- Apache Avro
- Apache Thrift
- Google Protocol Buffers (in code review)
When Parquet was announced, Criteo stepped up to create the Parquet Hive integration. Initially this integration was hosted within the Parquet project and shipped with CDH 4.5. However, as the momentum behind Parquet grew, users wanted to use Parquet with a variety of Hive versions. Because Hive does not have well-defined public/private APIs, supporting several Hive versions from outside the project was difficult, and the Parquet team determined that native integration within the Hive project would be easier to maintain. Furthermore, as can be seen below, native integration greatly simplifies the CREATE TABLE command.
As such, the Parquet team decided to move the Parquet Hive integration into the Hive project via HIVE-5783. A diverse set of Parquet and Hive contributors came together to commit native Parquet support to Hive 0.13. Most notably, Criteo engineers Justin Coffey, Mickaël Lacour, and Remy Pecqueur donated the Hive Parquet integration to the Hive project.
The end result of this work is that users of Hive 0.13 and CDH 5 can easily create Parquet tables in Hive:
CREATE TABLE parquet_test ( id int, str string, mp MAP<STRING,STRING>, lst ARRAY<STRING>, strct STRUCT<A:STRING,B:STRING>) PARTITIONED BY (part string) STORED AS PARQUET;
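Once a table is declared with STORED AS PARQUET, it is loaded and queried like any other Hive table. The following is a minimal sketch of that workflow rather than an excerpt from the integration itself; the table names staging_text and parquet_simple, their columns, and the partition value 'a' are hypothetical and chosen only for illustration.

```sql
-- Hypothetical plain-text staging table (name and columns are assumptions).
CREATE TABLE staging_text (id int, str string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical Parquet-backed table using the native Hive 0.13 syntax.
CREATE TABLE parquet_simple (id int, str string)
PARTITIONED BY (part string)
STORED AS PARQUET;

-- Rewrite the staged rows as Parquet into a single static partition.
INSERT OVERWRITE TABLE parquet_simple PARTITION (part = 'a')
SELECT id, str FROM staging_text;

-- Query the Parquet table like any other Hive table.
SELECT str, count(*) FROM parquet_simple WHERE part = 'a' GROUP BY str;

-- Inspect the SerDe and input/output formats Hive recorded for the table.
DESCRIBE FORMATTED parquet_simple;
```

The DESCRIBE FORMATTED output is a convenient way to confirm which SerDe and file format classes a given Hive installation associated with the table.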
Users of CDH 4.5 and Hive 0.10, 0.11, and 0.12 can continue to use Parquet Hive from the Parquet project proper by using the older, more verbose CREATE TABLE syntax. To create a table in Hive 0.10, 0.11, or 0.12, use the syntax below:
CREATE TABLE parquet_test ( id int, str string, mp MAP<STRING,STRING>, lst ARRAY<STRING>, strct STRUCT<A:STRING,B:STRING>) PARTITIONED BY (part string) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
Thanks to everyone who contributed to this work!
Brock Noland is a Software Engineer at Cloudera and a Hive Committer.