Native Parquet Support Comes to Apache Hive


Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient, ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently, following the approach described in Google’s original Dremel paper. (For more technical details on the Parquet format, read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)

Before turning to the Parquet Hive integration, it’s worth noting how widely Parquet has been adopted across the Hadoop ecosystem. Parquet integrates with the following engines:

  • Cloudera Impala
  • Apache Crunch
  • Apache Drill
  • Apache Hadoop MapReduce
  • Apache Hive (0.10, 0.11, 0.12, and 0.13)
  • Apache Pig
  • Apache Spark
  • Apache Tajo (planned)
  • Cascading
  • Pivotal HAWQ

and the following data description software:

  • Apache Avro
  • Apache Thrift
  • Google Protocol Buffers (in code review)

When Parquet was announced, Criteo stepped up to create the Parquet Hive integration. Initially this integration was hosted within the Parquet project and shipped with CDH 4.5. However, as the momentum behind Parquet grew, users wanted to use Parquet with a variety of Hive versions. Because Hive does not have well-defined public/private APIs, supporting many Hive versions from outside the project is difficult, so the Parquet team determined that native integration with the Hive project would be easier to maintain. Furthermore, native integration greatly simplifies the CREATE TABLE command.

As such, the Parquet team decided to move the Parquet Hive integration into the Hive project via HIVE-5783. A diverse set of Parquet and Hive contributors came together to commit native Parquet support to Hive 0.13. Most notably, Criteo engineers Justin Coffey, Mickaël Lacour, and Remy Pecqueur donated the Hive Parquet integration to the Hive project.

The end result of this work is that users of Hive 0.13 and CDH 5 can easily create Parquet tables in Hive:
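As a minimal sketch of the native syntax, the statement below creates a partitioned Parquet table. The table name `parquet_test` and its columns are illustrative assumptions; the `STORED AS PARQUET` clause is the part that native integration provides.

```sql
-- Hive 0.13 / CDH 5: native Parquet support.
-- Table and column names are hypothetical, chosen only for illustration.
CREATE TABLE parquet_test (
  id    INT,
  str   STRING,
  mp    MAP<STRING,STRING>,
  lst   ARRAY<STRING>,
  strct STRUCT<a:STRING,b:STRING>
)
PARTITIONED BY (part STRING)
STORED AS PARQUET;
```

No SerDe or input/output format classes need to be named; Hive resolves them from the `PARQUET` keyword.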


Users of CDH 4.5 and Hive 0.10, 0.11, and 0.12 can continue to use Parquet Hive from the Parquet project proper by using the older, more verbose CREATE TABLE syntax. To create a table in Hive 0.10, 0.11, or 0.12, use the syntax below:
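A sketch of the older form follows. The table name and columns are illustrative assumptions. The input and output format classes are the `Deprecated*` classes recommended for CDH 4.5 in the comments on this post; the SerDe class name `parquet.hive.serde.ParquetHiveSerDe` is the one used by the Parquet project's Hive module of that era and should be checked against the version you have installed. The statement also assumes the Parquet Hive jar is already on Hive's classpath.

```sql
-- Hive 0.10, 0.11, 0.12 / CDH 4.5: Parquet Hive from the Parquet project.
-- Table and column names are hypothetical, chosen only for illustration.
CREATE TABLE parquet_test (
  id    INT,
  str   STRING,
  mp    MAP<STRING,STRING>,
  lst   ARRAY<STRING>,
  strct STRUCT<a:STRING,b:STRING>
)
PARTITIONED BY (part STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT  'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
```

Compared with `STORED AS PARQUET`, every class must be spelled out here, which is the verbosity that native integration removes.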


Thanks to everyone who contributed to this work!

Brock Noland is a Software Engineer at Cloudera and a Hive Committer.


4 responses on “Native Parquet Support Comes to Apache Hive”

  1. Rakesh Neelavar Rao

    I get an error when I use “parquet.hive.MapredParquetOutputFormat” as the output format in CDH 4.5:
    OK FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

    One of the other Cloudera docs mentions using “parquet.hive.DeprecatedParquetOutputFormat” instead.

  2. Brock

    CDH 4.5 users should use “parquet.hive.DeprecatedParquetInputFormat” and “parquet.hive.DeprecatedParquetOutputFormat”. We updated the blog post to reflect this. A future version of CDH will contain the Mapred* classes.
