Native Parquet Support Comes to Apache Hive

Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient, ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that post, Parquet encodes data extremely efficiently, using the techniques described in Google’s original Dremel paper. (For more technical details on the Parquet format, read “Dremel made simple with Parquet,” or go directly to the open, community-driven Parquet Format specification.)

Before getting into the Parquet Hive integration, it’s worth noting how widely Parquet has been adopted across the Hadoop ecosystem. Parquet integrates with the following engines:

  • Cloudera Impala
  • Apache Crunch
  • Apache Drill
  • Apache Hadoop MapReduce
  • Apache Hive (0.10, 0.11, 0.12, and 0.13)
  • Apache Pig
  • Apache Spark
  • Apache Tajo (planned)
  • Cascading
  • Pivotal HAWQ

and the following data description software:

  • Apache Avro
  • Apache Thrift
  • Google Protocol Buffers (in code review)

When Parquet was announced, Criteo stepped up to create the Parquet Hive integration. Initially, this integration was hosted within the Parquet project and shipped with CDH 4.5. However, as the momentum behind Parquet grew, users wanted to use Parquet with a variety of Hive versions. Because Hive does not have well-defined public/private APIs, the Parquet team determined that native integration within the Hive project would be easier to maintain. Furthermore, as can be seen below, native integration greatly simplifies the CREATE TABLE command.

As such, the Parquet team decided to move the Parquet Hive integration into the Hive project via HIVE-5783. A diverse set of Parquet and Hive contributors came together to commit native Parquet support to Hive 0.13. Most notably, Criteo engineers Justin Coffey, Mickaël Lacour, and Remy Pecqueur donated the Hive Parquet integration to the Hive project.

The end result of this work is that users of Hive 0.13 and CDH 5 can easily create Parquet tables in Hive:
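Here is a minimal sketch of the native syntax that HIVE-5783 adds; the table and column names are only illustrative:

    -- Hive 0.13 / CDH 5: Parquet is a first-class storage format
    CREATE TABLE parquet_users (
      id   INT,
      name STRING
    )
    STORED AS PARQUET;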


Users of CDH 4.5 and Hive 0.10, 0.11, and 0.12 can continue to use Parquet Hive from the Parquet project proper via the older, more verbose CREATE TABLE syntax. To create a table in Hive 0.10, 0.11, or 0.12, use the syntax below:
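The following is a sketch of that older syntax, spelling out the SerDe and the Deprecated* input/output format classes referenced in the comments below. The parquet.hive.serde.ParquetHiveSerDe class name assumes the classes shipped in the parquet-hive bundle, and the table and column names are again illustrative:

    -- Hive 0.10-0.12 / CDH 4.5: wire up the Parquet SerDe and the
    -- Deprecated* input/output formats from the parquet-hive bundle
    CREATE TABLE parquet_users (
      id   INT,
      name STRING
    )
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
      OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';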


Thanks to everyone who contributed to this work!

Brock Noland is a Software Engineer at Cloudera and a Hive Committer.


4 Responses
  • Rakesh Neelavar Rao / February 21, 2014 / 3:37 PM

I get an error when I use “parquet.hive.MapredParquetOutputFormat” as the output format in CDH 4.5:
    OK FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

One of the other Cloudera docs mentioned using “parquet.hive.DeprecatedParquetOutputFormat” instead.

  • Brock / February 23, 2014 / 6:06 PM

    CDH 4.5 users should use “parquet.hive.DeprecatedParquetInputFormat” and “parquet.hive.DeprecatedParquetOutputFormat”. We updated the blog post to reflect this. A future version of CDH will contain the Mapred* classes.

  • Vladimir Rodionov / February 27, 2014 / 10:47 AM

    It would be nice to have some numbers as well (perf, storage).

  • Brock Noland / February 27, 2014 / 11:44 AM

There are performance numbers in this Hadoop World presentation: http://www.slideshare.net/julienledem/parquet-stratany-hadoopworld2013
