We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.
Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.
What is Parquet and Columnar Storage?
Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state of the art columnar storage layer that can be taken advantage of by existing Hadoop frameworks, and can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.
The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…) store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:
- Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.
- Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.
These effects combine to make columnar storage a very attractive option for analytical processing.
Implementing a columnar storage format that can be used by the many various Hadoop-based processing engines is tricky. Not all data people store in Hadoop is a simple table — complex nested structures abound. For example, one of Twitter’s common internal datasets has a schema nested seven levels deep, with over 80 leaf nodes. Slicing such objects into a columnar structure is non-trivial, and we chose to use the approach described by Google engineers in their paper Dremel: Interactive Analysis of Web-Scale Datasets. Another complexity derives from the fact that we want it to be relatively easy to plug new processing frameworks into Parquet.
Despite these challenges, we are pleased with the results so far: our approach has resulted in integration with Hive, Pig, Cascading, Impala, and an in-progress implementation with Drill, and is currently in production at Twitter.
What’s in the Parquet 1.0 Release
Parquet 1.0 is available for download on GitHub and via Maven Central. It provides the following features:
- Apache Hadoop Map-Reduce Input and Output formats
- Apache Pig Loaders and Storers
- Apache Hive SerDes
- Cascading Shemes
- Impala support
- Self-tuning dictionary encoding
- Dynamic Bit-Packing / RLE encoding
- Ability to work directly with Avro records
- Ability to work directly with Thrift records
- Support for both Hadoop 1 and Hadoop 2 APIs
28 contributors from multiple organizations (Twitter, Cloudera, Criteo, UC Berkeley AMPLab, Stripe and others) contributed to this release.
Improvements since Initial Parquet Release
When we announced Parquet, we encouraged the greater Hadoop community to contribute to the design and implementation of the format. Parquet 1.0 features many contributions from the community as well as the initial core team of committers; we will highlight two of these improvements below.
Dictionary encoding
The ability to efficiently encode columns in which the number of unique values is fairly small (10s of thousands) can lead to a significant compression and processing speed boost. Nong Li and Marcel Kornacker of Cloudera teamed up with Julien Le Dem of Twitter to define a dictionary encoding specification, and implemented it both in Java and C++ (for Impala). Parquet’s dictionary encoding is automatic, so users do not have to specify it — Parquet will dynamically turn it on and off as applicable, given the data it is compressing.
Hybrid bit packing and RLE encoding
Columns of numerical values can often be efficiently stored using two approaches: bit packing and run-length encoding (RLE):
- Bit packing uses the fact that small integers do not need a full 32 or 64 bits to be represented, and packs multiple values into the space normally occupied by a single value. There are multiple ways to do this, but we use a modified version of Daniel Lemire’s JavaFastPFOR library (read more about it here).
- Run-length encoding turns “runs” of the same value, meaning multiple occurrences of the same value in a row, into just a pair of numbers: the value, and the number of times it is repeated.
Our hybrid implementation of bit-packing and RLE monitors the data stream, and dynamically switches between the two types of encoding, depending on what gives us the best compression. This is extremely effective for certain kinds of integer data and combines particularly well with dictionary encoding.
A Growing Community
One of the major goals of Parquet is to provide a columnar storage format for Hadoop that can be used by many projects and companies, rather than being tied to a specific tool. We are heartened to find that so far, the bet on open-source collaboration and tool independence has paid off. This release includes quality contributions from 18 developers, who are affiliated with a number of different companies and institutions.
Looking Forward
We are far from done, of course; there are many improvements we would like to make and features we would like to add. To learn more about these items, you can see our roadmap on the README or look for the “pick me up!” label on GitHub. You can also join our mailing list at parquet-dev@googlegroups.com and tweet at us @ParquetFormat to join the discussion.