Category Archives: Parquet

Graduating Apache Parquet

Categories: Guest Parquet

The following post from Julien Le Dem, a tech lead at Twitter, originally appeared on the Twitter Engineering Blog. We bring it to you here for your convenience.

The Apache Software Foundation (ASF) recently announced the graduation of Apache Parquet, a columnar storage format for the Apache Hadoop ecosystem. At Twitter, we’re excited to be a founding member of the project.

Apache Parquet is built to work across programming languages,

Read more

Using Apache Parquet at AppNexus

Categories: Guest Impala Parquet Performance

Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.

At AppNexus, over 2 million log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of Protobuf messages. Raw logs are compressed with Snappy when stored on HDFS.

Read more

Converting Apache Avro Data to Parquet Format in Apache Hadoop

Categories: Avro Guest Hadoop Parquet

Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.

A customer of mine wants the best of both worlds: to keep working with his existing Apache Avro data, with all of the advantages that it confers, while also using the predicate push-down features that Parquet provides. How can the two be reconciled?

For more information about combining these formats,

Read more

Progress Report: Community Contributions to Parquet

Categories: Community Parquet

Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.

Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.

Most of Cloudera’s recent contributions have focused on fixing bugs reported by its growing number of users.

Read more

New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead

Categories: Impala Parquet Performance

Impala continues to demonstrate performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.

In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data,

Read more