Category Archives: Parquet

Using Apache Parquet at AppNexus

Categories: Guest Impala Parquet Performance

Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.

At AppNexus, over 2 million log events are ingested into our data pipeline every second. Log records are sent from upstream systems as Protobuf messages. Raw logs are compressed with Snappy when stored on HDFS.


Converting Apache Avro Data to Parquet Format in Apache Hadoop

Categories: Avro Guest Hadoop Parquet

Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.

A customer of mine wants the best of both worlds: to keep working with his existing Apache Avro data, with all of the advantages that it confers, while also benefiting from the predicate push-down features that Parquet provides. How can the two be reconciled?

For more information about combining these formats,


Progress Report: Community Contributions to Parquet

Categories: Community Parquet

Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.

Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.


Most of Cloudera’s recent contributions have focused on fixing bugs reported by its growing number of users.


This Month in the Ecosystem (May 2014)

Categories: Hadoop Parquet Platform Security & Cybersecurity Spark

Welcome to our ninth edition of “This Month in the Ecosystem,” a digest of highlights from May/early June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

More good news!

  • Hadoop Summit San Jose 2014 wrapped up. Every attendee will have a different lens on the experience, but for me, the main takeaway was the increasingly mainstream presence of the enterprise juggernaut called Apache Hadoop.


New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead

Categories: Impala Parquet Performance

Impala continues to demonstrate performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.

In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data,
