Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.
Over the last few months, several Cloudera customers have provided the feedback that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reasons, CDH 5.5 contains new features that make those configuration problems go away.
Auto-Detection of HDFS Block Size
The new support for complex types in Impala makes running analytic workloads considerably simpler.
Impala 2.3 (shipping starting in Cloudera Enterprise 5.5) contains support for querying complex types in Apache Parquet tables, specifically ARRAY, MAP, and STRUCTs. This capability enables users to query against naturally nested data sets without having to perform ETL to flatten them. This feature provides a few major benefits, including:
- It removes additional ETL and data modeling work to flatten data sets.
The following post from Julien Le Dem, a tech lead at Twitter, originally appeared in the Twitter Engineering Blog. We bring it to you here for your convenience.
ASF, the Apache Software Foundation, recently announced the graduation of Apache Parquet, a columnar storage format for the Apache Hadoop ecosystem. At Twitter, we’re excited to be a founding member of the project.
Apache Parquet is built to work across programming languages,
Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.
At AppNexus, over 2MM log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of Protobuf messages. Raw logs are compressed in Snappy when stored on HDFS.
Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.
A customer of mine wants to take advantage of both worlds: work with his existing Apache Avro data, with all of the advantages that it confers, but take advantage of the predicate push-down features that Parquet provides. How to reconcile the two?
For more information about combining these formats,