Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.
Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, the main problem being finding the right layout for good performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.
Auto-Detection of HDFS Block Size
For example, you may have seen this warning: Read <some-big-number> MB of data across network that was expected to be local. This problem usually bit you when trying to set the Parquet row-group size higher than the default 128MB in Apache Hive. It typically means that the row-group size was larger than the underlying HDFS blocks. Because a row group needs to be handled by a single worker, one that spans HDFS blocks can’t be read locally and will slow down a query by reading lots of data over the network.
The fix for this problem is easy: use an HDFS block size that is at least as large as your row-group size. But because these are two separate settings, it was easy to forget the HDFS block size and find out later that you had to rewrite the data. In CDH 5.5, we've added a fix that automatically detects this case and uses the right block size in Hive and any other Java-based Parquet writer.
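Before this fix, you could avoid the mismatch by setting both sizes yourself when writing a table from Hive. A sketch of what that looked like, assuming parquet-mr's `parquet.block.size` property (the row-group size) and HDFS's `dfs.block.size` property; the table name and query here are hypothetical:

```sql
-- Set both sizes to the same value, in bytes (256 MB here).
-- parquet.block.size: the Parquet row-group size used by the Java writer.
-- dfs.block.size: the HDFS block size for files written by this job.
SET parquet.block.size=268435456;
SET dfs.block.size=268435456;

CREATE TABLE sales_parquet STORED AS PARQUET
AS SELECT * FROM sales_staging;
```

With the CDH 5.5 fix, the writer detects the row-group size and picks a matching HDFS block size on its own, so this manual step is no longer needed.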
Improved Reading of Multi-Block Files
You might also recognize this warning message: Parquet files should not be split into multiple hdfs-blocks. That was an annoying indicator that although you had written data correctly, it wasn’t in an optimal layout for Impala performance. The inevitable next question is: How do you fix the data you’ve already written and how do you prevent this in the future? Unfortunately, until CDH 5.5, there wasn’t a good answer.
Apache Hadoop file formats, including Parquet, are designed to be split across multiple HDFS blocks. Starting a new file after a certain amount of data has been written isn't a feature of Hive or other writers.
To fix this problem, we decided to improve how Impala reads multi-block files. Starting in CDH 5.5, Impala reads each block where it is stored, not just the first block in a file. When you’re creating Parquet tables in Hive, you no longer have to worry about writing files that are limited to just one HDFS block. Write large files all you want, and Impala will do the right thing.
One of the reasons Impala preferred single-block files was to avoid reading data remotely. In addition to improving how Impala handles multi-block files, we also updated how Parquet data is written to eliminate unnecessary remote reads. In CDH 5.5, row groups are automatically aligned with the underlying HDFS blocks by padding the space at the end of a row group before the next HDFS block starts. This increases file size slightly, but keeps each row group contained in a single HDFS block.
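The alignment decision above can be sketched as a simple calculation. This is an illustrative sketch only, not the actual parquet-mr implementation: before starting a new row group, the writer checks how much space remains in the current HDFS block and pads only when the waste is small (the function name and the `max_padding` threshold are assumptions for illustration):

```python
def padding_before_next_row_group(offset, block_size, max_padding):
    """Return the number of padding bytes to write at the current file
    offset so the next row group starts at an HDFS block boundary.

    Pads only if the remaining space in the current block is small
    (at most max_padding); otherwise padding would waste too much space,
    and the row group simply starts at the current offset.
    """
    remaining = block_size - (offset % block_size)
    if remaining == block_size:
        return 0          # already at a block boundary, nothing to pad
    if remaining <= max_padding:
        return remaining  # fill out the block; next row group starts fresh
    return 0              # too much waste; accept a block-spanning row group
```

For example, with a 128 MB block and an 8 MB padding limit, a row group ending 1 KB before a block boundary gets 1 KB of padding, while one ending 64 MB before the boundary gets none.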
These changes make writing data optimized for Impala much easier by removing the headache of structuring your Parquet files for Impala. Instead, Impala works better with the data Hive writes.
More to Come
Stand by; we will continue to work on improving Impala and Parquet in future CDH releases.
Ryan Blue is a Software Engineer at Cloudera, a Parquet committer/member of the Parquet PMC, and an Apache Avro committer.