Category Archives: Parquet

Using Impala at Scale at Allstate

Categories: Guest Hive Impala Parquet Use Case

Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.

It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta,

Read more

Using Apache Hadoop and Impala with MySQL for Data Analysis

Categories: Guest Hardware Impala Parquet

Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format),

Read more

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Categories: Hive How-to Impala MapReduce Parquet Pig

The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
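
The columnar idea in the bullet above can be sketched in a few lines of plain Python. This is a toy illustration only, not how Parquet or Impala store or scan data: in-memory lists stand in for on-disk layouts, and the sample records and helper names (`to_columnar`, `values_touched_row_layout`, `values_touched_columnar`) are hypothetical.

```python
# Toy illustration only: plain Python lists stand in for on-disk layouts.
# The records and helper names below are hypothetical.
rows = [
    {"id": 1, "state": "IL", "premium": 120.0},
    {"id": 2, "state": "TX", "premium": 95.5},
    {"id": 3, "state": "IL", "premium": 210.25},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(records):
    """In a row layout, scanning one column still reads every field of every row."""
    return sum(len(record) for record in records)


def values_touched_columnar(columns, name):
    """In a columnar layout, scanning one column reads only that column's values."""
    return len(columns[name])


columns = to_columnar(rows)

# Aggregate over a single column without looking at "id" or "state".
total_premium = sum(columns["premium"])

print(total_premium)
print(values_touched_row_layout(rows), "values read in the row layout")
print(values_touched_columnar(columns, "premium"), "values read in the columnar layout")
```

With three columns, the columnar scan here touches a third of the values; the gap widens as a table gains columns that the query never references.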

Read more

Native Parquet Support Comes to Apache Hive

Categories: Hive Impala Parquet

Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently, using the approach described in Google’s original Dremel paper. (For more technical details on the Parquet format, read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)

Before discussing the Parquet Hive integration,

Read more

Impala Performance Update: Now Reaching DBMS-Class Speed

Categories: General Hive Impala Parquet

Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!

Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:

  • Impala is being widely adopted.

Read more