Converting Apache Avro Data to Parquet Format in Apache Hadoop


Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.

A customer of mine wants the best of both worlds: to keep working with his existing Apache Avro data, with all of the advantages that it confers, while also benefiting from the predicate push-down that Parquet provides. How to reconcile the two?

For more information about combining these formats, see this.

For a quick recap on Avro, see my previous post. While you are at it, see why Apache Avro is currently the gold standard in the industry.

What we are going to demonstrate here: how to use existing tools to convert our existing Avro data into Apache Parquet (incubating at the time of this writing), and make sure we can query the transformed data.

Parquet Data

First, let’s try to convert text data to Parquet, and read it back. Fortunately, there is already some code from Cloudera to do this in MapReduce.

Cloudera’s example code (https://github.com/cloudera/parquet-examples) and its accompanying documentation let you read and write Parquet data. Let’s try this.

First, let’s create some Parquet data as input. We will use Hive for this, by directly converting Text data into Parquet.

Parquet Conversion

    1. Let’s create a csv data example, and create a text table (here, just 2 columns of integers) in HDFS pointing to it:
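       A minimal sketch of this step; the file name (data.csv), HDFS path, and table name (csv_table) are illustrative assumptions:

           $ cat data.csv
           1,2
           3,4
           5,6
           $ hdfs dfs -mkdir -p /user/hive/csv_input
           $ hdfs dfs -put data.csv /user/hive/csv_input/

           hive> CREATE EXTERNAL TABLE csv_table (a INT, b INT)
               > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
               > STORED AS TEXTFILE
               > LOCATION '/user/hive/csv_input';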

       

    2. Create a Parquet table in Hive, and convert the data to it:
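       Along the same lines, a sketch with an assumed table name (parquet_table), using Hive’s STORED AS PARQUET syntax:

           hive> CREATE TABLE parquet_table (a INT, b INT) STORED AS PARQUET;
           hive> INSERT OVERWRITE TABLE parquet_table SELECT * FROM csv_table;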

    3. You will need to add the Hadoop and Parquet libraries relevant to the project in, say, Eclipse for the code to build, so all of the links to the proper libs need to be added. We then export the code as a JAR (File->Export->Runnable JAR file) and run it outside of Eclipse (otherwise, some Hadoop security issues ensue that prevent you from running the code).
    4. Run the program (you could also run it with java instead of hadoop if you copy the data from HDFS to local disk). The arguments are: input data as Parquet / output data as csv. We just want to ensure that we can read the Parquet data and display it.
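       The invocation might look like the following; the jar name, driver class, and paths are assumptions (check the parquet-examples README for the actual class name):

           $ hadoop jar parquet-examples.jar TestReadParquet \
                 /user/hive/warehouse/parquet_table /tmp/parquet_as_csv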

      See the result (csv file):

Avro Data Conversion

Avro Data Example

Let’s get an Avro data example working, based on this post.

Avro Data Generation

Interestingly, Hive doesn’t let you load/convert csv data into Avro the way we did in the Parquet example.

Let’s walk through an example of creating an Avro schema and generating some data. Let’s use this example, with this twitter.avsc schema:
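The schema itself is not reproduced in this copy of the post; a representative twitter.avsc along these lines (a record with username, tweet, and timestamp fields; the record name and namespace are just placeholders) would do:

    {
      "type": "record",
      "name": "Tweet",
      "namespace": "com.example.avro",
      "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"}
      ]
    }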

and some data in twitter.json:
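Again, the original sample is not reproduced here; records matching the schema above (one JSON object per line, with made-up values) would look like:

    {"username": "alice", "tweet": "Avro and Parquet play nicely together", "timestamp": 1428588000}
    {"username": "bob", "tweet": "Columnar storage FTW", "timestamp": 1428588060}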

We will convert the data (in JSON) into binary Avro format.
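One way to do this is with the avro-tools jar (the version number and HDFS path here are just examples):

    $ java -jar avro-tools-1.7.7.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
    $ hdfs dfs -mkdir -p /user/hive/avro_input
    $ hdfs dfs -put twitter.avro /user/hive/avro_input/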

Transformation from Avro to Parquet Storage Format

So, essentially, we get the best of both worlds: we keep Avro’s object model and schema, and combine them with Parquet’s columnar storage format.

First we will reuse our Avro data that was created earlier.

  1. We will then take advantage of this code to convert the Avro data to Parquet data. This is a map-only job that simply sets up the right input and output format according to what we want.
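     The linked code is not included in this copy of the post; below is a sketch of what such a map-only job can look like, pairing Avro’s MapReduce input format with the (pre-Apache, parquet.avro package) AvroParquetOutputFormat. The class name, schema file location, and package choices are assumptions:

     import java.io.File;
     import java.io.IOException;

     import org.apache.avro.Schema;
     import org.apache.avro.generic.GenericRecord;
     import org.apache.avro.mapred.AvroKey;
     import org.apache.avro.mapreduce.AvroJob;
     import org.apache.avro.mapreduce.AvroKeyInputFormat;
     import org.apache.hadoop.conf.Configured;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.NullWritable;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.util.Tool;
     import org.apache.hadoop.util.ToolRunner;

     import parquet.avro.AvroParquetOutputFormat;

     public class Avro2Parquet extends Configured implements Tool {

       // Identity mapper: each Avro record is handed to the Parquet output format as-is.
       public static class ConvertMapper
           extends Mapper<AvroKey<GenericRecord>, NullWritable, Void, GenericRecord> {
         @Override
         protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
             throws IOException, InterruptedException {
           context.write(null, key.datum());
         }
       }

       @Override
       public int run(String[] args) throws Exception {
         // args[0] = input path (Avro), args[1] = output path (Parquet);
         // the Avro schema is read from a local twitter.avsc file (an assumption).
         Schema schema = new Schema.Parser().parse(new File("twitter.avsc"));

         Job job = Job.getInstance(getConf(), "Avro to Parquet");
         job.setJarByClass(Avro2Parquet.class);

         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         job.setInputFormatClass(AvroKeyInputFormat.class);
         AvroJob.setInputKeySchema(job, schema);

         job.setMapperClass(ConvertMapper.class);
         job.setNumReduceTasks(0);                        // map-only job

         job.setOutputFormatClass(AvroParquetOutputFormat.class);
         AvroParquetOutputFormat.setSchema(job, schema);  // Parquet schema is derived from the Avro one

         return job.waitForCompletion(true) ? 0 : 1;
       }

       public static void main(String[] args) throws Exception {
         System.exit(ToolRunner.run(new Avro2Parquet(), args));
       }
     }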
  2. After compilation, let’s run the script on our existing Avro data:
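     The run could look like this (jar name, class name, and HDFS paths carry over from the assumptions above):

         $ hadoop jar avro2parquet.jar Avro2Parquet /user/hive/avro_input /user/hive/parquet_output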

     

    We get:

    Note that the Avro schema is converted directly to a Parquet-compatible format.

     

  3. Now let’s test our result in Hive. We first create a Parquet table (note the simple syntax in Hive 0.14+), then point to the data we just created via a LOAD command, and finally query our converted data directly.
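    A sketch of those three steps, assuming the column names from the twitter schema above and a hypothetical output file name:

        hive> CREATE TABLE tweets_parquet (username STRING, tweet STRING, `timestamp` BIGINT)
            > STORED AS PARQUET;
        hive> LOAD DATA INPATH '/user/hive/parquet_output/part-m-00000.parquet'
            > INTO TABLE tweets_parquet;
        hive> SELECT username, tweet FROM tweets_parquet LIMIT 10;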

    Parquet with Avro

    Let’s verify our Parquet schema now that it is converted; note that the schema still refers to Avro:
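    One way to inspect it, assuming the parquet-tools utility is available, is to dump the file schema and footer metadata; with the parquet-avro writer of that era, the footer carries the original Avro schema under an avro.schema key:

        $ parquet-tools schema /user/hive/parquet_output/part-m-00000.parquet
        $ parquet-tools meta /user/hive/parquet_output/part-m-00000.parquet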

That concludes our exercise! Let me know if you have additional questions.

 
