Apache Avro: A New Format for Data Interchange

Apache Avro is a recent addition to Apache’s Hadoop family of projects.  Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.

Background

We’d like data-driven applications to be dynamic: folks should be able to rapidly combine datasets from different sources.  We want to facilitate novel, innovative exploration of data.  Someone should, for example, ideally be able to easily correlate point-of-sale transactions, web site visits, and externally provided demographic data, without a lot of preparatory work.  This should be possible on-the-fly, using scripting and interactive tools.

Current data formats often don’t work well for this.  XML and JSON are expressive, but they’re big, and slow to process.  When you’re processing petabytes of data, size and speed matter a lot.

Google uses a system called Protocol Buffers to address this.  (There are other systems, like Thrift, similar to Protocol Buffers, that I won’t explicitly discuss here, but to which my comments about Protocol Buffers also apply.)  Google has made Protocol Buffers freely available, but it’s not ideal for our purposes.

Generic Data

With Protocol Buffers, one defines data structures, then generates code that can efficiently read and write them.  However, if one wishes, from a scripting language, to quickly implement an experiment with Protocol Buffer data, one must first: locate the data structure definition; generate code for it; and, finally, load that code before one can touch the data.  That might not be that bad, but if one wanted to have, for example, a generic tool that could browse any dataset, it would have to first locate definitions, then generate and load code for each such dataset.  This complicates something that should be simple.

Avro’s format instead always stores data structure definitions with the data, in an easy-to-process form.  Avro implementations can then use these definitions at runtime to present data to applications in a generic way, rather than requiring code generation.

Code generation in Avro is optional: in some programming languages it’s sometimes nice to use specific data structures that correspond to frequently serialized data types.  But in scripting systems like Hive and Pig, code generation would be an imposition, so Avro does not require it.

An additional advantage of storing the full data structure definition with the data is that it permits the data to be written faster and more compactly.  Protocol Buffers add annotations to the data so that it can still be processed even if the definition doesn’t exactly match the data, but these annotations make the data slightly larger and slower to process.  Benchmarks have shown that Avro data, which does not need such annotations, is smaller and faster to process than that of other serialization systems.

Avro Schemas

Avro uses JSON to define a data structure’s schema.  For example, a two-dimensional point might be defined as an Avro record:


{"type": "record", "name": "Point",
 "fields": [
  {"name": "x", "type": "int"},
  {"name": "y", "type": "int"},
 ]
}

Each instance of this record is serialized as simply two integers, with no additional per-record or per-field annotations.  Integers are written using a variable-length zig-zag encoding, so points with small positive and negative values can be written in as few as two bytes: 100 points might require just 200 bytes.
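
To make the size claim concrete, here is a minimal sketch of the zig-zag varint arithmetic in Python.  This is not the Avro library’s own code, just an illustration: signed values are first mapped to unsigned values so that small magnitudes, positive or negative, become small numbers, which are then written base-128 with a continuation bit.

def zigzag_varint(n):
    # Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    # (assumes n fits in a 64-bit long, as Avro's long type does)
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        if z > 0x7F:
            out.append((z & 0x7F) | 0x80)  # set the continuation bit
            z >>= 7
        else:
            out.append(z)
            return bytes(out)

# A point with small coordinates costs one byte per field, two bytes total:
print(len(zigzag_varint(3) + zigzag_varint(-2)))   # prints 2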

In addition to records and numeric types, Avro includes support for arrays, maps, enums, strings, and both variable- and fixed-length binary data.  It also defines a container file format intended to provide good support for MapReduce and other analytical frameworks.  For details, see the Avro specification.
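
As a sketch of what this looks like in practice (assuming the Avro Python library; the file name is invented, and depending on the release the schema-parsing call may be spelled avro.schema.parse or avro.schema.Parse), the Point records above can be written to and read from a container file generically, with no generated classes:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse the Point schema at runtime -- no code generation.
schema = avro.schema.parse("""
{"type": "record", "name": "Point",
 "fields": [
  {"name": "x", "type": "int"},
  {"name": "y", "type": "int"}
 ]
}
""")

# Write a container file of Point records as plain dictionaries.
writer = DataFileWriter(open("points.avro", "wb"), DatumWriter(), schema)
writer.append({"x": 1, "y": 2})
writer.append({"x": -3, "y": 7})
writer.close()

# Read the file back generically; the schema is stored in the file itself.
reader = DataFileReader(open("points.avro", "rb"), DatumReader())
for point in reader:
    print(point)
reader.close()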

Compatibility

Applications evolve, and as they evolve their data structures can change.  We’d like new versions of an application to still be able to process data created by old versions, and vice versa.  Avro handles this in much the same way as Protocol Buffers.  When an application expects a field that is not present in the data, Avro supplies the default value specified in the schema.  Conversely, Avro ignores fields that are present in the data but not expected by the application.  This doesn’t handle every compatibility issue, but it makes the most common ones easy to handle.
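
A minimal sketch of this in the Avro Python library (the added color field and its default are invented for the example, and again the schema-parsing call may be spelled parse or Parse depending on the release): data written with the original Point schema is read with a newer schema, and the missing field is filled in from its default.

import io
import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

old_schema = avro.schema.parse(
    '{"type": "record", "name": "Point", "fields": ['
    ' {"name": "x", "type": "int"}, {"name": "y", "type": "int"}]}')

new_schema = avro.schema.parse(
    '{"type": "record", "name": "Point", "fields": ['
    ' {"name": "x", "type": "int"}, {"name": "y", "type": "int"},'
    ' {"name": "color", "type": "string", "default": "black"}]}')

# Serialize a point with the old (writer's) schema.
buf = io.BytesIO()
DatumWriter(old_schema).write({"x": 1, "y": 2}, BinaryEncoder(buf))

# Read it with the new (reader's) schema; the missing field gets its default.
buf.seek(0)
point = DatumReader(old_schema, new_schema).read(BinaryDecoder(buf))
print(point)   # {'x': 1, 'y': 2, 'color': 'black'}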

RPC

Avro also lets one define Remote Procedure Call (RPC) protocols.  While the data types used in RPC are usually distinct from those in datasets, using a common serialization system is still useful.  Data-intensive applications require distributed, RPC-based frameworks, so everywhere that we need to process dataset files we also need to be able to use RPC.  Building both on a common base minimizes the chance that one could, e.g., write code that processes the data yet be unable to use a distributed framework to do so.
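
For illustration, an RPC protocol is declared in JSON much as a schema is; a hypothetical protocol with a single message that echoes the Point record above might look like this (the protocol and message names are invented for the example):

{"protocol": "PointService",
 "types": [
  {"type": "record", "name": "Point",
   "fields": [
    {"name": "x", "type": "int"},
    {"name": "y", "type": "int"}
   ]
  }
 ],
 "messages": {
  "echo": {
   "request": [{"name": "p", "type": "Point"}],
   "response": "Point"
  }
 }
}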

Integration with Hadoop

We’d like it to be easy to use Avro data in Hadoop’s MapReduce.  This is still a work in progress, tracked by the issues MAPREDUCE-1126 and MAPREDUCE-815.

Note that Avro data structures can specify their sort order, so complex data created in one programming language can be sorted by another.  Sorting is also possible without deserialization, and is thus quite fast.
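
The sort order is declared per field in the schema.  For example, a variant of the Point schema that sorts records by descending y and ignores x when comparing might look like this (a sketch; the "order" attribute accepts "ascending", "descending", or "ignore"):

{"type": "record", "name": "Point",
 "fields": [
  {"name": "x", "type": "int", "order": "ignore"},
  {"name": "y", "type": "int", "order": "descending"}
 ]
}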

We hope that Avro will replace Hadoop’s existing RPC.  Hadoop currently requires its clients and servers to run the exact same version of Hadoop.  We hope to use Avro to permit one to, e.g., have a single Hadoop application that can talk to multiple clusters running different versions of HDFS and/or MapReduce.

Finally, we hope that Avro will permit Hadoop applications to be more easily written in languages besides Java.  For example, once Hadoop’s built on Avro, we hope to support native MapReduce and HDFS clients in languages like Python, C and C++.

Talk from Hadoop World

Many of these topics were covered during my recent talk at Hadoop World, and we’re happy to release that video along with this blog post.

11 Responses
  • Patrick Hunt / November 02, 2009 / 10:11 AM

    Kick the tires and take Avro RPC for a spin: http://bit.ly/32T6Mk – Java and Python models available in the quickstart.

  • Christopher Smith / November 04, 2009 / 10:43 AM

    Protocol buffers and Thrift do encode a simple schema that can be used to parse their contents dynamically without any further data. Even if they didn’t, they both provide mechanisms for serializing the metadata, and Hadoop could use this for cases where it would be of advantage. Even ignoring those two, there are tons of other serialization libraries that do this as well (Hessian comes to mind immediately).

    Honestly, I have yet to see anything novel with Avro. The world doesn’t need another serialization mechanism. What it needs is for Hadoop to employ one or more of them.

  • duncan campbell / November 06, 2009 / 3:40 AM

    The ability to use something other than Java seems expedient: while it would be nice to have a universal language, there’s much that’s already been written in something else. I’d like to hear when you’ve a C interface ;)

    Thanks

  • Doug Cutting / November 06, 2009 / 11:08 AM

    Christopher, can you point me to documentation of how Protocol Buffers and Thrift encode the schema with the data in a way that every implementation can easily access it? They encode numeric field ids in the data, and Protocol Buffers encodes enough information to permit one to skip through data, but neither gives you full type information, including field names, record names, etc.

    One could of course serialize the PB or Thrift schema with the data, but then one would have to also write a schema parser for each language implementation and also write readers and writers for each that use this schema at runtime to process data. That effort is roughly equivalent to implementing Avro, except that these other schema languages are harder to parse than a JSON-based schema language, and their binary encodings include annotations that are redundant when the schema is available.

    Put another way: if one implemented a C library that could parse a Thrift/PB schema and then process data written with it, that library would share almost no code with the existing Thrift/PB C library. It might share the readers & writers of primitive types, but that’s about it. And it would be harder to implement than Avro, since JSON is easier to parse and has existing parsers. Plus the encoding would be slightly larger and slower. So, if nearly no code is shared, the implementation is more difficult, and the result is less efficient, what would be the substantial advantage of adding this to Thrift or PB rather than building it independently?

  • Doug Cutting / November 06, 2009 / 11:08 AM

    Duncan, The C version of Avro is making rapid progress.

  • James Norris / December 08, 2009 / 1:16 PM

    Hi Doug,
    “Christopher, can you point me to documentation of how Protocol Buffers and Thrift encode the schema with the data in a way that every implementation can easily access it? They encode numeric field ids in the data, and Protocol Buffers encodes enough information to permit one to skip through data, but neither give you full type information, including field names, record names, etc.”
    The Google protobuf library does have support for reflection and includes protobufs that describe protobuf schemas (type descriptors).
    http://code.google.com/p/protobuf/source/browse/trunk/src/google/protobuf/descriptor.proto
    This has the (potential) advantage of using the same format for both the metadata and the data, as opposed to the Avro approach of using JSON for the metadata and compact binary encodings for the data.
    In practice I think the metadata isn’t typically encoded alongside the data. Fields in encoded protobufs are identified by numeric tags rather than textual names, and the encoding format is designed to be self-describing (when it comes to structure) so that reading encoded data is possible even if the schema is unknown or if it has been altered since the data was written (i.e. it can adapt to version skew to some extent). That does incur some space overhead due to redundancy, as you mentioned; I don’t know how it affects the coding speed in practice.
    “Code generation in Avro is optional: it’s nice in some programming languages to sometimes use specific data structures, that correspond to frequently serialized data types. But, in scripting systems like Hive and Pig, code generation would be an imposition, so Avro does not require it.”
    I don’t believe it’s required for protobufs either, at least in the C implementation; there’s an “option optimize_for = SPEED;” directive that controls this.
    I’ll have to take a closer look at Avro when I have more time, but it looks like a promising approach. Thanks.

  • Doug Cutting / December 10, 2009 / 1:48 PM

    James, Thanks for providing the link to PB’s reflection support. I had not seen that before. I intend to similarly add the ability to write Avro schemas as Avro data, with a schema-for-schemas. This will be useful in Avro for when the schema varies frequently, since it will be smaller and faster to process than a textual schema.
