Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

This is a guest repost from the DataXu blog.

I recently evaluated several serialization frameworks, including Thrift, Protocol Buffers, and Avro, looking for a solution to address our needs as a demand-side platform, and also for a protocol framework to use for the OpenRTB marketplace. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages, including simplicity and ubiquity of support. Many OpenRTB contributors requested that we also support at least one binary standard, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

After reviewing many candidates, Apache Avro proved to be the best solution.

Apache Avro

To demonstrate what differentiates Avro from the other frameworks (the link to my source code is at the end of this post), I put together a quick test of key features. The following are the key advantages of Avro 1.5:

* Schema evolution – Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields.

* Untagged data – Providing a schema with binary data allows each datum to be written without per-field overhead. The result is more compact data encoding and faster data processing.

* Dynamic typing – This refers to serialization and deserialization without code generation. It complements code generation, which Avro also offers for statically typed languages as an optional optimization.

Schema Evolution

This is the most exciting feature! It allows for building more loosely coupled and more robust systems. Below, I made significant changes to the schema, and things still work fine. This flexibility is a very interesting feature for rapidly evolving protocols like OpenRTB.

The following example demonstrates how this works. First, I created a new (example) schema. (Avro schemas are defined in JSON):

{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee","null"]}
    ]
}
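As a rough sketch of what the writing side could look like with Avro's generic API (this is my own illustration, not code from my test repository; the file names employee-v1.avsc and employees.avro and the sample values are just placeholders):

import java.io.File;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEmployees {
    public static void main(String[] args) throws Exception {
        // Parse the writer schema shown above (stored here as employee-v1.avsc).
        Schema schema = new Schema.Parser().parse(new File("employee-v1.avsc"));

        // Build a record dynamically; no generated Employee class is required.
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("name", "Jane Doe");
        employee.put("age", 30);
        employee.put("emails", Arrays.asList("jane@example.com"));
        employee.put("boss", null); // the ["Employee","null"] union allows a null boss

        // Write to an Avro container file; the schema is embedded in the file header.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("employees.avro"));
        writer.append(employee);
        writer.close();
    }
}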

Next, I serialized a few records into a binary file using that schema (roughly as sketched above). After that, I evolved my schema to the following:

{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "yrs", "type": "int", "aliases": ["age"]},
        {"name": "gender", "type": "string", "default":"unknown"},
        {"name": "emails", "type": {"type": "array", "items": "string"}}
    ]
}

This is a snapshot of the changes I made to the schema:

1) Renamed the field ‘age’ to ‘yrs’. Thanks to the alias feature, I can retrieve the value of ‘age’ by using the field name ‘yrs’.

2) Added a new ‘gender’ field, and defined a default value for it. The default is applied during deserialization, since this field isn’t present in the records written with the original schema.

3) Removed the ‘boss’ field.

Finally, I deserialized the binary data file with this new schema, and printed it out. Success!
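The reading side might look something like the sketch below (again my own illustration rather than the original code; I assume the evolved schema is stored as employee-v2.avsc, while the writer schema is picked up automatically from the data file's header):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadEmployees {
    public static void main(String[] args) throws Exception {
        // The evolved (reader) schema; the writer schema comes from the data file itself.
        Schema readerSchema = new Schema.Parser().parse(new File("employee-v2.avsc"));

        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            new File("employees.avro"),
            new GenericDatumReader<GenericRecord>(null, readerSchema));

        while (reader.hasNext()) {
            GenericRecord employee = reader.next();
            System.out.println(employee.get("yrs"));    // resolved from the old "age" field via the alias
            System.out.println(employee.get("gender")); // filled in from the declared default value
        }
        reader.close();
    }
}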

Untagged Data

There are two ways to encode data when serializing with Avro: binary or JSON. In the binary file, the schema is included at the beginning of the file. I verified that the binary data was serialized untagged, which resulted in a smaller footprint. Another interesting point is that you can define a schema and then encode/decode the data as JSON, which effectively gives you a schema for rich JSON data structures. Anyone needing to implement validation for a JSON protocol (like we did for OpenRTB) will appreciate this feature. And switching between binary and JSON encoding is simply a one-line code change. Switching a JSON protocol to a binary format in order to achieve better performance is pretty straightforward with Avro.
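To give a feel for that one-line switch, here is a small sketch of my own (reusing a schema and a generic record built as in the earlier example; the helper method and its parameters are purely illustrative):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class EncodeEmployee {
    // Serializes a record with either the compact binary encoding or the JSON encoding.
    static byte[] encode(Schema schema, GenericRecord record, boolean asJson) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Choosing the encoder is the only line that differs between the two formats.
        Encoder encoder = asJson
            ? EncoderFactory.get().jsonEncoder(schema, out)
            : EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}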

Dynamic Typing

The key abstraction is GenericData.Record. This is essentially a set of name-value pairs where name is the field name, and value is one of the Avro supported value types. I found the dynamic typing to be very easy to use. When a generic record is instantiated, you have to provide a JSON-encoded schema definition. To access the fields, just use put/get methods like you would with any map. This approach is referred to as “generic” in Avro, in contrast to the “static” code generation approach also supported by Avro. The extra flexibility of the generic data handling has performance implications. But, this excellent benchmark – https://github.com/eishay/jvm-serializers/wiki/ – shows the penalty is minor, and the benefit is a simplified code base.
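For example, a record can be built and inspected from nothing more than a schema string; the following trivial sketch of mine uses an abbreviated two-field schema purely for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DynamicTyping {
    public static void main(String[] args) {
        // The JSON schema string is all that is needed; no generated Employee class.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Field access works like a map: put/get by field name.
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Ada");
        record.put("age", 36);
        System.out.println(record.get("name") + " is " + record.get("age"));
    }
}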

In conclusion, Avro is a unique serialization framework that delivers on its promises, although it took a bit of experimentation to get my code working. If you are interested in my Java code for an example of how Avro can be used, you can find it here: https://github.com/rfoldes/Avro-Test.

Robert Foldes

Senior Architect, DataXu


1 Response
  • Cowtowncoder / April 19, 2012 / 9:16 PM

    Have you actually tested the performance difference? Modern JSON parsers (Jackson) are very efficient, and it is not necessarily a given that Avro is any faster. And unlike Avro, JSON does not require a schema (although JSON Schema exists if that is something desirable).
    Even better, one can use data binding to POJOs; and data evolution can be dealt with by the usual rules of adding and removing properties (or with explicit object-level handling via getter/setter methods).
    Tree-based handling is of course one alternative (similar to GenericData).

    So I guess my suggestion would be to first figure out the limits (if any) of the current approach: there is a lot of hype around binary formats, and the added cost of much more rigid changes and schemas is not always offset by actual realized performance improvements.
