Apache Avro at RichRelevance

Categories: Avro Community Guest

This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.

In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these were sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.

After analysis similar to Doug Cutting’s blog post, we chose Apache Avro. As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs. On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.

Avoiding Version Management Baggage

Have you ever seen code for manual serialization version management like the below?

Manual version management is painful. If you evolve what data you store continuously, it does not take long to end up with dozens of versions. In order to read every version that has been written and stored, your code has to carry a lot of baggage.

With Avro, you can avoid writing code like the above. The concept is simple. Store the schema used to write your data along with your data, and use it to make the written data conform to the schema that the reader expects. If a field is missing, use the default. If has been removed or moved, handle it.

Over the last two years, we have doubled the complexity of our our page view schema across about 15 schema versions. There is not one line of code that deals with version management, and the current code can read any of the data written over that time. What may be a surprise to some is that our old code can read newly written data as well. The data is both forward and backward compatible, within the rules in the Avro Specification.

Leveraging Complex Data Types

Avro supports complex data types such as arrays, maps, enumerations, and nested records. The Avro data model makes it possible to serialize any non-recursive data structure, including trees and heterogeneous lists. We use this property to describe our events using Avro Schemas that map to natural object representations on our front end servers. For example, one of the elements in a page view is an array of product recommendation sets, each set containing a list of products displayed. Another element in a page view is what we call a page context – each type of page on a merchant’s site has a unique context that differs from other page types. A product page context is the product being displayed. A search page context is the search terms in the search query. There are about 30 page context types, and we represent the range of page context possibilities using an Avro union, so that all of these different event variations can be written in the same format and to the same log.

With a simpler data model, one might have had to log each context type separately, making it harder to get a full picture of what happened in a single request during analysis.

A New Vision for an Event Log

With the above properties of Avro, we were able to formulate a new vision for what an event log should be. The new model has the following properties:

  • A singe HTTP request creates a single, atomic log event defined by an Avro schema
  • The event contains all of the resolved request inputs
  • The event contains the result of any decisions made during the request

Together, these imply that it is never necessary to join different sets of data together to reconstruct what happened in an individual request during analysis. This also significantly reduces the value of data contained in raw HTTP logs, since the Avro based logs become the origin for all major processing. Since raw HTTP logs are significantly larger than compressed binary structured data, this significantly reduces the size of data we must keep for long periods of time.

More Avro at RichRelevance

We have built Hive and Pig adapters to map our Avro data into these tools for ad-hoc queries and automated tasks. Additionally, we leverage the same Avro schemas from our log files to store click streams in HBase. We also use Avro to store data compactly in key-value stores.

The log file example is what I call a schema first use case of Avro, where we define a schema for log events that can be used across different systems over a long period of time. An alternative usage style is what I call code first, where you start with code and bind that to a serialization with a less schema-centric view. I feel that the code first usage style is more applicable for data that lives for short or medium time scales, such as with RPC or MapReduce intermediates. We will be deepening our investment in Avro and using it with code first use cases in the future, in the process working with the community to improve the developer experience for those use cases.

Avro is a growing, evolving project that I see as more broad than a serialization framework. At heart Avro is about applying a schema to data, in order to manipulate that data in well defined ways. Serialization, validation, and transformation are only some of the operations you can apply to data that conforms to a known schema. Over time the project will grow to have more and more functionality centered around operations you can apply to data that conforms to an Avro Schema. I look forward to working with the Avro community as the project continues to evolve!