The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by a MapReduce program. To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.
One might address this data interoperability in a variety of manners, including the following:
- Each system might be extended to read all the formats generated by the other systems. In the limit, this approach is not practical, since one cannot easily anticipate all of the formats new systems might generate.
- A library of data conversion programs could be assembled. This would unfortunately add a processing step, to convert the data between formats, slowing processing pipelines. Note however that many data conversion libraries operate by converting data into and out of a lingua franca format, using a single format as a pivot point. This hints at a third possibility.
- Enable each system to read and write a common format. Some systems might use other formats internally for performance, but whenever data is meant to be accessible to other systems a common format is used.
In practice all of these strategies will used to some extent. However the last strategy, a common format, seems to offer the most efficient path both in terms of engineering effort and processing time. This article will focus on the use of Avro’s data file format as such a common format.
Apache Avro is a data serialization format. Avro shares many features with Google’s Protocol Buffers and Apache Thrift, including:
- Rich data types.
- Fast, compact serialization.
- Support for many programming languages.
- Datatype evolution, also known as versioning.
Avro additionally provides some other features that are especially useful when storing data, namely:
- Avro defines a standard file format. Avro data files are self-describing, containing the full schema for the data in the file. Thus users can exchange Avro data files without also having to separately communicate metadata. Once an Avro data file is written, one will always be able to read it, with full datatype information, without relying on any external software or metadata repository. Avro data files also support compression, using Gzip or Snappy codecs.
- Avro’s serialization is more compact. Avro avoids storing a field identifier with each field value. For some datasets this savings can be significant.
- Avro implementations permit one to dynamically define new datatypes and to easily process previously unseen datatypes, without generation and loading of code. This provides natural support for script and query languages.
- Avro datatypes can define their sort-order, facillitating use of Avro data in MapReduce or ordered key/value stores.
Avro as a Common Format
Most of the major ecosystem components already or will soon support reading and writing Avro data files:
- MapReduce: I added support for Java MapReduce programs, included in Avro 1.4 and greater.
- Streaming: Tom White from Cloudera has added support for Hadoop Streaming programs to Avro (AVRO-808 & AVRO-830).
- Flume 0.9.2 and above support collecting data in Avro’s format (FLUME-133), contributed by Jon Hsieh of Cloudera. Note also that Flume has recently been accepted into the Apache Incubator and will soon be known as Apache Flume.
- Sqoop 1.3 can import data as Avro data files in HDFS from a relational database (SQOOP-207), contributed by Tom White of Cloudera. Sqoop has also recently been accepted into the Apache Incubator.
- Pig release 0.9 will be able read and write Avro data files (PIG-1748), thanks to Lin Guo and Jakob Homan at LinkedIn.
- Hive support for reading and writing Avro data files has been posted by Jakob Homan of LinkedIn, and should hopefully be included in Hive 0.9 (HIVE-895).
- HCatalog input and output drivers have been contributed by Tom White of Cloudera (HCATALOG-49).
- Thiruvalluvan M. G. from Yahoo! is working on a column-major format for Avro, which would accelerate Hive and Pig queries (AVRO-806).
For folks who are currently using Protocol Buffers or Thrift to store data, some tools for conversion are planned:
- Raghu Angadi from Twitter is working on tools that will let folks read and write their Thrift-defined data structures as Avro format data (AVRO-804).
- We also hope to soon add tools to convert between Protocol Buffers and Avro (AVRO-805).
At Cloudera we’re committed to helping Avro become a common format for the Hadoop ecosystem. It’s great to see so many other companies and individuals also investing in Avro.