Cloudera Engineering Blog · Avro Posts

RecordBreaker: Automatic structure for your text-formatted data

This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.

RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, leaving more time for the analysis itself.

Data Interoperability with Apache Avro

The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by a MapReduce program. To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.

Data Interoperability

One might address this data interoperability in a variety of manners, including the following:

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

This is a guest repost from the DataXu blog.

I recently evaluated several serialization frameworks, including Thrift, Protocol Buffers, and Avro, both as a solution for our own needs as a demand-side platform and as a protocol framework for the OpenRTB marketplace. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages, including simplicity and ubiquity of support. Many OpenRTB contributors requested that we support at least one binary standard as well, to reduce bandwidth usage and CPU processing time for real-time bidding at scale.
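To make the bandwidth argument concrete, here is a minimal sketch in Python that compares compact JSON with a hand-rolled version of Avro's binary encoding for a single record. It relies on three rules from the Avro specification: int and long values are written as zig-zag variable-length integers, strings are written as a length followed by UTF-8 bytes, and a record is simply its fields concatenated in schema order. The record and its field names (id, price_micros, width, height) are hypothetical and are not taken from OpenRTB; a real application would use an Avro library rather than this illustration.

```python
import json


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as a zig-zag varint byte length plus UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_bid(bid: dict) -> bytes:
    """Concatenate the fields in a fixed order; no field names are written.

    The field order stands in for a schema (hypothetical, for illustration).
    """
    return (
        encode_string(bid["id"])
        + zigzag_varint(bid["price_micros"])
        + zigzag_varint(bid["width"])
        + zigzag_varint(bid["height"])
    )


# Hypothetical record, not an actual OpenRTB message.
bid = {"id": "abc123", "price_micros": 1250000, "width": 300, "height": 250}

json_bytes = json.dumps(bid, separators=(",", ":")).encode("utf-8")
binary_bytes = encode_bid(bid)

print("JSON bytes:  ", len(json_bytes))
print("Binary bytes:", len(binary_bytes))
```

Most of the saving comes from leaving out the field names: the reader is expected to know the schema, so only the values travel over the wire.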

Tracing with Apache Avro

Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.


Apache Avro 1.3.0

Apache Avro was added to the Hadoop family last April, and last year there were three Avro releases: 1.0.0 in July, 1.1.0 in September, and 1.2.0 in October. After the 1.2.0 release, Doug Cutting introduced "Avro: a New Format for Data Interchange" on this blog, and the Avro team went right to work building the next release of Avro.

It’s a new year and there’s a new Avro: 1.3.0.

Apache Avro: a New Format for Data Interchange

Apache Avro is a recent addition to Apache’s Hadoop family of projects.  Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.


We’d like data-driven applications to be dynamic: folks should be able to rapidly combine datasets from different sources.  We want to facilitate novel, innovative exploration of data.  Someone should, for example, ideally be able to easily correlate point-of-sale transactions, web site visits, and externally provided demographic data, without a lot of preparatory work.  This should be possible on-the-fly, using scripting and interactive tools.
