Cloudera Blog · Avro Posts
RecordBreaker: Automatic structure for your text-formatted data
- by Michael Cafarella
- July 13, 2011
This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, leaving more time for the analysis itself.
Hadoop’s HDFS is often used to store large amounts of text-formatted data: log files, sensor readings, transaction histories, etc. Much of this data is “near-structured”: the data has a format that’s obvious to a human observer, but is not made explicit in the file itself.
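To make “near-structured” concrete, here is a minimal sketch of guessing a record layout from a few sample lines. It is not RecordBreaker’s actual algorithm: the sample log lines, the `infer_type` and `infer_schema` helpers, and the `field_N` names are all invented for illustration, and the sketch assumes every line has the same number of whitespace-separated fields. The result is a plain dictionary shaped like an Avro record schema.

```python
import json
import re


def infer_type(token):
    """Guess an Avro primitive type name for one whitespace-delimited token."""
    if re.fullmatch(r"-?\d+", token):
        return "long"
    if re.fullmatch(r"-?\d+\.\d+", token):
        return "double"
    return "string"


def infer_schema(lines, record_name="LogLine"):
    """Build an Avro-style record schema (as a dict) from sample lines.

    Assumption: every non-empty line shares one whitespace-separated layout.
    """
    rows = [line.split() for line in lines if line.strip()]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("lines do not share one field layout")

    fields = []
    for i in range(width):
        # Collect the guessed type of column i across all sample rows.
        types = {infer_type(row[i]) for row in rows}
        if types == {"long"}:
            column_type = "long"
        elif types <= {"long", "double"}:
            column_type = "double"  # mixed integers and decimals widen to double
        else:
            column_type = "string"  # anything else stays text
        fields.append({"name": f"field_{i}", "type": column_type})
    return {"type": "record", "name": record_name, "fields": fields}


# Hypothetical web-server-style log lines: date, host, status code, latency.
sample = [
    "2011-07-13 host42 200 0.153",
    "2011-07-13 host07 404 1.020",
]
schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

A real tool has to cope with far messier input (optional fields, mixed delimiters, multiple record types in one file), which is exactly the work this kind of automation is meant to take off the analyst’s hands.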
Data Interoperability with Apache Avro
The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by a MapReduce program. To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.
Data Interoperability
One might address this data interoperability in a variety of ways, including the following:
Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB
This is a guest repost from the DataXu blog.
I recently evaluated several serialization frameworks, including Thrift, Protocol Buffers, and Avro, both as a solution for our own needs as a demand-side platform and as a protocol framework for the OpenRTB marketplace. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages, including simplicity and ubiquity of support. Many OpenRTB contributors requested that we support at least one binary standard as well, to reduce bandwidth usage and CPU processing time for real-time bidding at scale.
After reviewing many candidates, Apache Avro proved to be the best solution.
Tracing with Apache Avro
Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.
In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC functionality.
Apache Avro 1.3.0
Apache Avro was added to the Hadoop family last April, and last year there were three Avro releases: 1.0.0 in July, 1.1.0 in September, and 1.2.0 in October. After the 1.2.0 release, Doug Cutting introduced Avro on this blog in the post “Apache Avro: a New Format for Data Interchange”, and the Avro team went right to work building the next release of Avro.
It’s a new year and there’s a new Avro: 1.3.0.
Starting with Avro 1.3.0, the Avro team is releasing packages specially tailored to consumers of each language. For example, Python users can download an egg, Java users can manage jars using Maven and C/C++ users can grab an autotools package ready to `./configure; make`. Speaking of languages, we’re thrilled to announce that there’s a Ruby implementation for Avro now!
Apache Avro: a New Format for Data Interchange
Apache Avro is a recent addition to Apache’s Hadoop family of projects. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.
Background
We’d like data-driven applications to be dynamic: folks should be able to rapidly combine datasets from different sources. We want to facilitate novel, innovative exploration of data. Someone should, for example, ideally be able to easily correlate point-of-sale transactions, web site visits, and externally provided demographic data, without a lot of preparatory work. This should be possible on-the-fly, using scripting and interactive tools.
Current data formats often don’t work well for this. XML and JSON are expressive, but they’re big, and slow to process. When you’re processing petabytes of data, size and speed matter a lot.
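As a rough illustration of the size difference, the sketch below encodes one small record two ways: as JSON text, and using the rules the Avro specification gives for binary-encoding longs (zig-zag, then variable-length bytes) and strings (a length followed by UTF-8 bytes). This is a hand-written approximation, not the Avro library. The point-of-sale record and its field names are invented for illustration, and the sketch assumes the schema is stored separately, so field names are not repeated in each record.

```python
import json


def zigzag_varint(n):
    """Encode a signed 64-bit integer as Avro encodes int/long values:
    zig-zag mapping, then 7 bits per byte, least-significant group first."""
    z = (n << 1) ^ (n >> 63)  # maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as a long byte count followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


# Hypothetical point-of-sale record; the schema (field names and types)
# is assumed to live elsewhere, so only the values are written, in order.
record = {"store_id": 17, "item": "widget", "quantity": 3}

binary = (
    zigzag_varint(record["store_id"])
    + encode_string(record["item"])
    + zigzag_varint(record["quantity"])
)
text = json.dumps(record).encode("utf-8")

print(len(text), "bytes as JSON")
print(len(binary), "bytes in the binary sketch")
```

Most of the saving comes from not repeating field names and punctuation in every record, which adds up quickly across billions of records.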