Cloudera Engineering Blog · Avro Posts
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.
This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.
The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by a MapReduce program. To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.
One might address this data interoperability in a variety of manners, including the following:
I recently evaluated several serialization frameworks including Thrift, Protocol Buffersand Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.
Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.
Apache Avro was added the to Hadoop family last April and last year there were three Avro releases: 1.0.0 in July, 1.1.0 in September and 1.2.0 in October. After the 1.2.0 release, Doug Cutting introduced Avro: a New Format for Data Interchange on this blog and the Avro team went right to work building the next release of Avro.
It’s a new year and there’s a new Avro: 1.3.0.
Apache Avro is a recent addition to Apache’s Hadoop family of projects. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.
We’d like data-driven applications to be dynamic: folks should be able to rapidly combine datasets from different sources. We want to facilitate novel, innovative exploration of data. Someone should, for example, ideally be able to easily correlate point-of-sale transactions, web site visits, and externally provided demographic data, without a lot of preparatory work. This should be possible on-the-fly, using scripting and interactive tools.