RecordBreaker: Automatic structure for your text-formatted data

This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.

RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Hadoop’s HDFS is often used to store large amounts of text-formatted data: log files, sensor readings, transaction histories, etc. Much of this data is “near-structured”: the data has a format that’s obvious to a human observer, but is not made explicit in the file itself.

Imagine you have a simple file listing, stored in listing.txt:

5 mjc staff 170 Mar 14 2011 14:14 bin
5 mjc staff 170 Mar 12 2011 05:13 build
1 mjc staff 11080 Mar 14 2011 14:14 build.xml

This “near-structured” data has metadata that is obvious to people familiar with file listings: a file owner, a file size, a last-modified date, etc. It stays easy for people even though certain strings, such as the date and time, cannot be parsed with simple whitespace breaks. In order for a user to process such data with MapReduce, Pig, or some similar tool, she must explicitly and laboriously reconstruct metadata that is obvious to anyone who just eyeballs the data.

Performing this reconstruction usually entails writing a parser or extractor, often one built on relatively brittle regular expressions. For some very common formats, writing a good parser is probably worthwhile. However, there are also album track listings, temperature readings, flight schedules, and many other kinds of data; the number of good parsers we need grows large, quickly. Writing all of these straightforward extractors, again and again, is a time-consuming and error-prone chore for everyone. We believe it is a major obstacle to faster and easier data analytics.
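To make the chore concrete, here is a minimal sketch of the kind of hand-written, regex-based extractor we have in mind; the pattern, class name, and field choices are made up purely to illustrate the listing.txt example above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A hand-rolled extractor for listing.txt. The regex is brittle: a
// different date format, or a file name column in another position,
// silently breaks the whole parse.
public class ListingParser {
   private static final Pattern LINE = Pattern.compile(
       "(\\d+) (\\S+) (\\S+) (\\d+) (\\w+ \\d+ \\d{4}) (\\d{2}:\\d{2}) (.+)");

   public static void main(String[] args) {
      Matcher m = LINE.matcher("5 mjc staff 170 Mar 14 2011 14:14 bin");
      if (m.matches()) {
         System.out.println("owner=" + m.group(2) + " size=" + m.group(4)
             + " name=" + m.group(7));
      }
   }
}

Multiply this by every near-structured file you encounter, and the cost adds up fast.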

The RecordBreaker project aims to automatically generate structure for text-embedded data. It consists of two main components.


LearnStructure

LearnStructure takes a text file as input and derives a parser that breaks lines of the file into typed fields. For example, the above file listing is broken into fields that include the file owner mjc, the group owner staff, etc. It emits all the schemas and code necessary to turn the raw text file into a file full of structured data. In particular, it discovers an Avro schema (rendered as JSON) for the above file listing that looks like this:

{
   "type" : "record",
   "name" : "record_1",
   "namespace" : "",
   "doc" : "RECORD",
   "fields" : [ {
      "name" : "base_0",
      "type" : "int",
      "doc" : "Example data: '5', '5', '1'"
   }, {
      "name" : "base_2",
      "type" : "string",
      "doc" : "Example data: 'mjc', 'mjc', 'mjc'"
   }, {
      "name" : "base_4",
      "type" : "string",
      "doc" : "Example data: 'staff', 'staff', 'staff'"
   }, {
      "name" : "base_6",
      "type" : "int",
      "doc" : "Example data: '170', '170', '11080'"
   }, {
      "name" : "base_8",
      "type" : {
         "type" : "record",
         "name" : "base_8",
         "doc" : "",
         "fields" : [ {
            "name" : "month",
            "type" : "int",
            "doc" : ""
         }, {
            "name" : "day",
            "type" : "int",
            "doc" : ""
         }, {
            "name" : "year",
            "type" : "int",
            "doc" : ""
         } ]
      },
      "doc" : "Example data: '(14, 3, 2011)', '(12, 3, 2011)', '(14, 3, 2011)'"
   }, {
      "name" : "base_10",
      "type" : {
         "type" : "record",
         "name" : "base_10",
         "doc" : "",
         "fields" : [ {
            "name" : "hrs",
            "type" : "int",
            "doc" : ""
         }, {
            "name" : "mins",
            "type" : "int",
            "doc" : ""
         }, {
            "name" : "secs",
            "type" : "int",
            "doc" : ""
         } ]
      },
      "doc" : "Example data: '(14, 14, 0)', '(5, 13, 0)', '(14, 14, 0)'"
   }, {
      "name" : "base_12",
      "type" : "string",
      "doc" : "Example data: 'bin', 'build', 'build.xml'"
   } ]
}

Of course, the field names here are nonsense. All of the fields, except for the subfields of the date and timestamp records, have nondescriptive, synthetically generated names. The LearnStructure step attempts to recover the type of each field, but has no way to know its name or role. Obtaining names for these fields is the job of the SchemaDictionary step; for now, we just live with these bad synthetic names.
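Even with synthetic names, the output is immediately usable. As a sketch, the following Java snippet reads the structured records back with Avro’s generic API; the file name listing.avro is our assumption here, and the field names are the synthetic ones from the schema above.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadListing {
   public static void main(String[] args) throws Exception {
      // listing.avro is a hypothetical Avro file produced from listing.txt.
      DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
          new File("listing.avro"), new GenericDatumReader<GenericRecord>());
      while (reader.hasNext()) {
         GenericRecord rec = reader.next();
         // Field names come from the derived schema shown above.
         System.out.println(rec.get("base_2") + " owns " + rec.get("base_12"));
      }
      reader.close();
   }
}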


SchemaDictionary

SchemaDictionary takes data that’s been parsed by LearnStructure and applies topic-specific labels. For example, mjc should ideally be labelled as owner or perhaps user. The parsed staff data should be labelled as group.

The SchemaDictionary tool matches the newly-parsed data against a known database of structures. It finds the closest match, then assigns human-understandable names based on the best-matching previously-observed dataset. For example, with the above data and a small set of known datasets, SchemaDictionary can find that base_10 should actually be timemodified, that base_8 should be datemodified, and so on. Depending on the input data and the known database of structures, this labelling may be more or less accurate.
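As a rough illustration of the matching idea (a toy, not RecordBreaker’s actual algorithm), imagine scoring a newly parsed schema against each known schema by how well their field types line up, then borrowing names from the best-scoring match:

import java.util.Arrays;
import java.util.List;

// Toy similarity between two flat schemas, each modeled as an ordered
// list of field types. A real matcher must also handle nested records,
// reordered fields, and value distributions.
public class ToyMatcher {
   static double similarity(List<String> a, List<String> b) {
      int n = Math.min(a.size(), b.size());
      int matches = 0;
      for (int i = 0; i < n; i++) {
         if (a.get(i).equals(b.get(i))) matches++;
      }
      return (double) matches / Math.max(a.size(), b.size());
   }

   public static void main(String[] args) {
      List<String> parsed = Arrays.asList("int", "string", "string", "int");
      List<String> known  = Arrays.asList("int", "string", "string", "int");
      // 1.0: a perfect match, so adopt the known schema's field names.
      System.out.println(similarity(parsed, known));
   }
}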

As mentioned, the target structured data format is Avro. Avro allows efficient cross-platform data serialization, similar to Thrift or Protocol Buffers. Data stored in Avro has many advantages (read Doug Cutting’s recent overview of Avro for more) and many tools either support Avro or will soon: Hadoop MapReduce, Apache Pig, and others.
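For readers new to Avro, here is a minimal, self-contained sketch of writing a record with the generic API; the two-field schema is a cut-down version of the one LearnStructure derived above, and the output file name example.avro is arbitrary.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteRow {
   public static void main(String[] args) throws Exception {
      // A cut-down two-field version of the derived schema shown earlier.
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"record_1\",\"fields\":["
          + "{\"name\":\"base_0\",\"type\":\"int\"},"
          + "{\"name\":\"base_2\",\"type\":\"string\"}]}");
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("base_0", 5);
      rec.put("base_2", "mjc");
      DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File("example.avro"));
      writer.append(rec);
      writer.close();
   }
}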


Related Work

Our work on the LearnStructure component draws inspiration from the PADS research project (http://www.padsproj.org/index.html), in particular the paper “From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data” by Fisher, Walker, Zhu, and White (POPL 2008). That paper itself draws on many papers in the area of information extraction and related fields. The authors have released code for their system, written in ML. ML is a great language, but it is not well-suited to our needs: it is not supported by Avro, and it is unlikely to appeal to many of the developers currently involved with the Hadoop ecosystem.

SchemaDictionary is more generally inspired by database schema mapping systems. (A famous example is described in “The Clio Project: Managing Heterogeneity” by Miller, Hernandez, Haas, Yan, Ho, Fagin, and Popa, published in SIGMOD Record 30(1), March 2001, pp. 78-83.) Schema mapping systems are usually designed to help database administrators merge existing databases; for example, when company A purchases company B and must then merge the employee lists. These tools are often expensive and expect a lot of administrator attention. In contrast, our SchemaDictionary is for busy data analysts who simply want to check out a novel dataset as quickly as possible. It is fast and lightweight, but it can only handle relatively simple structures (rendering it inappropriate for full databases, but on target for the kind of data that is popular in text-based formats).

Project

RecordBreaker works, but is not complete. It is just the start of what we hope will be many interesting applications and research projects. Please take a look at the code and documentation (the repo is at https://github.com/cloudera/RecordBreaker, and the tutorial is at http://cloudera.github.com/RecordBreaker/). Maybe you can pitch in and help.
