Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop

by Wolfgang Hoschek

Posted in Technical | July 11, 2013 9 min read

This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.

Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.

Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.

Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.

The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.

Since the launch of Cloudera Search, Morphlines development has graduated into the Kite SDK (formerly Cloudera Development Kit) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The Kite SDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The SDK is hosted on GitHub and encourages involvement by the community.

Currently, Morphlines works with multiple indexing workloads, but could easily be embedded into Crunch, HBase, Impala, Pig, Hive, or Sqoop. Your feedback is not only welcome but crucial, so please let us know where you see more opportunities for integration going forward!

Later in this post, we’ll examine a simple example use case in which we collect syslog output and make it available for searching. Before we go into the details of the problem and solution, though, let’s take a look at the Morphlines data model and processing model.

Processing Model

Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline is an efficient way to consume records (e.g. Flume events, HDFS files, RDBMS tables, or Apache Avro objects), turn them into a stream of records, and pipe the stream of records through a set of easily configurable transformations on the way to a target application such as Solr, for example as outlined in the following figure:

In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.

The Morphlines framework ships with a set of frequently used high-level transformation and I/O commands that can be combined in application specific ways. The plugin system allows the adding of new transformations and I/O commands and integrates existing functionality and third-party systems in a straightforward manner.

This integration enables rapid Hadoop ETL application prototyping, complex stream and event processing in real time, flexible log file analysis, integration of multiple heterogeneous input schemas and file formats, as well as reuse of ETL logic building blocks across Hadoop ETL applications.

The Kite SDK ships an efficient runtime that compiles a morphline on the fly. The runtime executes all commands of a given morphline in the same thread. Piping a record from one command to another implies just a cheap Java method call. In particular, there are no queues, no handoffs among threads, no context switches, and no serialization between commands, which minimizes performance overhead.

Data Model

Morphlines manipulate continuous or arbitrarily large streams of records. A command transforms a record into zero or more records. The data model can be described as follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can be any Java Object. That is, a record is essentially a hash table where each hash table entry contains a String key and a list of Java Objects as values. Note that a field can have multiple values and any two records need not use common field names. This flexible data model corresponds exactly to the characteristics of the Solr/Lucene data model.

Not only structured data, but also binary data, can be passed into and processed by a morphline. By convention, a record can contain an optional field named _attachment_body, which can be a Java java.io.InputStream or Java byte[]. Optionally, such binary input data can be characterized in more detail by setting the fields named _attachment_mimetype (such as “application/pdf”) and _attachment_charset (such as “UTF-8”) and _attachment_name (such as “cars.pdf”), which assists in detecting and parsing the data type. (This is similar to the way email works.)

This generic data model is useful to support a wide range of applications. For example, the Apache Flume Morphline Solr Sink embeds the morphline library and executes a morphline to convert flume events into morphline records and load them into Solr. This sink fills the body of the Flume event into the _attachment_body field of the morphline record, as well as copies the headers of the Flume event into record fields of the same name. As another example, the Mappers of the MapReduceIndexerTool fill the Java java.io.InputStream referring to the currently processed HDFS file into the _attachment_body field of the morphline record. The Mappers of the MapReduceIndexerTool also fill metadata about the HDFS file into record fields, such as the file’s name, path, size, last modified time, etc. This way a morphline can act on all data received from Flume and HDFS.

Use Cases

Commands can access all record fields. For example, commands can parse fields, add fields, remove fields, rename fields, find and replace values of existing fields, split a field into multiple fields, split a field into multiple values, or drop records. Often, regular expression-based pattern matching is used as part of the process of acting on fields. The output records of a command are passed to the next command in the chain. A command has a Boolean return code, indicating success or failure.

For example, consider the case of a multi-line input record: A command could take this multi-line input record and divide the single record into multiple output records, one for each line. This output could then later be further divided using regular expression commands, splitting each single line record out into multiple fields in application specific ways.

A command can extract, clean, transform, join, integrate, enrich and decorate records in many other ways. For example, a command might join records with external data sources such as relational databases, key-value stores, local files or IP Geo lookup tables. It might also perform tasks such as DNS resolution, expand shortened URLs, fetch linked metadata from social networks, perform sentiment analysis and annotate the record accordingly, continuously maintain statistics for analytics over sliding windows, or compute exact or approximate distinct values and quantiles.

A command can also consume records and pass them to external systems. For example, a command might load records into Apache Solr or write them to a MapReduce Reducer, or load them into an Enterprise Data Warehouse or a Key Value store such as HBase, or pass them into an online dashboard, or write them to HDFS.

Embedding into a Host System

A morphline has no notion of persistence, durability, distributed computing, or node failover — it’s basically just a chain of in-memory transformations in the current thread. There is no need for a morphline to manage multiple processes, nodes, or threads because this is already addressed by host systems such as MapReduce, Flume, or Storm. However, a morphline does support passing notifications on the control plane to command subtrees. Such notifications include BEGIN_TRANSACTION, COMMIT_TRANSACTION, ROLLBACK_TRANSACTION, and SHUTDOWN.

Syntax

The morphline configuration file is implemented using the HOCON format (Human Optimized Config Object Notation) developed by typesafe.com. HOCON is basically JSON slightly adjusted for the configuration file use case. HOCON syntax is defined at the HOCON github page.

Currently Available Commands

The Kite SDK includes several maven modules that contain morphline commands for flexible log file analysis, single-line records, multi-line records, CSV files, JSON, commonly used HDFS file formats Avro and Hadoop Sequence Files, regular expression based pattern matching and extraction, operations on record fields for assignment and comparison, operations on record fields with list and set semantics, if-then-else conditionals, string and timestamp conversions, scripting support for dynamic java code, a small rules engine, logging, metrics and counters, integration with Solr including SolrCloud, integration and access to the large set of file formats supported by the Apache Tika parser library, auto-detection of MIME types from binary data using Tika, and decompression and unpacking of arbitrarily nested container file formats, among others. These are described in detail in the Cloudera Morphlines Reference Guide.

Syslog Use Case

Now we’ll review a concrete example use case: We want to extract information from a syslog file and index it into Solr in order to make it available for Search queries. The corresponding program should be able to run standalone, or embedded inside a Flume Sink or embedded in a MapReduce job.

A syslog file contains semi-structured lines of the following form:

Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

The program should extract the following record from the log line, convert the timestamp, and load the record into Solr:

priority : 164
timestamp : Feb  4 10:46:14
hostname : syslog
program : sshd
pid : 607
msg : listening on 0.0.0.0 port 22.
message : Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

These rules can be expressed with the Morphline commands readLine, grok, convertTimestamp, sanitizeUnknownSolrFields, logInfo and loadSolr, by editing a morphline.conf file to read as follows:

morphlines : [
  {
    # Name used to identify a morphline. E.g. used if there are multiple
    # morphlines in a morphline config file
    id : morphline1

    # Import all morphline commands in these java packages and their
    # subpackages. Other commands that may be present on the classpath are
    # not visible to this morphline.
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]

    commands : [
      {
        # Parse input attachment and emit a record for each input line               
        readLine {
          charset : UTF-8
        }
      }

      {
        grok {
          # Consume the output record of the previous command and pipe another
          # record downstream.
          #
          # A grok-dictionary is a config file that contains prefabricated
          # regular expressions that can be referred to by name. grok patterns
          # specify such a regex name, plus an optional output field name.
          # The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
          # The input line is expected in the "message" input field.
          dictionaryFiles : [src/test/resources/grok-dictionaries]
          expressions : {
            message : """%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
          }
        }
      }

      # Consume the output record of the previous command, convert
      # the timestamp, and pipe another record downstream.
      #
      # convert timestamp field to native Solr timestamp format
      # e.g. 2012-09-06T07:14:34Z to 2012-09-06T07:14:34.000Z
      {
        convertTimestamp {
          field : timestamp
          inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", ""MMM d HH:mm:ss"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Consume the output record of the previous command, transform it
      # and pipe the record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml. Recall that Solr throws an exception on any attempt to
      # load a document that contains a field that isn't specified in
      # schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : {           
            collection : collection1       # Name of solr collection
            zkHost : "127.0.0.1:2181/solr" # ZooKeeper ensemble
          }
        }
      }

      # log the record at INFO level to SLF4J
      { logInfo { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : {           
            collection : collection1       # Name of solr collection
            zkHost : "127.0.0.1:2181/solr" # ZooKeeper ensemble
          }
        }
      }
    ]
  }
]

Example Driver Program

This section provides a simple sample Java program that illustrates how to use the API to embed and execute a morphline in a host system (full source code).

/** Usage: java ...   ...  */
public static void main(String[] args) {
  // compile morphline.conf file on the fly
  File configFile = new File(args[0]);
  MorphlineContext context = new MorphlineContext.Builder().build();
  Command morphline = new Compiler().compile(configFile, null, context, null);

  // process each input data file
  Notifications.notifyBeginTransaction(morphline);
  for (int i = 1; i < args.length; i++) {
    InputStream in = new FileInputStream(new File(args[i]));
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, in);
    morphline.process(record);
    in.close();
  }
  Notifications.notifyCommitTransaction(morphline);
}

You can use this program to rapidly prototype ETL logic and try it with sample files.

This code is, in essence, what production tools like MapReduceIndexerTool, Apache Flume Morphline Solr Sink, and Apache Flume MorphlineInterceptor are running as part of their operation.

To print diagnostic information such as the content of records as they pass through the morphline commands, consider enabling TRACE log level — for example, by adding the following line to your log4j.properties file:

log4j.logger.com.cloudera.cdk.morphline=TRACE

Any Questions?

If you’ve got any questions, please do ask us. The best place is over on the mailing list or our new community forum. Alternatively, if you are trying out Cloudera Search, post your questions here.

A detailed description of all morphline commands can be found in the Cloudera Morphlines Reference Guide.

The Kite SDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.

Wolfgang Hoschek is a Software Engineer on the Platform team and the lead developer on Morphlines.

Wolfgang Hoschek

More by this author

Editor's Choice

Business

Generative AI for the Enterprise

Technical

Building Trust in Public Sector AI Starts with Trusting Your Data