Apache Flume Development Status Update


Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator to become a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). The basic architecture of Flume is described in the Flume documentation. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch; techniques and components that can be used to route the data; configuration validation; and, finally, support for serializing events.

In the past several months, contributors have been busy adding several new sources, sinks, and channels to Flume. Flume now supports Syslog as a source, with sources for receiving Syslog data over both TCP and UDP.
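As a sketch, a Syslog-over-TCP source might be configured like this (the agent name, channel name, port, and bind address here are made up for illustration):

a1.sources = src1
a1.channels = ch1
# Listen for Syslog messages over TCP on port 5140
a1.sources.src1.type = syslogtcp
a1.sources.src1.port = 5140
a1.sources.src1.host = 0.0.0.0
a1.sources.src1.channels = ch1

A UDP-based source can be configured the same way with the corresponding Syslog UDP source type.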

Flume now has a high performance persistent channel – the File Channel. If the agent fails for any reason before the sink has committed its transaction, events that were committed by the source are reloaded from disk when the agent starts up again and can then be taken by the sink. Events are removed from the channel only when the transaction is committed by the sink. The File Channel uses a write-ahead log to save events.
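A minimal File Channel configuration might look like the following sketch (the agent name, channel name, and directory paths are hypothetical; the checkpoint and data directories are where the channel persists its state and its write-ahead log):

a1.channels = ch1
a1.channels.ch1.type = file
# Directory for the channel's periodic checkpoints
a1.channels.ch1.checkpointDir = /var/flume/checkpoint
# One or more directories holding the write-ahead log data files
a1.channels.ch1.dataDirs = /var/flume/data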

Something that was high on our list of priorities was to give applications the ability to easily write data to Flume agents. So the RPC code was refactored into a separate Flume SDK, which can be used to push events into Flume. The SDK has two clients: one that simply appends data to a single agent, and a second that supports failover (marked in the figure as FC). The failover client takes a list of agents through its configuration and connects to them one by one when a failure happens.
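As a sketch, the failover client can be set up through a properties file passed to the SDK's RpcClientFactory; the hostnames below are hypothetical, and the property names follow the Flume SDK documentation:

# Use the failover RPC client rather than the single-agent default
client.type = default_failover
# Logical names of the agents, tried in order on failure
hosts = h1 h2
hosts.h1 = agent1.example.com:41414
hosts.h2 = agent2.example.com:41414
# Number of agents to try before the append is failed
max-attempts = 2

With this in place, the application simply calls append() on the client; if the first agent becomes unreachable, the client fails over to the next one in the list.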

We have also added the capability to plug in an EventSerializer, which serializes an event and writes it to an output stream. This event serializer interface can be used to convert the event data into a format required by the user, such as converting an event body into text before it is written out.
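For example, a sink that supports serializers can select one by name or by class; in this hypothetical HDFS sink configuration, the built-in text serializer is chosen (a fully qualified builder class name for a custom EventSerializer could be given instead):

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
# Built-in serializer that writes the event body as text;
# replace with a custom EventSerializer.Builder class name if needed
a1.sinks.k1.serializer = text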

Also recently added was a new Channel Selector called the MultiplexingChannelSelector. The multiplexing channel selector allows the user to divert the events from the source to one or more of the channels linked to it. The MultiplexingChannelSelector is configured to look at a specific event header to find pre-defined values. The configuration defines mappings of specific values of the header to a subset of the channels wired to that source. If one of the configured values for the header is found, the channel selector writes the event to the channel(s) which are mapped to that value. If the value of the header does not match any of the values defined in the configuration, then the event is written to a set of channels which are defined as “default channels.” An example configuration is below:

host2.sources.src1.channels = ch1 ch2
host2.sources.src1.selector.type = multiplexing
host2.sources.src1.selector.header = myheader
host2.sources.src1.selector.mapping.header1 = ch1
host2.sources.src1.selector.mapping.header2 = ch2
host2.sources.src1.selector.default = ch2

This configuration routes events whose “myheader” header has the value header1 to ch1, and events whose value is header2 to ch2. Events with no “myheader” header, or with any other value, go to the default channel, ch2.

We also are adding a new configuration system, which allows each component to specify the configuration it requires through configuration stubs. These stubs can be used to validate the configuration of individual components, and hence the entire agent. The main component which does the validation has been committed to the Flume code base, while individual configuration stubs and a standalone configuration tool are in the final stages of development/review.

Flume now has the capability to modify events in flight. Interceptors are Flume plugins (intended to allow for both “stock” plugins as well as custom Java code) that sit between a Source and a Channel. An installed Interceptor is given full access to each event that flows out of the Source it is attached to, and it may inspect, transform, or drop each event. Interceptors may be chained together to form a pipeline of plugin code that is executed for each event. Interceptors are the somewhat constrained answer in Flume 1.x to the concept of Decorators in Flume 0.9.x.
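As a sketch of how an interceptor is wired in, the following attaches a stock timestamp interceptor to a source (the agent, source, and interceptor names are made up; the timestamp interceptor adds a timestamp header to each event as it leaves the source):

a1.sources.src1.interceptors = ts
# Stock interceptor that stamps each event with a "timestamp" header
a1.sources.src1.interceptors.ts.type = timestamp

Multiple interceptors can be listed, space-separated, on the interceptors property to form a chain that runs in order for each event.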

We are continuing to add more features to Flume like an HBase Sink, which was recently committed. The HBase Sink also provides a serializer interface which users can implement to serialize the events into HBase Put/Increments.
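A hypothetical HBase Sink configuration might look like the following (the table and column family names are made up; the serializer property points at a class implementing the sink's serializer interface, here a simple stock implementation):

a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.channel = ch1
# Hypothetical target table and column family
a1.sinks.k1.table = flume_events
a1.sinks.k1.columnFamily = cf
# Serializer that converts each event into HBase Puts/Increments
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer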

Flume has a very active and diverse developer community, which has been continuously contributing new features and enhancing existing ones. As a result, Flume has proved to perform extremely well: in testing, a single agent was able to write more than 70,000 events per second to HDFS. More details of the performance analysis are available, thanks to Mike Percy and Will McQueen.

We want to make sure Flume becomes as useful as possible to anyone who would want to stream large amounts of data to HDFS and/or HBase. To get started with Flume, please take a look at the Flume Getting Started Guide.

