As part of our series of announcements at the recent Hadoop Summit, Cloudera released two of its previously internal projects into open source. One of those was the HUE user interface environment, which we’ll be saying a bit more about later this week. The other was our data movement platform Flume. We’ve been working on Flume for many months, and it’s really exciting to be able to share the details of what we’ve been doing. In this blog post I’d like to introduce Flume to the world, and say a little about how it might help with data collection problems that you might be facing right now.
What is Flume?
We’ve seen our customers have great success using Hadoop for processing their data, but the question of how to get the data there to process in the first place was often significantly more challenging. Many customers had produced ad-hoc solutions built from complicated shell scripts and periodic batch copies. Such solutions, while minimally effective, gave the user no insight into how they were running, whether they were succeeding, or whether any data were being lost. Changing or reconfiguring the scripts to collect more or different data was hard. They were, at best, ‘hit and hope’ solutions.
At the same time, we observed that much more data are being produced than most organisations have the software infrastructure to collect. We are very keen to allow our users to take advantage of all the data that their cluster is generating. Looking around, we saw no solutions that supported all the features that we wanted to provide to our customers, including reliable delivery of data, an easy configuration system that didn’t involve logging in to a hundred machines to restart a process, and a powerful extensibility mechanism for easy integration with a wide variety of data sources.
As a result, we designed and built Flume. Flume is a distributed service that makes it very easy to collect and aggregate your data into a persistent store such as HDFS. Flume can read data from almost any source – log files, Syslog packets, the standard output of any Unix process – and can deliver it to a batch processing system like Hadoop or a real-time data store like HBase. All this can be configured dynamically from a single, central location – no more tedious configuration file editing and process restarting. Flume will collect the data from wherever existing applications are storing it, and whisk it away for further analysis and processing.
Four design goals
We designed Flume with four main goals for data collection in mind.
- Reliability – We recognise that failures regularly occur in clusters – machines crash, networks fail, cables get unplugged – and we designed Flume to be robust to that possibility. We also recognise that different kinds of data have different levels of importance, and therefore designed Flume to have fine-grained tunable reliability guarantees that dictate how far Flume will go to ensure that your data are delivered when those failures happen.
- Scalability – Flume is designed to capture a lot of data from a wide variety of sources. As such we want to be able to run Flume on hundreds of machines with no scalability bottleneck. Flume is horizontally scalable, which broadly means that performance is roughly proportional to the number of machines on which it is deployed.
- Manageability – Flume is designed to be managed centrally. A medium-sized Flume deployment is too large to configure machine-by-machine. We want Flume to enable flexible data collection, and that means making it easy to make changes.
- Extensibility – We want Flume to be able to collect all your data. We’ve written a number of different ‘sources’ and ‘sinks’ for Flume ourselves, but in the spirit of teaching a man to fish we’ve made sure that the API for doing so is extremely simple. That way, anyone can prototype and build new connectors for Flume to almost any data source very quickly indeed.
Flume’s architecture
Flume’s architecture is very simple. Data passes through a network of logical nodes, which are lightweight entities that know how to do exactly one thing: read data from some source and send it on to a sink. A source might be a file, process output or any other of the many sources that Flume supports – or it might be another Flume logical node. Similarly, a sink might be another Flume logical node, or it might be HDFS, S3 or even IRC or an e-mail.
We can link logical nodes together into a chain, or flow. At the beginning of a flow is the original source of the data, and the sink of the final logical node in the chain defines where the data will eventually be delivered. We call the first logical node in a flow the agent, and the last logical node the collector. The intermediate nodes are called processors, and their job is to do some light transformation and filtering on the events as they come through in order to get them ready for storage.
A single Flume process can house many logical nodes. This makes it very easy to configure, create, delete and restart individual logical nodes without having to restart the Flume process.
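To make the logical-node idea concrete, here is a minimal sketch in Python. This is not Flume’s actual API (Flume is written in Java, and its real interfaces are documented in the Flume User Guide); every name below is illustrative. The point is simply that each node does one thing – take an event, optionally transform it, and hand it to its sink – and that nodes chain together by using another node as their sink.

```python
# Illustrative model of Flume's logical nodes -- NOT Flume's real API.
# Every class and method name here is invented for this sketch.

class LogicalNode:
    """A node that optionally transforms each event, then forwards it."""
    def __init__(self, transform=None, sink=None):
        self.transform = transform  # optional per-event processing
        self.sink = sink            # another node, or a terminal store

    def receive(self, event):
        if self.transform:
            event = self.transform(event)
        self.sink.receive(event)

class ListSink:
    """Terminal sink, standing in for HDFS or S3 in this sketch."""
    def __init__(self):
        self.events = []

    def receive(self, event):
        self.events.append(event)

# Build a flow: agent -> processor -> terminal store.
store = ListSink()
processor = LogicalNode(transform=lambda e: {**e, "tagged": True}, sink=store)
agent = LogicalNode(sink=processor)

agent.receive({"line": "GET /index.html"})
```

Because each node only knows about its own source and sink, swapping out where data ends up – or inserting an extra processing step – means reconfiguring one node, not rewriting the flow.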
All of this configuration is controlled by the Flume master, which is a separate distributed service which takes care of monitoring and updating all the logical nodes in a Flume deployment. Users can make configuration changes, such as changing a logical node’s source or creating a brand new logical node, through the Flume master, either by using the shell that we’ve provided or by using the simple web-based administration interface.
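As a sketch of what this looks like in practice (the command syntax below is an approximation from memory of the Flume shell of the time – the hostnames, port, and exact spellings are assumptions, and the Flume User Guide documents the real forms), configuring a logical node from the Flume shell is a one-liner:

```
connect master-host:35873
exec config agent1 'tail("/var/log/httpd/access_log")' 'agentSink("collector-host", 35853)'
```

The key point is that this command is issued once, at the master, rather than by editing a configuration file and restarting a process on the machine where `agent1` lives.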
An example flow
The diagram above shows the layout of a simple flow for capturing web server logs. Three logical nodes are involved, each with different source and sink configurations.
- The agent is continually watching the tail of an Apache HTTP server log file. Flume’s tail source correctly deals with log rotation, so you don’t have to change your logging policy to integrate with Flume. Each line of the log is captured as one event, and then sent downstream to…
- The processor receives individual events from the agent, and uses regular expressions to identify and extract the name of the user’s browser (such as ‘Firefox’ or ‘Internet Explorer’) from the log event. This metadata is then attached to the event so that it’s available for later processing by…
- The collector receives the annotated events from the processor, and writes them to a path in HDFS that is determined by the browser string that was extracted earlier. Flume’s HDFS sink uses a simple template language to figure out where to write events. In this case, all logs from users using Firefox are saved to /weblogs/Firefox/ in HDFS.
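The whole three-node flow might be expressed in Flume’s dataflow language roughly as follows. This is a hedged sketch: the hostnames and ports are invented, and the exact source, decorator, and sink names (including the regex decorator’s signature and the `%{browser}` escape in the output path) are approximations – the Flume User Guide gives the real spellings.

```
agent1     : tail("/var/log/httpd/access_log") | agentSink("proc-host", 35853) ;
processor1 : collectorSource(35853) | { regex("\"([^\"]*)\"$", 1, "browser") => agentSink("coll-host", 35854) } ;
collector1 : collectorSource(35854) | collectorSink("hdfs://namenode/weblogs/%{browser}/", "weblog") ;
```

Each line configures one logical node as a source piped into a sink; the processor wraps its sink in a decorator that attaches the extracted browser name to each event before forwarding it.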
This simple example shows how easily Flume can capture your data and perform some non-trivial processing on it. This entire flow can be constructed and configured from the Flume master, without any need to log on to the individual machines involved.
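The extraction step in the middle of the flow is ordinary regular-expression matching. As a standalone illustration in plain Python (not Flume code – the sample log line and both patterns are assumptions for this sketch), pulling a browser name out of an Apache combined-format log line looks like this:

```python
import re

# An assumed combined-log-format line; the User-Agent is the final
# quoted field on the line.
line = ('10.0.0.1 - - [10/Jun/2010:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" '
        '"Mozilla/5.0 (X11; Linux) Gecko Firefox/3.6"')

# Grab the last quoted field (the anchor $ forces the final one to match).
ua_match = re.search(r'"([^"]*)"\s*$', line)
user_agent = ua_match.group(1) if ua_match else ""

# Look for a known browser token inside the User-Agent string.
browser_match = re.search(r"(Firefox|Chrome|Safari|MSIE)", user_agent)
browser = browser_match.group(1) if browser_match else "Other"
```

With the browser name in hand as event metadata, the collector’s templated HDFS path can route this event to `/weblogs/Firefox/`.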
For more information
Download and get started with Flume here.
Flume is open source! Grab the source code from Cloudera’s GitHub page here.
The Flume User Guide is a one-stop resource for a lot more detail on Flume’s architecture, usage and internals.
We have a series of technical blog posts on using Flume planned for the near future. Stay tuned for much more Flume!