Flume community update: September 2010
The past month has been exciting and productive for the community using and developing Cloudera’s Flume! This young system is the core part of Cloudera’s Distribution for Hadoop (CDH) responsible for streaming data ingest. There has been a great influx of interest and many contributions, and in this post we will provide a quick summary of this month’s new developments. First, we’re happy to announce the availability of Flume v0.9.1, and we will describe some of its updates. Second, we’ll talk about some of the exciting new integration features coming down the pipeline. Finally, we will briefly mention some community growth statistics, as well as some recent and upcoming talks about Flume.
Flume v0.9.1 is now available in both tarball and packaged forms. This version resolves 63 issues and contains several key improvements and bug fixes. Much of this release is focused on improving the stability of Flume’s internals to help users quickly get Flume up and running and to help developers build extensions to Flume.
You can install the new release as an update through your Red Hat RPM or Debian DEB based package manager, download it in tarball form from Cloudera’s archive, or, as always, get it from Cloudera’s GitHub repository.
The key functional highlights include:
- Support for gzip compressed output files.
- New and improved sources: scribe, syslog, and tailDir (tail all files in a directory).
- Significant robustness improvements when using the disk fail-over and end-to-end reliability modes.
- Significant robustness improvements when reconfiguring, commissioning, and decommissioning logical nodes.
To improve the documentation and enhance debugging support, we have added:
- A new section of the manual that explains, by example, how to build your own Flume source, sink, and decorator plugins.
- An ‘ant eclipse’ option to automatically build project files for developing in the Eclipse IDE.
- Improved error messages in logs, plus exposed Flume internals such as current configuration properties and source/sink catalogs, to ease operator and developer debugging and verification.
For more details, read the full release notes.
Up and coming Flume features
One of Flume’s key design principles is extensibility. We are happy that people are taking advantage of this to integrate Flume with other systems. Some new features currently being developed will enable the next release of Flume to integrate more closely with CDH’s core components as well as other systems in the Hadoop ecosystem.
Here are some of the new major contributions near completion or actively in the works:
- Flume + Hive integration plugin. Mozilla’s Anurag Phadke has been working with Cloudera’s Carl Steinbach to automatically import data ingested by Flume into Hive warehouses.
- Flume + HBase integration plugin. Several guests at the recent Cloudera Hackathon improved upon our initial Flume/HBase connector and posted it so the community could continue improving it. Since then, a more generic design was proposed and Cloudera’s new intern, Dani Rayan, has volunteered to implement it.
- Flume + Cassandra integration plugin. Tyler Hobbs contributed a first version of this plugin. It is blocked by some Thrift compatibility and dependency issues.
- Secured data transport via TLS. Kim Vogt and Ben Standefer from SimpleGeo, with some feedback from David Zuelke of Bitextender, have been working on adding TLS-based wire encryption to the RPC sources and sinks to provide secure data center communications.
- Flume + Kerberized HDFS integration. Flume takes its first steps to support the newer versions of HDFS that require Kerberos authentication in order to read from and write to HDFS.
- Generic compression codec support for output files. This enables users to choose from all of the codecs Hadoop supports: gzip, bzip2, and deflate. It should also enable the LZO codec with a little extra work.
- Documentation improvements galore. Currently in the works are a semantics specification for sources and sinks, and step-by-step instructions for connecting Flume to common sources such as Apache web servers, syslog, and existing scribe loggers.
We are really grateful to the folks who have been exploring and talking about the project! The guests who tried out Flume at Cloudera’s Hackathon day (Dustin Sallings of NorthScale and Ron Bodkin, among others) gave us valuable feedback. In the past month, Cloudera’s Henry Robinson presented “Inside Flume” at Hadoop Day in Seattle. It is also great to see that some folks are slated to present at Hadoop World 2010 about integrating and using Flume. Otis Gospodnetic of Sematext will be talking about analytics with Flume and HBase. Also, Anurag Phadke from Mozilla will be presenting a talk about his experiences integrating Flume-collected data automatically into Hive. He recently posted some details on his blog.
It is great to see the community growing and we love hearing from all of you as well! It has been two months since Flume was open sourced, and our main GitHub repository now has 136 watchers and 24 forks. Our user mailing list has 102 members and our developer mailing list has 41 members. Please join us! If you are using Flume and want to keep up with where it is going, join the mailing lists and follow us on Twitter at @cloudera and #flume. If you need help, just send questions to the mailing lists or chat with us directly on IRC in channel #flume at irc.freenode.net. To meet the Flume Team and contributors in person, join us in New York City at Hadoop World on October 12th!