Flume community update: September 2010

The past month has been exciting and productive for the community using and developing Cloudera’s Flume! This young system is a core part of Cloudera’s Distribution for Hadoop (CDH) and is responsible for streaming data ingest. There has been a great influx of interest and many contributions, and in this post we will provide a quick summary of this month’s new developments. First, we’re happy to announce the availability of Flume v0.9.1, and we will describe some of its updates. Second, we’ll talk about some of the exciting new integration features coming down the pipeline. Finally, we will briefly mention some community growth statistics, as well as some recent and upcoming talks about Flume.

Flume v0.9.1

Flume v0.9.1 is now available in both tarball and packaged forms. This version resolves 63 issues and contains several key improvements and bug fixes. Much of this release is focused on improving the stability of Flume’s internals to help users quickly get Flume up and running and to help developers build extensions to Flume.

You can download the new release as an update through your Red Hat RPM or Debian DEB based package manager. Alternatively, you can download it in tarball form from Cloudera’s archive or, as always, from Cloudera’s GitHub repository.

The key functional highlights include:

To improve the documentation and enhance debugging support, we have added:

For more details, read the full release notes.

Up and coming Flume features

One of Flume’s key design principles is extensibility. We are happy that people are taking advantage of this to integrate Flume with other systems. Some new features currently being developed will enable the next release of Flume to integrate more closely with CDH’s core components as well as other systems in the Hadoop ecosystem.

Here are some of the new major contributions near completion or actively in the works:

  • Flume + Hive integration plugin. Mozilla’s Anurag Phadke has been working with Cloudera’s Carl Steinbach to automatically import data ingested by Flume into Hive warehouses.
  • Flume + HBase integration plugin. Several guests at the recent Cloudera Hackathon improved upon our initial Flume/HBase connector and posted it so the community could continue improving it. Since then, a more generic design was proposed and Cloudera’s new intern, Dani Rayan, has volunteered to implement it.
  • Flume + Cassandra integration plugin. Tyler Hobbs contributed a first version of this plugin.  It is blocked by some Thrift compatibility and dependency issues.
  • Secured data transport via TLS. Kim Vogt and Ben Standefer from SimpleGeo, with some feedback from David Zuelke of Bitextender, have been working on adding TLS-based wire encryption to the RPC sources and sinks to provide secure data center communications.
  • Flume + Kerberized HDFS integration. Flume takes its first steps to support the newer versions of HDFS that require Kerberos authentication in order to read from and write to HDFS.
  • Generic compression codec support for output files. This enables users to choose from all of the codecs Hadoop supports: gzip, bzip2, and deflate.  It should also enable the LZO codec with a little extra work.
  • Documentation improvements galore. Currently in the works are a semantics specification for sources and sinks, and step-by-step instructions for connecting Flume to common sources such as Apache web servers, syslog, and existing Scribe loggers.

Community

We are really grateful to the folks who have been exploring and talking about the project! The guests who tried out Flume at Cloudera’s Hackathon day (Dustin Sallings of NorthScale and Ron Bodkin, among others) gave us valuable feedback. In the past month, Cloudera’s Henry Robinson presented “Inside Flume” at Hadoop Day in Seattle. It is also great to see that some folks are slated to present at Hadoop World 2010 about integrating and using Flume. Otis Gospodnetic of Sematext will be talking about analytics with Flume and HBase. Also, Anurag Phadke from Mozilla will be presenting a talk about his experiences integrating Flume-collected data automatically into Hive. He recently posted some details in his blog.

It is great to see the community growing and we love hearing from all of you as well! It has been two months since Flume was open sourced, and our main GitHub repository now has 136 watchers and 24 forks. Our user mailing list has 102 members and our developers mailing list has 41 members. Please join us! If you are using Flume and want to keep up with where it is going, join the mailing lists and follow us on Twitter at @cloudera and #flume. If you need help, just send questions to the mailing lists or chat with us directly on IRC in channel #flume at irc.freenode.net. To meet the Flume Team and contributors in person, you should join us in New York City at Hadoop World on October 12th!

It has been a lot of fun so far, and we’re really looking forward to the following months!

Thanks from everyone on the Cloudera Team.
3 Responses
  • Otis Gospodnetic / September 08, 2010 / 11:40 PM

    Jonathan – regarding HBase plugin – there is nothing very generic and reusable yet, right? That FLUME-126 seems to be a specific example of hooking up Flume and HBase, if I’m not mistaken. Thanks.

  • Jonathan Hsieh / September 09, 2010 / 10:29 AM

    Otis,

    You are correct.

    However, we have a design for a flexible HBase sink that was discussed on the mailing list. Dani Rayan has volunteered to take the original HBase sink and extend the implementation with the updated design.

    I’ve just updated https://issues.cloudera.org/browse/FLUME-6 to contain this design.

  • Otis Gospodnetic / September 09, 2010 / 1:33 PM

    Yeah, I remember the discussion – it was one of the Sematext guys who initiated it (but haven’t had time to actually execute…. and I think he may actually be waiting for some feedback…). OK, so FLUME-6 is still where it’s at – thanks.
