Congratulations to Hari Shreedharan, Cloudera software engineer and Apache Flume committer/PMC member, for the early release of his new O’Reilly Media book, Using Flume: Stream Data into HDFS and HBase. It’s the seventh Hadoop ecosystem book so far that was authored by a current or former Cloudera employee (but who’s counting?).
Why did you decide to write this book?
I have been working on Apache Flume for the past two years, and have been actively responding to user and developer queries on the Apache and Cloudera mailing lists. Even though Flume and its components are pretty well documented, I realized that a book that documented each component in detail, explained end-to-end deployment, and so on would really help users. I learned a lot of lessons over the years building Flume and working with customers who have deployed Flume on thousands of servers. I felt that a book on Flume would be a good place to share these lessons.
Who should read this book?
The book is essentially meant for operations engineers who are planning to deploy or have already deployed Flume, and for developers who want to build custom Flume components for their specific use cases.
Most sections of the book cover the configuration and operational aspects of Flume that can help operations engineers deploy and configure Flume. I have tried to share most of the lessons I learned helping customers and users deploy and configure Flume in production.
Flume is highly customizable, which allows developers to write their own plugins. In this book, I describe how to implement plugins for various Flume components, with examples.
What are your favorite things about Flume that you want people to know?
Flume is extremely flexible by design. Literally every major component in a Flume agent is pluggable, and users can deploy their own implementations. This leads to a wide variety of use cases that even we, as developers, did not expect to see. Custom formats, use-case-specific event modification, lightweight processing, and so on can easily be done in Flume by simply dropping in plugins.
What are some other things that the Flume community can do to make Hadoop data ingestion easier?
One of the things I hope will be added to Flume is a centralized configuration mechanism that allows the user to deploy the configuration in one place rather than on every single machine. Cloudera Manager added this functionality some time back, but it would still be nice to see this happen within Flume itself. There is work going on in the Apache community to integrate this feature into Flume (FLUME-1491). Once this gets committed, Flume configuration will become much easier.