The New "Hadoop in Practice" Book: A Chat with The Author
Today we bring you a brief interview with Alex Holmes, author of the new book, Hadoop in Practice (Manning). You can learn more about the book and download a free sample chapter here.
There are a few good Hadoop books on the market right now. Why did you decide to write this book, and how is it complementary to them?
When I started working with Hadoop I leaned heavily on Tom White’s excellent book, Hadoop: The Definitive Guide (O’Reilly Media), to learn about MapReduce and how the internals of Hadoop worked. As my experience grew and I started working with Hadoop in production environments I had to figure out how to solve problems such as moving data in and out of Hadoop, using compression without destroying data locality, performing advanced joining techniques and so on. These items didn’t have a lot of coverage in existing Hadoop books, and that’s really the idea behind Hadoop in Practice – it’s a collection of real-world recipes that I learned the hard way over the years.
Hadoop in Practice covers more advanced aspects of working with Hadoop, such as MapReduce and HDFS patterns, performance tuning and debugging. The book also looks at how Hadoop can be used as a platform for data science and data warehousing by studying R integration techniques and intermediate Pig and Hive recipes. Data mining is another important topic today, and a book on Hadoop isn’t complete without a look at how Mahout lets you run your favorite algorithms at scale.
I believe this is the first Hadoop book that presents its contents in a problem/solution format. Accompanying each solution is the background behind it, as well as alternatives if the recommended solution doesn’t work in the reader’s particular situation. Another unique trait of my book is its heavy use of visual aids to help explain complex concepts, and the large number of working code examples that the reader can put to use immediately.
Who is your intended reader?
I view my book as being useful to developers who have committed to using Hadoop, are familiar with Hadoop fundamentals, and are starting to ask questions such as “What data format should I use to store my data?”, “How do I run algorithms such as PageRank in MapReduce?” and “How do I use Bloom filters to optimize my joins?” The application of traditional software engineering practices such as unit testing, debugging and performance tuning to Hadoop is also covered, to help ease the adoption of Hadoop in engineering teams.
In your research, what did you learn that you did not already know?
Writing about these subjects in my book required me to plow through reams of Hadoop source code, as well as Hadoop-related open-source projects, so that I could better explain concepts to readers. This was often a humbling experience – I would start out thinking I understood a topic inside-out, only to discover that in reality my working knowledge wasn’t all it was cracked up to be!
Using Bloom filters in MapReduce to optimize joins was one area which I hadn’t had any exposure to prior to writing the book, but it has ended up being one of the recipes that I use the most.
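For readers unfamiliar with the idea, here is a minimal in-memory sketch of a Bloom-filter join in Python. It is an illustration only, not a recipe from the book and not Hadoop code: the `BloomFilter` class, the `filter_large_side` and `join` helpers, and the sample data are all hypothetical. The assumption is that the filter is built from the join keys of the smaller dataset, then used on the map side to discard records from the larger dataset that cannot possibly match, so less data reaches the actual join.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filter_large_side(records, bloom):
    """Map-side step (simulated): drop records whose key cannot match."""
    return [rec for rec in records if bloom.might_contain(rec[0])]


def join(small, large):
    """Join step (simulated in memory): exact match on the key."""
    lookup = dict(small)
    return [(key, lookup[key], value) for key, value in large if key in lookup]


# Hypothetical sample data: a small "users" set and a larger "events" set.
small = [("u1", "Alice"), ("u2", "Bob")]
large = [("u1", "login"), ("u3", "click"), ("u2", "logout"), ("u4", "click")]

bloom = BloomFilter()
for key, _ in small:
    bloom.add(key)

candidates = filter_large_side(large, bloom)  # fewer records to move around
result = join(small, candidates)
print(result)
```

Because a Bloom filter can report false positives but never false negatives, the filtered set may still contain a few non-matching records, which is why the final step performs an exact match on the key.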
In your view, what is the #1 most important thing needed for wide adoption of Hadoop?
I believe making Hadoop easier to deploy, administer and interact with is key to its continued adoption. Organizations such as Cloudera are helping address the administrative challenges, and I believe that we will see technologies such as YARN and Cloudera Impala open Hadoop up to a wider audience.
What recent additions to the Hadoop stack have you most excited? YARN, Impala, HA?
These technologies are really exciting, and a testament to how Hadoop is evolving and maturing. CTOs and folks in Operations will love Hadoop HA and how Hadoop is being transformed into a de facto enterprise technology. Impala is a great boon to data scientists and data analysts everywhere who have been crying out for a real-time analytics layer on top of their data in Hadoop. I’m a developer, and I’m very interested in the opportunities that YARN opens up for me in terms of alternative computing models to MapReduce. It’s a great time to be working with Hadoop!