Apache Hadoop ecosystem, time to celebrate! The much-anticipated, significantly updated 4th edition of Tom White’s classic O’Reilly Media book, Hadoop: The Definitive Guide, is now available.
The Hadoop ecosystem has changed a lot since the 3rd edition. How are those changes reflected in the new edition?
The heart of the book is the core Apache Hadoop project, and since the 3rd edition, Hadoop 2 has stabilized and become the Hadoop runtime that most people are using. The 3rd edition actually covered both Hadoop 1 (based on the JobTracker) and Hadoop 2 (based on YARN), which made things a bit awkward at times since it flipped between the two and had to describe the differences. Only Hadoop 2 is covered in the 4th edition, which simplifies things considerably. The YARN material has been expanded and now has a whole chapter devoted to it.
This update is the biggest since the 1st edition, and in response to reader feedback, I reorganized the chapters to simplify the flow. The new edition is broken into parts (I. Hadoop Fundamentals, II. MapReduce, III. Hadoop Operations, IV. Related Projects, V. Case Studies), and includes a diagram to show possible pathways through the book (on p. 17).
The Hadoop ecosystem has been growing faster with each new edition, which makes it impossible to cover everything; even if I wanted to, there wouldn’t be enough space. The book is aimed primarily at users doing data processing, so in this edition I added two new chapters about processing frameworks (Apache Spark and Apache Crunch), one on data formats (Apache Parquet, incubating at this writing) and one on data ingestion (Apache Flume).
I’m also really pleased with the two new case studies in this edition: one about how Hadoop is used to manage records in a healthcare system (by Ryan Brush and Micah Whitacre), and one on building big data genomics pipelines (by Matt Massie).
Based on those changes, what do you want readers to learn?
I think the core Hadoop features are still important to understand—things like how HDFS stores files in blocks, how MapReduce input splits work, how YARN schedules work across nodes in the cluster. These ideas provide the foundation for learning how components covered in later chapters take advantage of these features. For example, Spark uses MapReduce input formats for reading and writing data efficiently, and it can run on YARN.
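To make the block and split ideas concrete, here is a minimal sketch in Python (not code from the book) of how a file's size maps to HDFS blocks and MapReduce input splits. It assumes the 128 MB default block size of Hadoop 2 and uses the split-size rule from Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)). The function names are hypothetical, and the model is simplified: it ignores the small slop factor that real Hadoop applies to the final split, as well as replication and block locality.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB  # assumed: the Hadoop 2 default HDFS block size


def block_count(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Number of HDFS blocks needed to store file_size bytes."""
    # Integer ceiling division; an empty file occupies no blocks.
    return (file_size + block_size - 1) // block_size


def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split-size rule used by FileInputFormat: max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))


def input_splits(file_size, block_size=DEFAULT_BLOCK_SIZE,
                 min_size=1, max_size=2**63 - 1):
    """Return (offset, length) pairs covering the file.

    Simplified model: every split is exactly split_size bytes except
    possibly the last one.
    """
    size = split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    while offset < file_size:
        length = min(size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    file_size = 300 * MB
    print(block_count(file_size))
    for offset, length in input_splits(file_size):
        print(offset // MB, length // MB)
```

With the defaults, each split lines up with one block, which is what allows a map task to be scheduled on a node that already holds the data. Lowering max_size yields more, smaller splits; raising min_size above the block size yields fewer splits that each span more than one block.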
Beyond that, we’ve seen how the Hadoop platform as a whole has become even more powerful and flexible, and the new chapters reflect some of these new capabilities, such as iterative processing with Spark.
In a nutshell, what does your research process/methodology look like?
I think the two main things that readers want from a book like this are: 1) good examples for each component, and 2) an explanation of how the component in question works. Examples are important since they are concrete and allow readers to start using and exploring the system. In addition, a good mental model is important for understanding how the system works so users can reason about it, and extend the examples to cover their own use cases.
There’s a Martin Gardner quote that I cite in the book, and which sums up my approach to writing about technology: “Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.”
I find that there’s really no substitute for reading the code to understand how a component works. I spend a lot of time writing small examples to test how different aspects of the component work. A few of these are turned into examples for the book. I also spend a lot of time reading JIRAs to understand the motivation for features, their design, and how they relate to other features. Finally, I’m very lucky to have access to a talented group of reviewers who work on Hadoop projects. Their feedback has undoubtedly improved the book.
Nothing can be completely “definitive.” What is good complementary material for this book?
The goal of my book is to explain how the component parts of Hadoop and its ecosystem work and how to use them—the nuts and bolts, as it were. What it doesn’t do is explain how to tie all the pieces together to build applications. For this I recommend Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira, which explains how to select Hadoop components and use them to build a data application. For building machine-learning applications, I like Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.
My book has some material for Hadoop administrators, but Eric Sammer’s Hadoop Operations (2nd edition forthcoming) goes into a lot more depth. There are also books for most of the Hadoop components that go into more depth than mine.
It’s really gratifying to see the large number of books coming out in the Hadoop and big data space.
Do you have a 5th edition in you?
I like to think so, but I’m not sure my family would agree (yet)!