The second edition of my book “Hadoop: The Definitive Guide”, published by O’Reilly, is now available. The first edition was launched at the Hadoop Summit in June 2009, and has gone on to sell well. Less than a year later I was asked to write the second edition. The Hadoop ecosystem has been growing fast (and continues to), and the bulk of the extra 100 pages in the second edition are devoted to three new projects: Hive, Avro, and Sqoop. The major changes are as follows:
- A chapter on Hive (Chapter 12). Hive is a data processing platform that provides a SQL interface to Hadoop. When I started on the first edition, Hive was a relatively new Hadoop contrib project from Facebook. Since then it has grown into an Apache Top-Level Project with a vibrant community and a wide user base spread across many organizations.
- A chapter on Sqoop (Chapter 15), written by Aaron Kimball, the project founder. Sqoop is a Cloudera-sponsored open-source tool for efficiently moving data between relational databases and HDFS.
- A section on Avro (in Chapter 4, “Hadoop I/O”). Avro was just starting out (at Yahoo!) at the time of the first edition, but is growing in importance, both for data serialization (which is what is covered in the book) and for RPC (which will likely be used for the foundations of Hadoop someday). Avro is now an Apache Top-Level Project. You can read the Avro section for free online.
- A section on security (in Chapter 9, “Setting Up a Hadoop Cluster”). Adding Kerberos authentication to Hadoop has been a major undertaking by the Yahoo! engineering team, and this section gives an introduction to the topic and explains the changes that a user can expect to see.
- A new case study “Using Pig and Wukong to Explore Billion-edge Network Graphs” (in Chapter 16) by Philip (“flip”) Kromer of Infochimps.
The second edition continues to target the Hadoop 0.20 release family (which includes all the major distributions), with many small updates and clarifications made throughout the text. The content for Pig, HBase, and ZooKeeper has been revved to reflect the latest versions, some of which involved significant updates (such as the new Load and Store UDF interfaces in Pig 0.7.0).
There is a companion website to the book where you can find example code and other information about the book.
Finally, I’d like to thank my editor Mike Loukides and the production team at O’Reilly for turning the second edition around so quickly, and John Kreisa at Cloudera, who kept the process running smoothly.