There’s an important new addition coming to the Apache Hadoop book ecosystem. It’s now in early release!
We are very happy to announce that the new Apache Hadoop book we have been writing for O’Reilly Media, Hadoop Application Architectures, is now available as an early release! The early release contains the first two chapters and can be found in O’Reilly’s catalog and via Safari.
The goal of this book is to give developers and architects guidance on architecting end-to-end solutions using Hadoop and tools in the ecosystem. We have split the book into two broad sections: the first section discusses various considerations for designing applications, and the second section describes the architectures of some of the most common applications of Hadoop, thereby applying the considerations learned in the first section.
The two chapters that are now available concentrate on design considerations for data modeling and data movement in Hadoop. For example, have you ever wondered:
- Should your application store data in HDFS or Apache HBase?
- If HDFS, in what format should you store your data? What compression codec should you use? What should your HDFS directories be called, and which users should own them? What should your partitioning columns be? In general, what are the best practices for designing your HDFS schema?
- If HBase, how can you best design your HBase schema?
- What’s the best way to store and access metadata in Hadoop? What types of metadata are involved?
- What are the considerations for designing schemas for SQL-on-Hadoop (Apache Hive, Impala, HCatalog) tables?
In Chapter 1 – Data Modeling, we discuss considerations for the above and many other questions to guide the data modeling for your application.
And, if you have ever wondered:
- How much latency is OK for your end users – a few seconds, minutes, or hours? How does the latency change the complexity of your design?
- Which tools should you use for ingesting data into your cluster – file copy, Apache Flume, Apache Sqoop, Apache Kafka – and why?
- Which tools should you use for egress of data out of your cluster – file copy, Sqoop, and so on?
- Should you ingest or export data incrementally, or overwrite it on every run?
- When using Flume, what kinds of sources, channels, and sinks should you use?
- When using Sqoop, how do you choose a split-by column and tune your Sqoop import?
- When using Kafka, how do you integrate Kafka with Hadoop and the rest of its ecosystem?
Then Chapter 2 – Data Movement is for you.
As you may have noticed, the questions above are fairly broad, and the answers rely heavily on understanding your application and its use case. So we provide a holistic set of considerations and offer recommendations, based on those considerations, for designing your application.
We encourage you to check out the early release, get involved early, and explore the answers to the above questions. And, of course, we always value your feedback – whether it’s about errata and improvements or topics that you’d like to learn more about.
The work we have done so far wouldn’t have been possible without the encouragement, support, and reviews of many people. Thanks to all our reviewers thus far!
Mark Grover is a Software Engineer at Cloudera, an Apache Bigtop committer, Apache Sentry (incubating) PMC member, and contributor to Hive, Flume, and Sqoop.
Ted Malaska is a Solutions Architect at Cloudera, and a contributor to Apache Avro, Flume, Apache Pig, and Hadoop.
Jonathan Seidman is a Solutions Architect on the Partner Engineering team at Cloudera.
Gwen Shapira is a Solutions Architect at Cloudera.