Authored by a substantial portion of Cloudera’s Data Science team (Sean Owen, Sandy Ryza, Uri Laserson, Josh Wills), Advanced Analytics with Spark (currently in Early Release from O’Reilly Media) is the newest addition to the pipeline of ecosystem books by Cloudera engineers. I talked to the authors recently.
Why did you decide to write this book?
We think it’s mostly to fill a gap between what a lot of people need to know to be productive with large-scale analytics on Apache Hadoop in 2015, and the resources that are out there. There are plenty of books on machine learning theory, and plenty of references covering how to use Hadoop ecosystem tools. However, there is not as much specifically targeting the overlap between the two, and focusing on use cases and examples rather than being a manual. So the book is a modest attempt to meet that need, which we see turn up frequently among customers and in the community.
Who is the intended reader?
The ideal reader is a data scientist or aspiring data scientist. “Data scientist” has come to mean quite a few things, but the book is targeted specifically at the subset who are interested in analysis on large datasets, and who are motivated to learn a bit about the software and mathematical underpinnings of doing so. It will be most useful for people who want to get their heads around the basics of machine learning but are more interested in its application than the theory.
Different chapters appeal to different levels of experience in different fields. For example, the second chapter, on record linkage, seeks to teach the basics of using Scala and Apache Spark to work with data, while the eighth chapter, on estimating financial risk through Monte Carlo simulation, assumes a basic understanding of probability and statistics.
What will readers learn, and how does it complement what they will learn from other titles on the market?
Readers ought to pick up the 20% of Spark that’s used 80% of the time in practice. It’s not a reference by any means; Learning Spark (also in Early Release at the time of this writing) is the “definitive” guide. Likewise, it gives enough machine-learning theory to use Spark as a tool for analytics correctly, but it is not a textbook or ML course. It complements, rather than replaces, resources like Coursera’s free online ML courses.
What makes Spark so different in this particular area? Why do people need to know about this?
The first couple chapters of the book actually try to answer this question, and we think it comes down to a couple of things. Spark is just far more developer-friendly than the predecessor frameworks for processing large datasets. Its rich library of operators makes expressing complex transformations easy, and the interactive environment it provides enables exploratory analysis. Spark also has primitives that open up many of the processing patterns required by machine-learning algorithms. It’s relevant for exploratory as well as operational analytics.
None of these capabilities are individually new, but having one platform that does a decent job at all of them is powerful. Spark’s abstractions strike a nice balance between forcing users to write programs that can scale to lots of data and letting them think about things at a high level.
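To give a flavor of that “rich library of operators” (this sketch is mine, not from the book): Spark’s RDD API deliberately mirrors Scala’s collection operators, so a chained transformation reads much the same whether it runs locally or across a cluster. The pipeline below runs on a plain Scala `Seq` for self-containedness; on an RDD, the `groupBy`/`map` aggregation would typically be expressed as `reduceByKey`.

```scala
object OperatorSketch {
  def main(args: Array[String]): Unit = {
    // Toy records: "name,count" strings standing in for a large dataset.
    val records = Seq("alice,3", "bob,5", "alice,4")

    // A chain of transformations, in the same shape as an RDD pipeline:
    val totals = records
      .map(_.split(","))                         // parse each record into fields
      .map(f => (f(0), f(1).toInt))              // build (key, value) pairs
      .groupBy(_._1)                             // group by key (reduceByKey on an RDD)
      .map { case (k, vs) => (k, vs.map(_._2).sum) } // sum the values per key

    totals.toSeq.sortBy(_._1).foreach(println)   // prints (alice,7) then (bob,5)
  }
}
```

Because the same operator vocabulary applies in the Spark shell, a data scientist can prototype a transformation interactively on a sample and then run the identical expression over the full dataset.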