Getting Started with Impala (now in early release), another book in the Hadoop ecosystem canon, is indispensable for people who want to get familiar with Impala, the open source MPP query engine for Apache Hadoop. We spoke with its author, Impala docs writer John Russell, about the book’s origin and mission.
Why did you decide to write this book?
I wanted to do some long-form tutorials, discuss anti-patterns, and cover other kinds of things that you don’t often see in official documentation. The focus on SQL coding let me go deep on certain features rather than covering everything. I could demonstrate several ways of tackling the same problem and discuss the pros and cons of each approach. In the official docs, I try to optimize for Google searchers who want to jump into any page and quickly find the right answer.
Who is the intended reader?
Anyone who knows their way around a database and is interested in learning how those SQL and data modelling skills translate to the world of Big Data and data science. I really focus on the SQL side of things from a developer perspective, which could be that of a data analyst, data scientist, someone writing a business intelligence application, or a student hoping to go into one of those fields.
What will readers learn?
How not to be intimidated when confronted with large volumes of data. I go through different ways to get data into Impala, organize it, and optimize it for queries. I figure once you’ve joined a billion-row table with a million-row table, that’s a good confidence builder. I want you to understand the reasons why query X performs better than query Y, so that when you encounter your own unique situation, you’ll be able to pull the right arrow from your quiver.
I’ve tried to distill all the gotchas and misunderstandings I’ve encountered when transitioning from one database platform to another, to help make readers comfortable in a heterogeneous environment. In my experience, data scientists can easily spend 75% of their time on SQL queries. At a data-oriented company, they might be involved with half a dozen different SQL-oriented systems.
Although I don’t cover administration-related aspects in detail, I provide tips to help developers design their schemas and code their SQL in DBA-friendly ways.
What are some particularly interesting things about Impala that most people don’t know?
That Impala doesn’t have all that many knobs to turn while doing performance tuning. Yet each one can have a dramatic impact on performance and scalability. The key to happiness is often one crucial SQL statement like COMPUTE STATS, or even just a CREATE TABLE clause for partitioning or file format. Also, I think it’s fascinating that many aspects have a kind of “donut hole” where the choices on the extreme ends matter, but you can ignore the ones in the middle.
For example, I focus mostly on the file formats that provide maximum convenience or maximum query speed, and skip over the ones that fall somewhere in between. To illustrate performance and distributed queries, I’ll use tables that are tiny or huge. Tables somewhere in the middle (with sizes like 1MB, 10MB, or 100MB) are basically all the same from a Big Data perspective.
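As a rough illustration of the statements mentioned above, here is a minimal Impala SQL sketch. The table and column names (sales_raw, sales_by_year, id, amount, year) are hypothetical and do not come from the book, and Parquet is only assumed here as a typical choice of query-oriented file format.

```sql
-- Hypothetical table: the PARTITIONED BY and STORED AS clauses are the
-- kind of CREATE TABLE choices that can affect query performance.
CREATE TABLE sales_by_year (
  id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (year SMALLINT)
STORED AS PARQUET;

-- Copy rows from an assumed existing table named sales_raw.
-- The partition key column comes last in the SELECT list.
INSERT INTO sales_by_year PARTITION (year)
  SELECT id, amount, year FROM sales_raw;

-- Gather table and column statistics for the query planner.
COMPUTE STATS sales_by_year;

-- Inspect the statistics that were collected.
SHOW TABLE STATS sales_by_year;
```

Running COMPUTE STATS after loading data gives the planner the row counts and column statistics it uses when choosing how to execute joins.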
Do you foresee this book having future editions, and if so, what do you think they would add?
I’m expecting plenty of interesting topics for future editions. The new features in Impala 2.0 and beyond open up new use cases for other kinds of queries, ETL techniques, and porting tips. As people explore more and more Impala features, they’ll find opportunities to trade off between more convenience and better performance. All good subjects for more tutorials and deep dives!
Meet John in person and get a signed copy in the Cloudera booth at Strata + Hadoop World, at 4pm ET on Fri., Oct. 15.