Our thanks to Melanie Imhof, Jonas Looser, Thierry Musy, and Kurt Stockinger of the Zurich University of Applied Sciences in Switzerland for the post below about their research into the query performance of Impala for mixed workloads.
Recently, an industry partner approached us to research and create a blueprint for a new near-real-time Big Data query-processing architecture that would replace its current architecture, which is based on a popular open source database system.
The goal of our research was to find a new, cost-effective solution that could accomplish the following:
- Perform analytic queries for terabyte-scale data sets with response times of only a few seconds
- Guarantee the above response times even with increasing data set sizes
- Guarantee the above response times even when concurrent users increase
In the remainder of this post, we will explain how we identified Impala as the best solution for these requirements, and refer you to the detailed results of our research.
Use Case Details
The current system manages dozens of different data sets being accessed by hundreds of users from various international locations. Moreover, the database feeds multiple web-based business intelligence reports for its customers that are updated on a regular basis. In other words, the architecture can be considered as a near-real-time data warehouse/business intelligence environment that enables advanced analytics for customers.
Furthermore, the data size as well as the user base that concurrently access the system have been increasing considerably — with data size increasing by up to 1,000x to be accessed by hundreds of users. Hence, the new architecture needed to be re-designed to cover new, multi-user workloads.
In addition to the requirements explained above, the chosen solution also had to allow for a relatively easy migration from the current open source database approach to the new architecture. Moreover, the existing business intelligence reports for end users had to be migrated with minimal impact.
Given these requirements, we made the following strategic decisions for our architecture choice:
- We ruled out traditional database systems. The use case is a classic business intelligence problem with near-real-time requirements, but because our solution had to be both large-scale and cost effective, traditional commercial database systems were not an option.
- We ruled out Apache Hive. Although the presence of large data sets with the potential to scale by a factor of 10, 100, or 1,000 initially made Hive an option, our requirement for near-real-time query processing eventually ruled it out.
In contrast, as we explained in the introduction, we found that Impala could meet our partner’s needs for a near-real-time BI system at higher scale and lower cost.
The goal of our experiments was to go beyond the popular TPC-DS benchmark for evaluating decision-support workloads. Rather, our focus was on multi-dimensional point, range, and aggregation queries under concurrent user access, for a real-world workload that is typical of decision-support systems.
In summary, we found that in a multi-user, multi-node environment, Impala’s query response time increases linearly with the number of concurrent users, and that response time is distributed evenly across all users. However, when we also simulated the time concurrent users need to interact with query results by adding “sleep” (think) time between queries, response time dropped back to the optimal execution time expected for a single user.
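The measurement loop behind that comparison can be sketched as follows. This is an illustrative simulation, not our actual harness: `run_query` merely sleeps to stand in for submitting one query to the cluster (a real harness would issue SQL through an Impala client library), and all latency and think-time values are assumed parameters.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

BASE_LATENCY_S = 0.1  # illustrative stand-in for a single query's runtime


def run_query():
    """Placeholder for running one query against the cluster; latency is
    simulated here with a fixed base time plus small random jitter."""
    start = time.monotonic()
    time.sleep(BASE_LATENCY_S + random.uniform(0.0, 0.02))
    return time.monotonic() - start


def user_session(num_queries, think_time_s):
    """One simulated user: issue queries back to back, pausing for
    'sleep' (think) time after each one to model inspecting the result."""
    latencies = []
    for _ in range(num_queries):
        latencies.append(run_query())
        time.sleep(think_time_s)
    return latencies


def benchmark(num_users, num_queries=3, think_time_s=0.0):
    """Run num_users concurrent sessions and report the mean query latency."""
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        sessions = pool.map(
            lambda _: user_session(num_queries, think_time_s),
            range(num_users),
        )
        all_latencies = [t for session in sessions for t in session]
    return statistics.mean(all_latencies)
```

Comparing `benchmark(n, think_time_s=0.0)` against a run with non-zero think time for growing `n` mirrors the two scenarios above: back-to-back queries keep all users on the cluster at once, while think time spreads the load so each user again observes close to single-user latency.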
The details of our evaluation can be found in this technical report.
This work was funded as an applied research project/proof of concept by LinkResearch Tools.
Melanie Imhof is a research associate and a Ph.D. candidate at Zurich University of Applied Sciences and the University of Neuchâtel, Switzerland. She is interested in information retrieval, machine learning, and related areas. She obtained her M.Sc. in computer science from ETH Zurich with a focus on machine learning and computer vision.
Jonas Looser is a research associate at Zurich University of Applied Sciences (ZHAW). He received a B.Sc. in computer science from ZHAW and has worked on various data science projects there. His research interests are in data warehousing and business intelligence.
Thierry Musy is a research associate at Zurich University of Applied Sciences (ZHAW) in the areas of information retrieval, data science, and big data. He works on joint ZHAW research projects in which cutting-edge technologies are applied in both academia and industry. He earned a B.Sc. in computer science at ZHAW and has held leading positions in the ICT and financial sectors.
Dr. Kurt Stockinger is an associate professor of computer science at Zurich University of Applied Sciences, Switzerland. His research interests are in data science with focus on big data, data warehousing, business intelligence and advanced database technology. He is also on the Advisory Board of Callista Group AG. Previously Kurt worked at Credit Suisse in Zurich, at Lawrence Berkeley National Laboratory in Berkeley, California, as well as at CERN in Geneva. He holds a Ph.D. in computer science from CERN/University of Vienna.