Performance is one of the key criteria, if not the most important one, in choosing a cloud data warehouse service. In today’s fast-changing world, enterprises have to make data-driven decisions quickly, and for that they rely heavily on their data warehouse service.
In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to Microsoft HDInsight (also powered by Apache Hive-LLAP) on Azure using the TPC-DS 2.9 benchmark. Microsoft recently announced HDInsight 4.1, its latest version, and we ran this benchmark on an Interactive Query HDInsight cluster using that version.
Although both services are powered by the same version of open source Apache Hive-LLAP, the benchmark results show that CDW is better tuned out of the box to get the best possible performance from LLAP:
- CDW outperformed HDInsight by over 40% in total query runtime for TPC-DS queries using the same hardware specs (see Figure 1).
- On average, queries on CDW ran 2.7x faster than on HDInsight, providing faster overall response times (see Figure 2).
- The benchmark ran with 100% success on CDW. HDInsight, in contrast, had issues running query49, which ran out of memory, likely because of poor estimates.
You can find all the benchmark scripts to set up and run TPC-DS at 10TB scale here. In addition, the scripts and HDInsight cluster configuration used for the benchmark can be found here. CDW is an analytic offering for Cloudera Data Platform (CDP). You can easily set up CDP on Azure using scripts here.
Benchmark Configuration
On CDW, when you provision a Virtual Warehouse against your Data Catalog (a catalog of tables and views), the platform provides fully tuned LLAP worker nodes ready to run your queries. No additional setup or configuration steps are required to run the benchmark. Once the benchmark run has completed, the Virtual Warehouse automatically suspends itself when no further activity is detected. For the benchmark, we chose the “Small” Virtual Warehouse size, which is a 10-node cluster.
On HDInsight, we spun up 10 workers with the same node type as CDW for a like-for-like comparison. A few metastore configuration parameters had to be added to allow queries against large partitioned tables.
A TPC-DS 10TB dataset was generated in ACID ORC format and stored in ADLS Gen2 cloud storage. Both CDW and HDInsight had all 10 nodes running LLAP daemons with the SSD cache enabled.
Cloudera Data Warehouse vs HDInsight
For the benchmark, we performed three runs of each query and selected the run with the lowest runtime. Running each query multiple times allowed us to measure performance with data cached on the SSD from the previous run. Total runtime was then calculated by summing the best runtimes of all 98 queries.
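The best-of-three selection and the total described above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark harness itself: the function names, the query labels, and the sample timings are all hypothetical.

```python
from typing import Dict, List


def best_of_runs(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep the lowest runtime (in seconds) observed for each query."""
    return {query: min(times) for query, times in runs.items()}


def total_runtime(best: Dict[str, float]) -> float:
    """Sum the per-query best runtimes into one benchmark total."""
    return sum(best.values())


# Hypothetical timings for two queries, three runs each (seconds).
# These are NOT measurements from the benchmark.
sample_runs = {
    "query1": [12.4, 9.8, 10.1],
    "query2": [30.0, 22.5, 21.9],
}

best = best_of_runs(sample_runs)
total = total_runtime(best)
print(best)
print(f"total: {total:.1f} s")
```

Taking the minimum rather than the mean favors the warm-cache runs, which matches the stated goal of measuring performance with data already cached on the SSD.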
As shown in Figure 1, CDW outperformed HDInsight by over 40% in overall runtime, finishing the benchmark in just under 4 hours (14,386 seconds) versus HDInsight’s 6.74 hours (24,266 seconds).
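The headline figures can be checked directly from the two reported totals. The short sketch below uses only the numbers quoted above; the helper names are ours, and it assumes "outperformed by over 40%" means the reduction in total runtime relative to HDInsight.

```python
# Totals reported in this post, in seconds.
CDW_SECONDS = 14_386
HDINSIGHT_SECONDS = 24_266


def to_hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600


def runtime_reduction_pct(faster: float, slower: float) -> float:
    """Percentage by which the faster total undercuts the slower one."""
    return (slower - faster) / slower * 100


print(f"CDW:       {to_hours(CDW_SECONDS):.2f} h")
print(f"HDInsight: {to_hours(HDINSIGHT_SECONDS):.2f} h")
print(f"Reduction: {runtime_reduction_pct(CDW_SECONDS, HDINSIGHT_SECONDS):.1f}%")
```

Under that reading, the reduction works out to a little over 40%, consistent with the figure quoted above.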
The difference in performance is not limited to a small set of queries. We saw query performance improvements in CDW ranging from 2x to 40x in more than 60% of the benchmark queries, with an average speedup of 2.7x per query.
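An average per-query speedup is a different metric from the ratio of total runtimes, because short queries with large speedups count as much as long ones. The sketch below illustrates the distinction with made-up timings; it assumes the average is an arithmetic mean of per-query ratios, which this post does not state explicitly, and all names and numbers are hypothetical.

```python
from statistics import mean
from typing import Dict


def per_query_speedups(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Speedup per query: baseline runtime divided by candidate runtime."""
    return {q: baseline[q] / candidate[q] for q in baseline if q in candidate}


# Hypothetical best runtimes in seconds, NOT benchmark data.
hdinsight = {"q_a": 100.0, "q_b": 40.0, "q_c": 10.0}
cdw = {"q_a": 80.0, "q_b": 10.0, "q_c": 5.0}

speedups = per_query_speedups(hdinsight, cdw)
mean_speedup = mean(speedups.values())
aggregate_ratio = sum(hdinsight.values()) / sum(cdw.values())
share_2x = sum(s >= 2 for s in speedups.values()) / len(speedups)

print(f"mean per-query speedup: {mean_speedup:.2f}x")
print(f"total-runtime ratio:    {aggregate_ratio:.2f}x")
print(f"queries at 2x or more:  {share_2x:.0%}")
```

In this toy data the mean per-query speedup is higher than the total-runtime ratio, which is why both figures are worth reporting side by side.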
Conclusion
CDW uses the latest and best-tuned Hive engine on the market. It is built and backed by the pioneering contributors to the Apache Hive and LLAP projects, and it packages Cloudera’s knowledge and experience in tuning the platform for performance right out of the box. Rather than having to invest substantial time and effort to tune analytics for performance, organizations can get straight to what matters most: driving insight and value from their data.
In addition to better performance, CDW also provides a SaaS-like experience to seamlessly manage your data lifecycle needs. Running on highly optimized Kubernetes engines, CDW can quickly and automatically scale up and down based on actual query workload, providing optimum utilization of cloud (public as well as private) resources and budget. Finally, CDW is offered in CDP along with other data lifecycle services: Data Engineering, Operational Database, Machine Learning, and Data Hub. CDP ensures end-to-end security, governance, and metadata management consistently across all the services through its versatile Shared Data Experience (SDX) module.