Cloudera Data Platform (CDP) Private Cloud is the most comprehensive on-premises platform for integrated analytics and data management. It combines the best of Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise Plus, and brings the latest and greatest open source technologies for data management and analytics to the data center.
With the latest version (7) of CDP Private Cloud Base, we’ve introduced a number of new features and enhancements. In this blog post, we would like to share the performance improvements available in Apache HBase.
For those who are new to HBase or are evaluating it for a new project, HBase is a non-relational distributed database that is trusted by architects and developers who want to process large volumes of data in a timely and reliable manner.
For this performance comparison, we compared HBase2, available in CDP Private Cloud Base 7, with HBase1, available in CDH 5, using YCSB workloads. The comparison shows the performance improvements and the implications for customers doing in-place upgrades with no changes to the underlying hardware.
Note: Customers who are upgrading from CDH 5 to CDP 7 will get an HBase upgrade from HBase1 to HBase2 as well.
- Custom YCSB Update Only workload
  - Our custom YCSB Update Only workload performs
    - 100% UPDATE operations
  - An application example would be a metrics store
  - Workload performance: CDP 7 YCSB Update Only workload throughput (operations per second) was 20% better than with CDH 5
- YCSB Workload A
  - YCSB Workload A performs
    - 50% READ operations
    - 50% UPDATE operations
  - An application example would be a session store recording recent actions in a user session
  - Workload performance: CDP Private Cloud Base 7.1 HBase2 YCSB Workload A throughput (operations per second) was 15% better than CDH 5 HBase1
- YCSB Workload C (Read Only)
  - YCSB Workload C is a read-only workload and performs
    - 100% READ operations
  - An application example would be reading a user profile cache when profiles are constructed elsewhere (e.g., in Hadoop), or a banking system used to access and view account statements
  - Workload performance: CDP 7 YCSB Workload C had similar throughput (operations per second) to CDH 5
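The three operation mixes described above can be written down as YCSB workload properties. The following is a minimal sketch, not the exact property files used for these tests: `readproportion` and `updateproportion` are YCSB CoreWorkload property names, while the dictionary keys and the `to_properties` helper are hypothetical names chosen for illustration.

```python
# Operation mixes taken from the workload descriptions in this post.
# The keys ("update_only", "workload_a", "workload_c") are hypothetical labels.
WORKLOAD_MIX = {
    "update_only": {"readproportion": 0.0, "updateproportion": 1.0},
    "workload_a": {"readproportion": 0.5, "updateproportion": 0.5},
    "workload_c": {"readproportion": 1.0, "updateproportion": 0.0},
}


def to_properties(name: str) -> str:
    """Render one operation mix as lines for a YCSB properties file."""
    mix = WORKLOAD_MIX[name]
    # Guard against a mix whose proportions do not add up to 100%.
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError(f"proportions for {name} do not sum to 1")
    return "\n".join(f"{key}={value}" for key, value in sorted(mix.items()))


if __name__ == "__main__":
    for name in WORKLOAD_MIX:
        print(f"# {name}")
        print(to_properties(name))
```

Setting both proportions explicitly avoids relying on YCSB's defaults for whichever operation type is left out.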
Verdict: CDP 7 provides better performance than CDH 5 in YCSB
Custom Update Only workload: the CDP 7 YCSB Update Only workload performed 20% better than on CDH 5.
YCSB Workload A: CDP 7 YCSB Workload A performed 15% better than on CDH 5.
YCSB Workload C: the CDP 7 YCSB read-only Workload C had similar throughput (operations per second) to CDH 5.
During our testing, we noticed that upgrading from JDK 8 to JDK 11 within CDP 7 can improve performance by a further 5-10%. This is over and above the performance improvements gained by upgrading from CDH 5 to CDP 7.
CDP 7 comes with JDK 8 installed by default and supports an upgrade to JDK 11. In our test runs, CDP 7 was updated to use JDK 11 for the YCSB workload runs shown above. We ran the same workloads with JDK 8 as well, and the test results showed that JDK 11 performance is 5-10% better than JDK 8, as shown in the chart below.
To upgrade CDP 7 from JDK 8 to OpenJDK 11, follow the steps below:
Step 1: Install OpenJDK 11 on all hosts using the command for your operating system:
RHEL
sudo yum install java-11-openjdk
Ubuntu
sudo apt install openjdk-11-jdk
Step 2: On the Cloudera Manager Server host only (not required for other hosts):
- Open the file /etc/default/cloudera-scm-server in a text editor.
- Edit the line that begins with export JAVA_HOME (if this line does not exist, add it) and change the path to the path of the new JDK. The JDK is usually installed in /usr/lib/jvm (or /usr/lib64/jvm on SLES 12), but the path may differ depending on how the JDK was installed.
For more information on upgrading the JDK, see Upgrading the JDK.
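The edit in Step 2 can be sketched as a small text transformation. This is an illustrative sketch only: the `set_java_home` helper is a hypothetical name, and the JDK path in the usage example is an assumed location that should be replaced with the actual install path on the host.

```python
def set_java_home(config_text: str, jdk_path: str) -> str:
    """Return config_text with its 'export JAVA_HOME' line pointing at jdk_path.

    If no such line exists, one is appended, mirroring the manual step.
    """
    new_line = f"export JAVA_HOME={jdk_path}"
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("export JAVA_HOME"):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file content and JDK path, for illustration only.
    sample = "# sample settings\nexport JAVA_HOME=/old/jdk\n"
    print(set_java_home(sample, "/usr/lib/jvm/java-11-openjdk"), end="")
```

The function works on a string rather than on /etc/default/cloudera-scm-server directly, so the result can be reviewed before anything is written to the file.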
Test Environment
Test Methodology
CDH 5.16.3 (HBase1) was installed on the cluster, and workload data with 1 billion rows (dataset size 1 TB) was generated. After loading, we waited for all compaction operations to finish and then ran the CDH 5.16.3 YCSB workloads.
Once the CDH 5.16.3 runs were completed, CDP Private Cloud Base 7.1 (HBase2) was clean-installed and the data was regenerated on the same cluster. The CDP Private Cloud Base 7.1 YCSB workloads were then run to get the test timings. Before every workload run, we initialized the HBase table used by YCSB: a snapshot of usertable (utable_snap) was created and applied before every run.
Each workload was run 3 times for 15 minutes each to measure throughput*. The results shown are the averages of the 3 runs.
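One way to script a single run of this methodology is sketched below. It only builds the command strings and does not execute anything. The YCSB flags (`-P`, `-p`, `-threads`, `-s`), the `maxexecutiontime` property, and the HBase shell commands `disable`, `restore_snapshot`, and `enable` exist in those tools, but whether the tests were driven exactly this way is an assumption, and the workload file path and column family name are hypothetical.

```python
def ycsb_run_command(binding, workload_file, table, column_family,
                     threads, max_seconds):
    """Build the argument list for one time-limited YCSB run."""
    return [
        "bin/ycsb", "run", binding,
        "-P", workload_file,                      # workload properties file
        "-p", f"table={table}",
        "-p", f"columnfamily={column_family}",
        "-p", f"maxexecutiontime={max_seconds}",  # stop after this many seconds
        "-threads", str(threads),
        "-s",                                     # periodic status output
    ]


def restore_snapshot_script(table, snapshot):
    """Build HBase shell commands that reset a table to a snapshot."""
    return "\n".join([
        f"disable '{table}'",
        f"restore_snapshot '{snapshot}'",
        f"enable '{table}'",
    ])


if __name__ == "__main__":
    # 'workloads/workloada' and 'cf' are hypothetical; 900 s is 15 minutes.
    print(restore_snapshot_script("usertable", "utable_snap"))
    print(" ".join(ycsb_run_command(
        "hbase2", "workloads/workloada", "usertable", "cf", 20, 900)))
```

Resetting the table from the same snapshot before each run keeps the starting data identical across runs, which is what makes the averaged throughput numbers comparable.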
*Throughput (ops/sec) = No. of operations per second
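The averaging and comparison described here amount to a few lines of arithmetic. The sketch below uses made-up operation counts purely to show the calculation; they are not measurements from these tests, and the function names are hypothetical.

```python
def throughput(operations: int, seconds: float) -> float:
    """Operations per second for a single run."""
    return operations / seconds


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    run_seconds = 15 * 60  # each run lasted 15 minutes
    # Hypothetical operation counts for three runs per platform.
    baseline_ops = [900_000, 909_000, 891_000]
    candidate_ops = [1_125_000, 1_125_000, 1_125_000]

    baseline = mean(throughput(ops, run_seconds) for ops in baseline_ops)
    candidate = mean(throughput(ops, run_seconds) for ops in candidate_ops)
    print(f"baseline:  {baseline:.1f} ops/sec")
    print(f"candidate: {candidate:.1f} ops/sec")
    print(f"change:    {percent_change(baseline, candidate):+.1f}%")
```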
CDP Private Cloud Base 7.1 includes HBase2, and CDH 5.16.3 includes HBase1. Both ship with JDK 8 installed. CDP Private Cloud Base 7.1 also supports JDK 11 and was updated to use JDK 11 for YCSB testing; the CDH 5.16.3 runs used JDK 8 (1.8.0_141).
Test configurations
- YCSB version: 0.17.0
- YCSB binding version: hbase2 (CDP Private Cloud Base 7.1) and hbase1 (CDH 5)
- YCSB clients used: 2
- YCSB threads per client: 20
- Data size
- YCSB table at 1 TB scale
- Total number of records in the YCSB table: 1,000,000,000 (1 TB); each record is 1 KB
- Number of regions in the YCSB table: 250; with the 5+1 node cluster, that is approximately 50 regions per region server
- Average region storage space used per region server: 290 GB
- HBase region servers were configured with a 32 GB heap
- Only the L1 cache (LruBlockCache) was used, with a 12.3 GB cache size
- L1 cache hit rate observed on the region servers during runs was 85%
- L2 off-heap cache was not configured on the cluster
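The sizing figures in this list can be cross-checked with simple arithmetic. The sketch below does only that; it treats 1 KB as 1,000 bytes and 1 TB as 10^12 bytes, which is an assumption about how the sizes in this post were rounded.

```python
RECORDS = 1_000_000_000   # rows in the YCSB table
RECORD_BYTES = 1_000      # 1 KB per record (assumed to mean 1,000 bytes)
REGIONS = 250
REGION_SERVERS = 5
HEAP_GB = 32.0
BLOCK_CACHE_GB = 12.3


def sizing() -> dict:
    """Derive per-region and per-server figures from the test configuration."""
    return {
        "raw_data_tb": RECORDS * RECORD_BYTES / 1e12,
        "regions_per_server": REGIONS / REGION_SERVERS,
        "records_per_region": RECORDS // REGIONS,
        "records_per_server": RECORDS // REGION_SERVERS,
        "cache_fraction_of_heap": round(BLOCK_CACHE_GB / HEAP_GB, 3),
    }


if __name__ == "__main__":
    for key, value in sizing().items():
        print(f"{key}: {value}")
```

With these inputs, each region holds about 4 million records, and the block cache takes a little over a third of the region server heap.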
Cluster configs
- Cluster used: 6-node cluster (1 master + 5 region servers)
- Description: Dell PowerEdge R430, 20c/40t Xeon E5-2630 v4 @ 2.2 GHz, 128 GB RAM, 4 x 2 TB disks
- Security: None configured (No Kerberos)
Cloudera versions compared
C7 Version: CDP Private Cloud Base 7.1.0
C5 Version: CDH 5.16.3
JDKs used: JDK 8 (1.8.0_141) and JDK 11 (11.0.6)
Based on our testing (results above), customers upgrading from CDH 5 to CDP 7 should expect better performance for similar workloads than they are getting today.
Learn more about Cloudera Operational DB here
Hi Surbhi,
Thank you for this research and the detailed article! Really informative!
Do you have a similar research coming up on CDH6 vs CDP?
Thank you!
Liviu
Thanks Liviu. I am glad that this article was helpful to you. We’ll definitely consider your suggestion.