Learn why running real workloads on Cloudera’s internal EDH cluster is an important step in the overall QA process before releases.
At Cloudera, we strive to deliver a stable, reliable Apache Hadoop-based platform without sacrificing cutting-edge features. (See this post for an introduction to that process.)
In the past, we have written about how the Cloudera Support organization’s internal cluster helps improve the customer experience via CDH components such as Apache Impala (incubating) and Cloudera Search. In this post, we’ll describe how running new releases on our internal enterprise data hub (EDH) cluster, and upgrading that cluster to them, also helps improve the quality of Cloudera Enterprise (CDH and Cloudera Manager) itself. This is one part of a multi-step QA process during the software-development life cycle that also includes unit testing, static/dynamic code analysis, fault injection, multi-dimensional integration/system testing, and validation on real workloads before any release.
Inside the Cluster
Our internal EDH is colorfully known as the “Game of Nodes” (GoN) cluster. It currently has 46 nodes of the following types:
- 2 Lannister nodes run management services like NameNode and HBase Master. These need lots of memory but little disk space, so Lannister nodes have 128GB of RAM but only four 1TB disks.
- 3 Baelish nodes run coordination services like Apache ZooKeeper and JournalNode. These processes are relatively lightweight, so Baelish nodes have 96GB of RAM and four 1TB disks.
- 32 Hodor nodes are worker nodes running DataNode, RegionServer, Impala Daemon, Solr Server, and YARN NodeManager roles. Because we run different types of services on these, they need a mixture of disk space and memory. These have 128GB of RAM and twelve 2TB disks.
- 3 Raven nodes are dedicated Apache Kafka nodes for message delivery. Disk throughput is most important for these machines, but we built them the same as Hodor nodes for simplicity.
- 6 Snow nodes are edge nodes for running applications. Uniform hardware matters less for edge nodes, since they all run different processes. These range from 32GB to 128GB of RAM and from two to six disks.
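As a quick sanity check on the inventory above, the node counts and the uniform hardware specs can be tallied in a few lines. This is only an illustrative sketch: the `NodeType` class and the helper names are made up for this example, the Raven figures assume "the same as Hodor" means 128GB of RAM and twelve 2TB disks, and the Snow nodes are left out of the capacity totals because their hardware is not uniform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeType:
    """One row of the inventory; None means the hardware is not uniform."""
    name: str
    count: int
    ram_gb: Optional[int]
    disks: Optional[int]
    disk_tb: Optional[int]


# Figures taken from the list above. Raven is assumed identical to Hodor.
NODE_TYPES = [
    NodeType("Lannister", 2, 128, 4, 1),
    NodeType("Baelish", 3, 96, 4, 1),
    NodeType("Hodor", 32, 128, 12, 2),
    NodeType("Raven", 3, 128, 12, 2),
    NodeType("Snow", 6, None, None, None),  # 32-128GB RAM, 2-6 disks
]


def total_nodes(node_types):
    """Count every node, including the non-uniform edge nodes."""
    return sum(t.count for t in node_types)


def raw_disk_tb(node_types):
    """Sum raw disk capacity over node types with a known disk layout."""
    return sum(
        t.count * t.disks * t.disk_tb
        for t in node_types
        if t.disks is not None and t.disk_tb is not None
    )


def total_ram_gb(node_types):
    """Sum RAM over node types with a known, uniform RAM size."""
    return sum(t.count * t.ram_gb for t in node_types if t.ram_gb is not None)


print(total_nodes(NODE_TYPES))   # 46, matching the node count stated above
print(raw_disk_tb(NODE_TYPES))   # raw TB, excluding Snow nodes
print(total_ram_gb(NODE_TYPES))  # GB of RAM, excluding Snow nodes
```

The disk total is raw capacity before replication or any filesystem overhead, so it is not directly comparable to the volume of data stored on the cluster.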
The EDH is responsible for two types of workloads:
- A customer support use case, which requires extremely fast ingestion of customer data into systems that can power a web interface, like Apache HBase, Impala, and Apache Solr. Ingestion and data modeling for support are focused on speed for looking up individual records.
- A reporting use case, in which bulk loads of business data are processed into reports using tools like Impala and Apache Hive. For this use case, data is modeled for ease of analysis over large volumes of data, with data stored in either Apache Parquet or Apache Avro formats.
Cloudera intentionally uses every component of the CDH stack in its internal cluster for a simple reason: there is no substitute for testing with production applications. Although every component is unit-tested upstream, and the Cloudera Engineering team does extensive testing of its own, some issues only become apparent when real data is processed over long periods of time. Furthermore, simulating the messy/asymmetric data associated with real-world applications is extremely difficult.
Fortunately, our EDH has about 400TB of data that stretches back more than four years, with new data sources being added every week. So, it’s an ideal environment in which to verify that new releases work as expected, and it plays a critical role in improving product stability and reliability.
Since the launch of this enterprise data hub, we have filed more than 100 JIRAs across Cloudera Manager and all components of CDH. These issues range in severity from rewording error messages to situations that could bring down entire machines. Using an EDH cluster internally on a daily basis for business-critical applications allows us to catch issues before they affect customers.
One of the most harrowing experiences for any organization is performing an upgrade. Taking a system that is working and bringing it to a new version requires planning, testing, and confidence that your software vendor’s new version will be stable and your applications will continue to run as expected.
Cloudera has made substantial investments in testing upgrades, as we need to verify that all supported upgrade paths work without issues. For each release, we test rolling and non-rolling upgrades from each previous version of CDH on each operating system we support. In addition, we run tests to make sure that applications will continue to function on the new version.
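To give a sense of how quickly that upgrade matrix grows, the combinations described above (each previous version, each operating system, rolling and non-rolling) can be enumerated as a simple Cartesian product. The sketch below is illustrative only and is not Cloudera's test harness: the `upgrade_matrix` function is hypothetical, and the version and operating-system lists are placeholders rather than the actual supported set.

```python
from itertools import product

# The two upgrade styles named in the text.
UPGRADE_MODES = ("rolling", "non-rolling")


def upgrade_matrix(previous_versions, target_version, operating_systems):
    """Return one (source, target, os, mode) test case per combination."""
    return [
        (source, target_version, os_name, mode)
        for source, os_name, mode in product(
            previous_versions, operating_systems, UPGRADE_MODES
        )
    ]


# Placeholder inputs, assumed for illustration only.
previous_versions = ["5.4", "5.5", "5.6"]
operating_systems = ["os-a", "os-b"]

cases = upgrade_matrix(previous_versions, "5.7", operating_systems)
for case in cases:
    print(case)
print(len(cases))  # 3 versions x 2 operating systems x 2 modes = 12
```

Each additional supported version or operating system multiplies the number of cases, which is one reason the upgrade paths need dedicated, automated testing in addition to the internal cluster upgrade.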
To expand on this application testing, before each new minor version (e.g., from CDH 5.6 to 5.7) is released, we upgrade the Game of Nodes cluster and run workloads on it for several weeks, which allows us to catch regressions introduced by version upgrades before customers can be affected. For example, based on experiences when upgrading to CDH 5.4, we filed over 30 JIRAs across nearly the entire CDH stack, some of which involved potential show-stoppers that were fixed prior to release.
Testing in Other Environments
As noted in the introduction, our effort to run every platform component on the EDH cluster is only one step in the overall QA process. Running a production workload with real data and many users provides an opportunity to spot issues that appear more readily in a live deployment than in a test environment. Nevertheless, our EDH is a specific environment with specific workloads, and for that reason, running an internal cluster is not a QA strategy in itself.
Cloudera has dedicated Engineering resources for ensuring that all environments and workloads function as expected in every Cloudera Enterprise release. We test all supported configurations, simulate failures in clusters during tests, and test workloads that we do not run on our internal EDH cluster. Our internal cluster contributes test cases to the test suite that Engineering runs over all CDH deployments to ensure that releases maintain enterprise levels of stability and reliability.
Cloudera is dedicated to creating an open source platform that meets the highest quality expectations for business-critical systems. Running production workloads on an internal enterprise data hub that is continually upgraded to the current versions of CDH and Cloudera Manager is a critical component in making Cloudera’s platform an enterprise-ready, stable foundation for production applications.
In the next installment, we’ll cover our recently open-sourced distributed testing framework, which cuts upstream unit-testing time for Apache Hadoop components from hours to just 10 minutes.
Alan Jackoway is a Software Engineer at Cloudera.