Since my last blog, What you need to know to begin your journey to CDP, we have received many requests for a Cloudera tool that analyzes workloads and helps with the upgrade or migration to Cloudera Data Platform (CDP). The good news is that Cloudera has a tried and tested tool, Workload Manager (WM), that meets these needs. WM saves time and reduces risk during upgrades or migrations.
How WM helps the Move to CDP
WM reveals strengths and weaknesses in workloads that run on Cloudera clusters. Using WM, you get an in-depth understanding of workload problems which leads to finding the root cause. WM simplifies troubleshooting failed jobs and optimizing slow jobs. After a job ends, WM gets information about job execution from the Telemetry Publisher, a role in the Cloudera Manager Management Service. Performance metrics appear in charts and graphs.
WM compares the current and previous jobs by creating baselines for identifying and addressing performance problems.
WM can help with:
- Preparing Hive, Impala and Spark workloads in legacy CDH clusters for upgrade or migration to CDP.
- Preparing Hive and Spark workloads in legacy HDP clusters for upgrade or migration to CDP.
- Identifying issues such as resource contention, rogue users and inefficiently written SQL
- Providing prescriptive tuning to address identified issues
- Establishing performance baselines between CDH/HDP and CDP
- Suggesting workloads that should move to public cloud and understanding the public cloud costs. WM offers bursting to the public cloud for Impala workloads. Bursting to the public cloud for other workloads such as Hive or Spark is planned for future WM releases.
In this blog, we walk through the Impala workloads analysis in iEDH, Cloudera’s own Enterprise Data Warehouse (EDW) implementation on CDH clusters. We will also use WM to evaluate workloads for upgrade and migration.
Analyze iEDH workloads with WM for upgrade and migration
Identifying common iEDH issues, such as resource contention, rogue users, and inefficiently written SQL, can simplify the move to CDP and isolate upgrade problems.
Identifying Resource Usage
Take a look at the Resource Consumption graph of an iEDH workload. From this graph you might spot resource contention.
When we look at WM, we spot the spikes in CPU core hours and memory use. To rule out a serious problem, we take a look at resource consumption history. We might find the root cause by realizing that a problem recurs at a particular time, or coincides with another event.
As part of upgrade and migration planning, it’s a good idea to eliminate resource contention; otherwise, problems will resurface in CDP as well. We can gauge when to move all workloads based on the resource usage that WM reveals.
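The recurrence check described above can be sketched in a few lines. This is a minimal illustration, not a WM feature or export format: the `(timestamp, cpu_core_hours)` sample layout, the function name `recurring_spike_hours`, the `factor` threshold, and the sample values are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def recurring_spike_hours(samples, factor=2.0):
    """Return the hours of day whose average usage exceeds `factor` times
    the overall average.

    `samples` is a list of (ISO timestamp, cpu_core_hours) pairs. This
    layout is hypothetical and is not a WM export format.
    """
    if not samples:
        return []
    by_hour = defaultdict(list)
    for timestamp, usage in samples:
        by_hour[datetime.fromisoformat(timestamp).hour].append(usage)
    overall = mean(usage for _, usage in samples)
    return sorted(
        hour for hour, values in by_hour.items() if mean(values) > factor * overall
    )


# Made-up samples: usage spikes at 02:00 on two consecutive days.
samples = [
    ("2020-06-01T02:00:00", 90.0),
    ("2020-06-01T10:00:00", 10.0),
    ("2020-06-01T15:00:00", 12.0),
    ("2020-06-02T02:00:00", 95.0),
    ("2020-06-02T10:00:00", 11.0),
    ("2020-06-02T15:00:00", 9.0),
]
print(recurring_spike_hours(samples))
```

Grouping by hour of day is only one way to look for recurrence. A spike that coincides with another event, such as a nightly batch job, would show up the same way if that event runs on a fixed schedule.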
Identifying Rogue Users
We detect a rogue user (any user consuming excessive resources) by comparing their resource consumption (CPU, Memory) with the queries they issue. For example, a user identified by “3xksle8z” runs only 3% of the queries, yet consumes far more memory than any other user, consuming about 5.9 PiB.
Which queries executed by 3xksle8z are consuming the most memory? We define a new workload called user_3xksle8z to track 3xksle8z’s resource consumption and understand the nature of their queries.
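The comparison between a user's share of queries and their share of memory can be illustrated with a short sketch. Everything here is an assumption for illustration: the `(user, memory)` record layout, the function names, the `ratio` threshold, and the sample values (apart from the 3% query share quoted above) are not taken from WM.

```python
from collections import defaultdict


def user_shares(queries):
    """Return {user: (query_share, memory_share)} as fractions of the total.

    `queries` is a list of (user, memory_used) pairs, one per query.
    The record layout is hypothetical.
    """
    counts = defaultdict(int)
    memory = defaultdict(float)
    for user, memory_used in queries:
        counts[user] += 1
        memory[user] += memory_used
    total_queries = sum(counts.values())
    total_memory = sum(memory.values())
    return {
        user: (
            counts[user] / total_queries,
            memory[user] / total_memory if total_memory else 0.0,
        )
        for user in counts
    }


def rogue_users(queries, ratio=5.0):
    """Flag users whose memory share is at least `ratio` times their query share."""
    shares = user_shares(queries)
    return sorted(
        user
        for user, (query_share, memory_share) in shares.items()
        if memory_share >= ratio * query_share
    )


# Made-up data: one user issues 3 of 100 queries but uses most of the memory.
queries = (
    [("3xksle8z", 500.0)] * 3
    + [("user_a", 1.0)] * 50
    + [("user_b", 1.0)] * 47
)
print(rogue_users(queries))
```

The threshold of 5 is arbitrary. In practice, the right cutoff depends on how skewed resource usage normally is on the cluster.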
Identifying Inefficiently Written SQL and Providing Prescriptive Tuning Recommendations
Looking at the duration or complexity of the queries, we uncover queries that have not been written in an optimal way. For example, we see a large number of joins in these queries:
Too many joins and inline views characterize inefficiently written SQL. Even though Impala can process hundreds of joins in a minute, we need to find and reduce any inordinate number of joins. We can drill down to see performance issues related to the SQL statement. WM dissects inefficient SQL and issues prescriptive tuning recommendations, such as to denormalize tables and to materialize inline views.
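As a rough illustration of screening for join-heavy statements, the sketch below counts `JOIN` keywords in query text. It is not how WM analyzes SQL. The function names, the `max_joins` threshold, and the sample queries are assumptions, and plain keyword matching will also count the word inside comments or string literals.

```python
import re

# Matches the JOIN keyword as a whole word, in any letter case.
JOIN_PATTERN = re.compile(r"\bjoin\b", re.IGNORECASE)


def count_joins(sql):
    """Count JOIN keywords in a SQL string (a crude proxy for join count)."""
    return len(JOIN_PATTERN.findall(sql))


def flag_join_heavy(queries, max_joins=10):
    """Return (query_id, join_count) for queries above `max_joins`.

    `queries` maps a query ID to its SQL text. Both the mapping and the
    threshold are hypothetical.
    """
    flagged = []
    for query_id, sql in queries.items():
        joins = count_joins(sql)
        if joins > max_joins:
            flagged.append((query_id, joins))
    return flagged


# Made-up queries: q1 has one join, q2 has twelve.
queries = {
    "q1": "SELECT * FROM a JOIN b ON a.id = b.id",
    "q2": "SELECT * FROM t0 "
    + " ".join(f"LEFT JOIN t{i} ON t0.id = t{i}.id" for i in range(1, 13)),
}
print(flag_join_heavy(queries))
```

A real screen would use a SQL parser so that inline views and nested subqueries are counted as well.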
Performance Baseline between iEDH on CDH5 and CDP
You can establish baselines from the health issues generated by WM.
We compare the current run of a job to a baseline derived from performance metrics. For example, you compare a job that ran at 1:00 am with the baseline that ran at a different time. Before moving to CDP, pick a few workloads with maximum impact on CDH, and establish a CDH baseline. After moving to CDP, take a snapshot to use as a CDP baseline. If you see significant deviation between the CDH and CDP baselines, you may drill down to understand the effects of the upgrade.
Alternatively, you can just look at WM trends instead of baselines. From trends, you can see what happened in the past.
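To make the baseline comparison concrete, here is a minimal sketch that flags metrics whose relative change between a CDH baseline and a CDP snapshot exceeds a tolerance. The metric names, the values, the `tolerance` default, and the `deviations` function are all hypothetical. WM derives its own baselines and this does not reproduce that logic.

```python
def deviations(baseline, current, tolerance=0.25):
    """Return {metric: relative_change} for metrics that moved by more
    than `tolerance` relative to the baseline.

    Metrics missing from `current`, or with a zero baseline value, are
    skipped because no relative change can be computed for them.
    """
    flagged = {}
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue
        change = (current[metric] - base_value) / base_value
        if abs(change) > tolerance:
            flagged[metric] = change
    return flagged


# Made-up numbers for a single workload before and after the move.
cdh_baseline = {"duration_s": 120.0, "memory_gib": 40.0, "cpu_core_hours": 8.0}
cdp_snapshot = {"duration_s": 80.0, "memory_gib": 60.0, "cpu_core_hours": 8.4}

for metric, change in deviations(cdh_baseline, cdp_snapshot).items():
    print(f"{metric}: {change:+.0%}")
```

A negative change on duration is an improvement, while a positive change on memory may warrant a drill-down. The sketch reports both directions and leaves the interpretation to the reader.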
Identify Resource Hungry Workloads
We can identify resource-hungry workloads with WM. WM is typically used to explore clusters and workload health before migrating Impala workloads to CDP. For example, we can compare the resource consumption of a particular workload against all workloads in the cluster. In the chart below, a single user consumes half the memory (1.9 TiB vs. 3.7 TiB).
If you burst this user to the cloud, how much pressure will it relieve on your on-premises system? To find out, create another workload without that user and compare it over the same time frame. You see memory consumption of 84 GiB and 6K CPU, versus 3.7 TiB memory and 20K CPU.
We can optimize workloads that have problems before migrating them to CDP Public Cloud. It is a best practice to perform an in-depth analysis before bursting a workload to the cloud. For example, we can analyze how queries that access a particular database, or use a specific resource pool, are performing against SLAs. We can see how queries from a specific user perform. We can determine if the system is running at capacity by looking at suboptimal queries.
Recommend workloads move to CDP based on workload patterns
For iEDH, there are three types of workload patterns observed during our analysis.
1. Fixed Reports / Data Engineering jobs
- Batched and scripted
- Distributed across a wide audience on a recurring schedule with fixed and predefined formatting
- Often mission-critical to the various lines of business (risk analytics, platform support, or data engineering), which hydrate critical data pipelines for downstream consumption
2. BI Interactive Reports or Dashboards
- A large volume of data produced from a report template that runs interactively (financial planning, for example)
- A unique challenge because these workloads are usually resource hungry
3. Ad-Hoc Reports or Exploration
- Self-serve data (no burden on IT)
- Involves a query that asks a specific business question
- Presented in a simple result set for one-time-use or a visually-pleasing format
Ad-Hoc reports are generated as needed and usually rely on much smaller amounts of data and consume far fewer resources compared to BI Interactive reports. Ad-hoc reporting is convenient for reporting on a specific data point that answers specific business questions. For example, to measure release quality, how many technical support cases were opened for certain components on a platform? Or, to measure release adoption, how many customers run certain releases? These are just a few examples. These ad-hoc reports often evolve into BI interactive reports with SLA requirements when more users need access.
Additional characteristics for each workload pattern in iEDH are shown below:
|Workload Characteristics||Fixed Reports / Data Engineering Jobs||Ad-Hoc Reports or Exploration||BI Interactive Reports or Dashboards|
|Report Format||Defined||Loosely defined / off-the-cuff||Defined|
|Resource Consumption||Perpetual intensity||Far lower (compared with BI)||Resource hungry (CPU & Memory)|
|Statement Types||DDL heavy||Query heavy||Query heavy (may include DDLs due to tooling drivers)|
|Query Complexity||Very Complex||Simple||Complex|
|Query Strings||Short (human generated)||Short (human generated)||Long (machine generated)|
|Query schedule||Recurring at certain time frame (e.g. 2AM)||Anytime during business hours||Anytime during business hours|
|SLA / Lifespan||Long lived||Various||Short lived|
CDP Form Factors Recommendations
Cloudera Data Platform (CDP) has two form factors:
- CDP Private Cloud is installable software which has a Private Cloud Base cluster and Kubernetes orchestrated containerized clusters for experiences such as Data Warehouse and Machine Learning.
- CDP Public Cloud is a service that can be run on multiple clouds (AWS, Azure) with improved and new experiences, such as Data Warehousing, Data Engineering, Machine Learning, Data Visualization, and Workload Management.
In CDP Private Cloud, customers can continue to build on the applications and code that have been developed and tuned for years with the Base cluster, and create an optimized experience through features like workload isolation and compute-storage separation with the Kubernetes orchestrated containerized clusters.
In CDP Public Cloud, customers can create new services or migrate existing services from on-premises clusters without installing and managing data platform software. CDP Public Cloud services are managed by Cloudera, but unlike other public cloud services, your data will always remain under your control in your VPC. CDP runs on AWS and Azure, with Google Cloud Platform coming soon.
Based on the workload patterns described above, we recommend the CDP form factor that best fits each pattern's characteristics. To make holistic recommendations, we combine multiple (primary and secondary) workload patterns to capture use cases more accurately, and we base the recommendations on these combined use cases. If the primary workload is optional, the secondary workload becomes the primary workload.
|Primary Workload||Secondary Workload||Recommended CDP Form Factors||Benefits|
|Data Engineering jobs only||(none)||CDP Data Engineering (Public Cloud & Private Cloud)||
|Data Engineering jobs||Fixed Reports||CDP Private Cloud Base||
|Data Engineering jobs (optional)||BI Interactive Reports||CDP Data Warehouse (Public Cloud or Private Cloud)||
|Data Engineering jobs (optional)||Ad-hoc Reports||CDP Data Warehouse, Public Cloud||
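The recommendations in the table can be read as a simple lookup keyed on the primary and secondary workload. The sketch below encodes them that way. The `recommend` function, the use of `None` for "no secondary workload", and the handling of an omitted optional primary workload are one interpretation of the table, offered as an assumption and not as Cloudera's decision logic.

```python
# (primary workload, secondary workload) -> recommended CDP form factor,
# copied from the table above. None means there is no secondary workload.
RECOMMENDATIONS = {
    ("Data Engineering jobs", None):
        "CDP Data Engineering (Public Cloud & Private Cloud)",
    ("Data Engineering jobs", "Fixed Reports"):
        "CDP Private Cloud Base",
    ("Data Engineering jobs", "BI Interactive Reports"):
        "CDP Data Warehouse (Public Cloud or Private Cloud)",
    ("Data Engineering jobs", "Ad-hoc Reports"):
        "CDP Data Warehouse, Public Cloud",
}

# Secondary workloads for which the table marks the primary as optional.
OPTIONAL_PRIMARY = {"BI Interactive Reports", "Ad-hoc Reports"}


def recommend(primary, secondary=None):
    """Return the recommended form factor, or None if the table has no match.

    When a workload whose primary is optional is passed on its own, it is
    treated as the secondary of the matching table row (an interpretation
    of "the secondary workload becomes the primary workload").
    """
    if secondary is None and primary in OPTIONAL_PRIMARY:
        primary, secondary = "Data Engineering jobs", primary
    return RECOMMENDATIONS.get((primary, secondary))


print(recommend("Data Engineering jobs", "Fixed Reports"))
print(recommend("BI Interactive Reports"))
```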
How to collect application logs
If you would like to try what we shared in this blog for your own upgrade and migration to CDP from CDH, please reach out to your account team to arrange a trial. To collect your application logs, you can either enable Cloudera Manager (CM) to send live data to WM automatically, or manually collect and upload your workloads from your clusters.
Automated Mode (preferred option)
To enable automated mode, you must be running Cloudera Manager (CM) 6.2 or later, or upgrade to it. CM then automatically sends the workload metrics directly to your WM account.
In case you prefer not to upgrade to CM 6.2+, you can also evaluate WM via the manual mode. The following steps will walk you through how to manually extract the logs for Impala workloads from your own environment and upload them into WM. Your account team can assist you with the details.
- Provide the Service Monitor (SMON) information. The default location for this folder is /var/lib/cloudera-service-monitor/impala/work_details
- Create a tarball of this folder using the following command:
#> tar -czvf name_of_archive.tar.gz /path/to/directory/listed/above
- Send Cloudera the tarballs via a support case titled “WM Evaluation for <your name>”
- After the upload is complete, you should be able to see the workloads populate in your WM evaluation account
Take the next steps on your journey to CDP from CDH right away by leveraging WM’s unique capabilities to reduce migration and upgrade time and risk. If you don’t have WM running in your environment, please reach out to your account team to arrange a trial. To learn more about CDP, please check out the CDP Resources page. As always, please provide your feedback in the comments section below.