Over time, using the wrong tool for the job can wreak havoc on the health of your environment. Here are some tips and tricks of the trade to prevent well-intended yet inappropriate data engineering and data science activities from cluttering or crashing the cluster.
Take precautions when using CDSW as an all-purpose workflow management and scheduling tool. Using CDSW primarily for scheduling and automating any type of workflow is a misuse of the service. For data engineering teams, Airflow is regarded as the best-in-class tool for orchestration (scheduling and managing end-to-end workflows) of pipelines built with programming languages like Python and Spark. Airflow provides a trove of libraries as well as operational capabilities like error handling to assist with troubleshooting.
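To make that concrete, here is a minimal sketch of an Airflow pipeline that uses those operational capabilities; the task names, schedule, and alert address are hypothetical.

```python
# A minimal Airflow DAG sketch (hypothetical task names, schedule, and e-mail).
# It illustrates the operational features mentioned above: retries on failure,
# alert e-mails, and a clear task-level view of the pipeline.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data_engineering",
    "retries": 3,                              # error handling: retry failed tasks
    "retry_delay": timedelta(minutes=5),
    "email": ["de-oncall@example.com"],        # hypothetical alert address
    "email_on_failure": True,
}

def extract():
    print("pull data from the source system")

def transform():
    print("clean and enrich the extracted data")

def load():
    print("write the result to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```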
Related but different, CDSW can automate analytics workloads with an integrated job-pipeline scheduling system that supports real-time monitoring, job history, and email alerts. For data engineering and data science teams, CDSW is highly effective as a comprehensive platform for training, developing, and deploying machine learning models. It can provide a complete solution for data exploration, data analysis, data visualization and visual applications, and model deployment at scale.
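As a simplified illustration of that model-deployment workflow, a CDSW model is essentially a Python function that the platform wraps behind a REST endpoint. The sketch below assumes a pre-trained model artifact and feature names that are purely hypothetical.

```python
# score.py - a minimal sketch of a function deployed as a CDSW model.
# The platform wraps the named function behind a REST endpoint; callers send a
# JSON payload that arrives here as a Python dict. The model file and feature
# names are hypothetical.
import joblib

model = joblib.load("churn_model.pkl")   # hypothetical pre-trained model artifact

def predict(args):
    # args is the request payload, e.g. {"tenure": 12, "monthly_charges": 70.5}
    features = [[args["tenure"], args["monthly_charges"]]]
    score = float(model.predict_proba(features)[0][1])
    return {"churn_probability": score}
```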
Impala vs Spark
Use Impala primarily for analytical workloads triggered by end users. Impala delivers its best analytical performance on properly designed datasets (well-partitioned, compacted). Spark is primarily used by data engineers and data scientists to build ETL workloads. It handles complex workloads well because cluster resources can be controlled programmatically and used efficiently.
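A small sketch of that division of labor, with hypothetical table, path, and host names: Spark does the heavy ETL and writes a well-partitioned, compacted dataset, while Impala (here via the impyla client) serves the interactive analytical query.

```python
# Sketch: Spark builds the dataset, Impala serves the analytics.
# Table, column, path, and host names are hypothetical.
from pyspark.sql import SparkSession
from impala.dbapi import connect   # impyla client

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# ETL in Spark: heavy joins and cleansing, written as partitioned, compacted Parquet.
orders = spark.read.parquet("/data/raw/orders")
cleaned = (orders
           .filter("status = 'COMPLETE'")
           .repartition("order_date"))        # fewer, larger files per partition
(cleaned.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("/data/curated/orders"))

# Analytics in Impala: fast, interactive queries for end users
# (assumes the curated table is registered in the metastore).
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT order_date, COUNT(*) AS orders, SUM(total) AS revenue
    FROM curated.orders
    WHERE order_date >= '2023-01-01'
    GROUP BY order_date
""")
for row in cur.fetchall():
    print(row)
```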
Impala only masquerades as an ETL pipeline tool: use NiFi or Airflow instead
It is common for Cloudera Data Platform (CDP) users to ‘test’ pipeline development and creation with Impala because it facilitates fast, iterative development and testing. It is also common to then turn those Impala queries into ETL-style production pipelines instead of refining them using Hive or Spark ETL tools as best practices dictate. Over time, those practices lead to cluster and Impala instability.
So which open source pipeline tool is better, NiFi or Airflow?
That depends on the business use case, use case complexity, workflow complexity, and whether batch or streaming data is required. Use NiFi for ETL of streaming data, when real-time data processing is needed, or when data must flow from various sources rapidly and reliably. NiFi’s data provenance capability makes it simple to enhance, test, and trust data that is in motion.
Airflow comes in handy when complex, independent, typically on-premises data pipelines become difficult to manage: it divides a workflow into small, independent tasks written in Python that can be executed in parallel for faster runtimes. Airflow’s prebuilt operators can also simplify the creation of data pipelines that require automation and movement of data across diverse sources and systems, as sketched below.
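To make the "small independent tasks in parallel" point concrete, here is a hedged sketch; the task names and script path are hypothetical, and the SparkSubmitOperator comes from the Apache Spark provider package.

```python
# Sketch of an Airflow DAG that fans out independent tasks to run in parallel,
# then uses a prebuilt operator to hand off to Spark. Names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

def pull(source):
    print(f"pull data from {source}")

with DAG(
    dag_id="parallel_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Independent extracts run in parallel because nothing orders them
    # relative to one another.
    extracts = [
        PythonOperator(
            task_id=f"pull_{source}",
            python_callable=pull,
            op_args=[source],
        )
        for source in ("crm", "billing", "clickstream")
    ]

    # Prebuilt operator: submit the downstream Spark job once all extracts finish.
    transform = SparkSubmitOperator(
        task_id="transform_with_spark",
        application="/jobs/transform.py",   # hypothetical PySpark script
    )

    extracts >> transform
```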
Le Service à Trois
HBase + Phoenix + Solr is a great combination for any analytical use case that runs against operational/transactional datasets. HBase provides the data format suited to transactional needs, Phoenix supplies the SQL interface, and Solr enables index-based search capability. Voilà!
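A quick, hedged sketch of the SQL half of that combination, using the phoenixdb client against a Phoenix Query Server; the URL, table, and columns are hypothetical, the table itself is stored in HBase, and a Solr collection would be indexed separately for free-text search.

```python
# Sketch: querying an HBase-backed table through Phoenix's SQL interface.
# Assumes a Phoenix Query Server is running; URL and schema are hypothetical.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/", autocommit=True)
cur = conn.cursor()

# Phoenix DDL creates/maps an HBase table with a SQL schema.
cur.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        txn_id      VARCHAR PRIMARY KEY,
        account_id  VARCHAR,
        amount      DECIMAL(12, 2),
        txn_time    TIMESTAMP
    )
""")

cur.execute("UPSERT INTO transactions VALUES ('t-001', 'a-42', 129.95, CURRENT_TIME())")

# SQL over operational data stored in HBase.
cur.execute("SELECT account_id, SUM(amount) FROM transactions GROUP BY account_id")
print(cur.fetchall())
```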
Monitoring: should I use WXM or Cloudera Manager?
It can be difficult to analyze the performance of millions of jobs/queries running across thousands of databases with no defined SLAs. Which tool provides better visibility and insights for decision-making?
Use Cloudera’s observability tool WXM (Workload Manager) to profile workloads (Hive, Impala, YARN, and Spark) and discover optimization opportunities. The tool provides insights into day-to-day query successes and failures, memory utilization, and performance. It can compare runtimes to identify and analyze the root causes of failed or abnormally long/slow queries. The Workload View facilitates workload analysis at a much finer grain (e.g. analyzing how queries access a particular database, or how a specific resource pool performs against its SLAs).
Also use WXM to assess data storage (HDFS), which can play a significant role in query optimization. Impala queries may perform slowly or even crash if data is spread across numerous small files and partitions. WXM’s file size reporting identifies tables with a large number of files and partitions, as well as opportunities to compact small files.
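When WXM flags such a table, the remediation is usually a straightforward compaction job; here is a hedged sketch in Spark, with hypothetical paths, partition column, and target file count.

```python
# Sketch: compacting a small-file table with Spark. Paths, the partition
# column, and the target file count are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_small_files").getOrCreate()

df = spark.read.parquet("/data/curated/events")

# Rewrite each partition into a handful of larger files instead of thousands
# of tiny ones, which keeps Impala's metadata and scan overhead manageable.
(df.repartition(8, "event_date")
   .write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("/data/curated/events_compacted"))
```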
Although WXM provides actionable insights for workload management, the Cloudera Manager (CM) console is the best tool for host and cluster management activities, including monitoring the health of hosts, services, and role-level instances. CM facilitates issue diagnosis with health test functions, metrics, charts, and visuals. We highly recommend that you have alerts enabled across your cluster components to notify your operations team of failures and to provide log entries for troubleshooting.
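If you want to fold CM’s health view into your own monitoring, the CM REST API exposes the same host and service health data. A hedged sketch follows; the hostname, credentials, port, and API version are assumptions, so check what your CM instance supports.

```python
# Sketch: pulling host health summaries from the Cloudera Manager REST API.
# Hostname, credentials, port, and API version are assumptions.
import requests

CM_URL = "https://cm.example.com:7183/api/v41"
AUTH = ("monitoring_user", "change_me")

# In production, point verify at your CA bundle rather than disabling it.
resp = requests.get(f"{CM_URL}/hosts", auth=AUTH, verify=False)
resp.raise_for_status()

for host in resp.json().get("items", []):
    summary = host.get("healthSummary", "UNKNOWN")
    if summary != "GOOD":
        print(f"{host['hostname']}: {summary}")
```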
Add both Catalogs and Atlases to your library
Operating Atlas and Cloudera Data Catalog natively in the cluster makes it easy to tag data and capture lineage at both the data and process level for presentation via the Data Catalog interface.
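Tagging can also be automated against Atlas’s REST API so that pipelines classify the data they produce. A hedged sketch follows; the host, port, credentials, classification name, and entity GUID are all assumptions.

```python
# Sketch: attaching a classification (tag) to an entity through the Atlas
# REST API. Host, port, credentials, tag name, and GUID are assumptions.
import requests

ATLAS_URL = "https://atlas.example.com:31443/api/atlas/v2"
AUTH = ("atlas_user", "change_me")

entity_guid = "1234-abcd-5678"   # hypothetical GUID of a hive_table entity

resp = requests.post(
    f"{ATLAS_URL}/entity/guid/{entity_guid}/classifications",
    auth=AUTH,
    json=[{"typeName": "PII"}],   # assumes a 'PII' classification type exists
)
resp.raise_for_status()
print("tag applied:", resp.status_code)
```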
As always, if you need assistance selecting or implementing the right tool for the right job, undertake Cloudera Training or engage our Professional Services experts.
Visit our Data and IT Leaders page to learn more.