Delivering Modern Enterprise Data Engineering with Cloudera Data Engineering on Azure

After the launch of CDP Data Engineering (CDE) on AWS a few months ago, we are thrilled to announce that CDE, the only cloud-native service purpose-built for enterprise data engineers, is now available on Microsoft Azure.

CDP Data Engineering offers an all-inclusive toolset for data pipeline orchestration, automation, advanced monitoring, and visual profiling, along with comprehensive management capabilities for streamlining ETL processes and making complex data actionable across your analytic teams.

Delivered through the Cloudera Data Platform (CDP) as a managed Apache Spark service on Kubernetes, CDE offers unique capabilities to enhance productivity for data engineering workloads:

  • Visual GUI-based monitoring, troubleshooting and performance tuning for faster debugging and problem resolution
  • Native Apache Airflow and robust APIs for orchestrating and automating job scheduling and delivering complex data pipelines anywhere
  • Resource isolation and centralized GUI-based job management
  • CDP data lifecycle integration and SDX security and governance

For enterprise organizations, managing and operationalizing increasingly complex data across the business remains a big challenge. To remain truly competitive, these organizations must consolidate and curate data generated from a variety of sources and present it downstream for consumption, where it can be used to drive business outcomes. With the announcement of CDP Data Engineering on Microsoft Azure, data engineers can now deploy CDE across two leading cloud infrastructure providers and take advantage of its powerful built-in tools to orchestrate and automate complex data pipelines. Prerequisites for deploying CDP Data Engineering on Azure can be found here.

Key features of CDP Data Engineering

Easy job deployment

For a data engineer who has already built their Spark code on their laptop, we have made deployment of jobs one click away. A simple wizard lets them define all the key configurations of their job.

CDE supports Scala, Java, and Python jobs. We have kept the number of fields required to run a job to a minimum, while exposing all the typical configurations data engineers have come to expect: runtime arguments, default configuration overrides, dependencies, and resource parameters.
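To make this concrete, the kind of application that might be deployed through the wizard is an ordinary Spark script. The file below is a minimal, hypothetical PySpark example (the column names and paths are illustrative, not anything CDE-specific); it would be uploaded as the main program, with its input and output paths supplied as runtime arguments.

# etl_job.py - a hypothetical PySpark application that could be deployed as a CDE job.
# Input and output paths are passed in as runtime arguments from the job configuration.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main(input_path, output_path):
    spark = SparkSession.builder.appName("orders-daily-rollup").getOrCreate()

    # Read raw order events and aggregate revenue per day.
    orders = spark.read.parquet(input_path)
    daily = (
        orders
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )

    daily.write.mode("overwrite").parquet(output_path)
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])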

Flexible orchestration with Apache Airflow

CDE has a completely new orchestration service powered by Apache Airflow, the preferred tooling for modern data engineering. Airflow lets you define pipelines in Python code as entities called DAGs (directed acyclic graphs) and orchestrate a variety of jobs, including Spark, Hive, and even Python scripts.

CDE automatically takes care of generating the Airflow Python configuration using the custom CDE operator. By leveraging Airflow, data engineers can also draw on the hundreds of community-contributed operators to define their own pipelines. This allows defining custom DAGs and scheduling jobs based on event triggers such as an input file arriving in an S3 or ADLS bucket. This is what makes Airflow so powerful and flexible. Make sure to take the Airflow tour to learn more.
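As an illustration, a DAG that chains two jobs already defined in a CDE virtual cluster might look like the sketch below. The job names are hypothetical, and the CDEJobRunOperator import path is an assumption based on CDE's embedded Airflow; check the CDE documentation for the exact module shipped with your virtual cluster.

# pipeline_dag.py - a minimal sketch of an Airflow DAG using the CDE operator.
# Job names are hypothetical and the operator import path is assumed from CDE's
# embedded Airflow distribution.
from datetime import datetime, timedelta

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_orders_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task triggers a Spark job that was already created in the CDE virtual cluster.
    ingest = CDEJobRunOperator(task_id="ingest_orders", job_name="ingest-orders")
    rollup = CDEJobRunOperator(task_id="rollup_orders", job_name="orders-daily-rollup")

    ingest >> rollup  # the rollup runs only after the ingest succeeds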

Automation APIs

A key aspect of ETL or ELT pipelines is automation. We built CDE with an API-centric approach to streamline data pipeline automation to any downstream analytic workflow. All the job management features available in the UI use a consistent set of APIs that are accessible through a CLI and REST, allowing for seamless integration with existing CI/CD workflows and third-party tools.

Some of the key entities exposed by the API are listed below, followed by a short usage sketch:

  • Jobs are the definition of something that CDE can run, usually composed of the application type, main program, and associated configuration. For example, a Java program running Spark with specific configurations. CDE also supports Airflow job types.
  • A job run is an execution of a job. For example, one run of a Spark job on a CDE virtual cluster.
  • A resource is a directory of files that can be uploaded to CDE and then referenced by jobs. This is typically for application files (e.g., .jar or .py files) and reference files, not the data that the job run will operate on.
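To show how these entities surface through the REST API, the snippet below lists the jobs in a virtual cluster and triggers a run of one of them. The endpoint paths, environment variable names, and token handling are assumptions used to illustrate the pattern; the exact base URL and authentication flow for your virtual cluster are described in the CDE documentation.

# jobs_api_example.py - a sketch of driving CDE through its REST API.
# The base URL and access token are placeholders read from the environment;
# the endpoint paths shown here are assumptions for illustration only.
import os

import requests

JOBS_API_URL = os.environ["CDE_JOBS_API_URL"]   # virtual cluster jobs API base URL
ACCESS_TOKEN = os.environ["CDE_ACCESS_TOKEN"]   # token obtained via the CDE auth flow
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# List the jobs defined in the virtual cluster.
resp = requests.get(f"{JOBS_API_URL}/jobs", headers=HEADERS)
resp.raise_for_status()
for job in resp.json().get("jobs", []):
    print(job.get("name"), job.get("type"))

# Trigger a run of one job; the response describes the new job run.
run = requests.post(f"{JOBS_API_URL}/jobs/orders-daily-rollup/run", headers=HEADERS)
run.raise_for_status()
print("Started run:", run.json())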

New in CDE 1.7: Spark 3 now available in Tech Preview

Spark 3 and Spark 2 can now both be run in virtual clusters on CDP Public Cloud. Key Spark features such as the history server and reads and writes to external servers are available in the tech preview, and internal benchmarking shows up to a 30% performance improvement over Spark 2. We’ll share more about Spark 3 and its new features closer to GA.

Get Started

To get hands-on with Cloudera Data Engineering on Cloudera Data Platform, sign up for a CDP test drive today!

 

Varun Jaitly

1 Comment

by Daniel

Hi Varun,

Spark 3.2 will also be released soon. Do you plan to make it available on CDE shortly after its release?
It has great improvements with regard to Hive integration, and Koalas (the pandas API on Spark) is now merged into Spark.

Thanks.
