Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak Nabu

Modak, a leading provider of modern data engineering solutions, is now a certified solution partner with Cloudera. Customers can seamlessly automate migration from on-premises deployments to CDP, Cloudera’s cloud-based enterprise data platform, and dynamically auto-scale cloud services through the integration of Cloudera Data Engineering (CDE) with Modak Nabu™.

Modak’s Nabu™ is a born-in-the-cloud, cloud-neutral integrated data engineering application designed to accelerate enterprises’ journey to the cloud. Modak empowers organizations to maximize their ROI from existing analytics infrastructure through interoperability. Nabu™ converges data cataloging, data ingestion, data profiling, data tagging, data discovery, curation of data products, and data exploration into a unified, metadata-driven platform. By automating repetitive data preparation tasks, it helps accelerate the process by 4x. Most importantly, Modak Nabu™ democratizes access to data products for end users across the organization, such as data engineering teams, data science teams, and citizen data scientists, while ensuring compliance with data governance policies.

Cloud Speed and Scale to build out Enterprise Data Mesh

In the cloud, portability across cloud providers and hybrid deployments is more critical than ever. With Cloudera CDP, enterprises can avoid vendor lock-in while taking advantage of key cloud capabilities such as elasticity and the separation of compute and storage. Enterprises can also tap into new technologies like Kubernetes.

With Modak Nabu™ on CDP, enterprises can shift to cloud architectures with ease, with their choice of one or more cloud providers. They will automatically get the benefits of CDP Shared Data Experience (SDX) with enterprise-grade security and governance.

Modak Nabu™ reliably curates datasets for any line of business and persona, delivering trusted data products to business analysts and data scientists. Customers using Modak Nabu™ with CDP today have deployed a Data Mesh and profiled their data at unprecedented speed. In one use case, a pharmaceutical customer’s data lake and cloud platform were up and running within 12 weeks (versus the typical 6-12 months). Over 170 different data sources, including Oracle, MySQL, Hive, SAS, and many others, were ingested and profiled by Modak Nabu™, totaling over 80K tables at petabyte scale. This is the scale and speed that cloud-native solutions can provide, and that Modak Nabu™ with CDP has been delivering.

Modak Nabu™ and Cloudera CDE’s Spark-on-Kubernetes

Modak Nabu™ relies on a framework of “Botworks”, a series of micro-jobs that accomplish the various data transformation steps, from ingestion to profiling and indexing. That is why having a flexible and efficient Spark-based service was critical.

Cloudera Data Engineering within CDP provides:

  • Fully managed Spark-on-Kubernetes service that hides the complexity of running production DE workloads at scale.
  • Auto-scaling backed by Apache YuniKorn, a high-performance scheduler designed for the cloud that provides resource quota management and both FIFO and fair scheduling.
  • Cost efficiencies by taking advantage of Spot instances.
  • First-class APIs to support automation and CI/CD use cases for seamless integration.
  • Integrated security model.

Figure 1: CDE containerized service for operational management of Spark workloads

As Spark jobs are deployed by Modak Nabu™, they are efficiently scheduled and executed on CDE’s autoscaling service, which is optimized for Kubernetes. With virtual clusters, CDE can support multiple tenants and lines of business (LOBs) by providing strong isolation and per-tenant compute quotas for cost management and chargeback models.
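To make the idea of per-tenant compute quotas concrete, here is a minimal sketch of quota-based admission. It is an illustration only and does not reflect how CDE or Apache YuniKorn implement quotas; the class name `VirtualClusterQuota`, its fields, and the tenant name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualClusterQuota:
    """Hypothetical per-tenant compute quota (all names are illustrative)."""

    tenant: str
    max_cpu_cores: int
    max_memory_gb: int
    used_cpu_cores: int = 0
    used_memory_gb: int = 0

    def can_admit(self, cpu_cores: int, memory_gb: int) -> bool:
        """Return True if the request fits within the remaining quota."""
        return (
            self.used_cpu_cores + cpu_cores <= self.max_cpu_cores
            and self.used_memory_gb + memory_gb <= self.max_memory_gb
        )

    def admit(self, cpu_cores: int, memory_gb: int) -> bool:
        """Reserve resources for a job if the quota allows it."""
        if not self.can_admit(cpu_cores, memory_gb):
            return False
        self.used_cpu_cores += cpu_cores
        self.used_memory_gb += memory_gb
        return True


# Example: a tenant capped at 8 cores and 32 GB of memory.
quota = VirtualClusterQuota(tenant="finance", max_cpu_cores=8, max_memory_gb=32)
print(quota.admit(4, 16))  # fits within the quota
print(quota.admit(4, 16))  # fills the quota exactly
print(quota.admit(1, 1))   # rejected: the quota is exhausted
```

Tracking usage per tenant in this way is also what makes chargeback possible, since each tenant's consumption is recorded separately.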

The first-class APIs provide full life-cycle management of the Spark pipelines and allow seamless integration with applications such as Modak Nabu™. This allows easy tracking of pipeline status, log management, and troubleshooting at the individual job level.
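As a rough sketch of what tracking pipeline status through an API can look like from the calling application's side, the loop below polls a job run until it reaches a terminal state. The function `fetch_status` is a hypothetical stand-in for whatever API call returns a run's status, and the state names are assumptions for illustration; neither is taken from the CDE API.

```python
import time
from typing import Callable

# Assumed terminal state names, for illustration only.
TERMINAL_STATES = {"succeeded", "failed", "killed"}


def wait_for_run(
    fetch_status: Callable[[str], str],
    run_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job run until it reaches a terminal state or polls run out.

    fetch_status is a caller-supplied (hypothetical) function that returns
    the current status string for the given run ID.
    """
    for _ in range(max_polls):
        status = fetch_status(run_id)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


# Example with a fake status source instead of a real API call.
fake_statuses = iter(["starting", "running", "succeeded"])
final = wait_for_run(lambda run_id: next(fake_statuses), "run-1", sleep=lambda s: None)
print(final)
```

Passing the status function in as a parameter keeps the polling logic independent of any particular client library and makes it easy to test.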

Search and Exploration of Data Products

Through profiling and indexing, Modak Nabu™ provides easy data discovery and exploration functionality to end users, whether they are data scientists building machine learning models or data analysts building operational reports.

To explore a dataset, the user can view the profile of the table. The profile provides a summarized view of the data product. It shows the number of distinct values, null values, range of values, and most frequent values for each column in the dataset. Users with the required permission levels can add descriptions, ratings, reviews, and tags to the dataset, which helps provide business context to other users.
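The column statistics described above can be sketched in a few lines of Python. This is a simplified illustration of column profiling in general, not Modak Nabu™'s implementation; the function names are hypothetical, and the sketch assumes rows are dictionaries whose non-null values within a column are mutually comparable.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(values: List[Any], top_n: int = 3) -> Dict[str, Any]:
    """Summarize one column: nulls, distinct values, range, frequent values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_frequent": counts.most_common(top_n),
    }


def profile_table(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Profile every column that appears in the rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Example: a tiny table with one missing value.
rows = [
    {"age": 30, "city": "Pune"},
    {"age": None, "city": "Pune"},
    {"age": 41, "city": "Austin"},
]
for column, stats in profile_table(rows).items():
    print(column, stats)
```

At the scale mentioned in this post, the same statistics would be computed with a distributed engine such as Spark rather than in-memory Python.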

Figure 2: Modak Nabu™ search interface

Users can also search for business terms or entities within Data Products through the search interface in Modak Nabu™. For any entity, the related entities can be viewed using a traversable knowledge graph. That allows users to interact and trace the dependencies between their data at the granularity of attributes.
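One way to picture tracing dependencies through a knowledge graph is a breadth-first traversal that reports how many hops away each related entity is. The sketch below is a generic illustration under assumed data: the graph is a plain adjacency dictionary, and the attribute names are hypothetical rather than drawn from Modak Nabu™.

```python
from collections import deque
from typing import Dict, List


def related_entities(
    graph: Dict[str, List[str]], start: str, max_hops: int
) -> Dict[str, int]:
    """Return entities reachable from start within max_hops, with distances."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # report only the related entities
    return distances


# Hypothetical attribute-level graph: edges link related attributes.
graph = {
    "orders.customer_id": ["customers.id"],
    "customers.id": ["customers.email", "orders.customer_id"],
    "customers.email": [],
}
print(related_entities(graph, "orders.customer_id", max_hops=2))
```

Limiting the number of hops keeps the result focused on close relationships, which matters when a graph covers tens of thousands of tables.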

Modak Nabu™ provides role-based access control to ensure that data access is compliant with the enterprise’s data governance norms.

Figure 3: Users can traverse the Modak Nabu™ knowledge graph to understand relationships across entities

Automate Pipelines

To move data from source systems to analytics layers such as a data mesh, data lake, or data warehouse, automated pipelines can be created and configured in Modak Nabu™. Users can select the tables and files from the source and the destination to which they should be moved. Modak Nabu™ provides additional controls for advanced options such as handling schema drift or setting preconditions for running a pipeline. These pipelines are then scheduled to run, either once or at a recurring frequency, using CDE’s autoscaling Spark service.
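To illustrate the two advanced options mentioned above, the sketch below detects schema drift by comparing an expected schema with the one observed at the source, and evaluates preconditions before a run. It is a generic illustration, not Modak Nabu™'s logic; the names `SchemaDrift`, `detect_schema_drift`, and `should_run` are hypothetical, and schemas are assumed to be simple column-to-type mappings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SchemaDrift:
    """Differences between an expected and an observed schema."""

    added: List[str]
    removed: List[str]
    retyped: List[str]

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_schema_drift(
    expected: Dict[str, str], observed: Dict[str, str]
) -> SchemaDrift:
    """Compare column-name-to-type mappings and report the differences."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return SchemaDrift(added=added, removed=removed, retyped=retyped)


def should_run(preconditions: List[Callable[[], bool]]) -> bool:
    """A pipeline runs only if every precondition check passes."""
    return all(check() for check in preconditions)


# Example: one column added, one removed, one changed type.
drift = detect_schema_drift(
    expected={"id": "int", "name": "string"},
    observed={"id": "bigint", "email": "string"},
)
print(drift)
print(should_run([lambda: not drift.removed]))
```

How drift is handled once detected (for example, failing the run, evolving the target schema, or ignoring new columns) is a policy decision that this sketch leaves open.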

Data Operations – Observability

Modak Nabu™ provides dashboards for extensive visibility into data operations, giving operational and executive teams data observability.

For the operational team, the monitoring dashboard provides a unified interface for tracking the real-time status of pipelines and helps with troubleshooting. For each pipeline, the dashboard shows details such as its status, the time taken for a run, the status of previous runs, and its source(s) and destination, and it provides access to logs.

The real-time monitoring dashboard helps to troubleshoot the reasons for a pipeline failure and even retry specific failed tables or files. This significantly reduces the time engineering and operations teams spend investigating and fixing pipeline failures.
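Retrying only the failed tables or files, rather than the whole pipeline, can be sketched as follows. This is a generic pattern shown for illustration; `load` is a hypothetical caller-supplied function that moves one table or file and raises an exception on failure.

```python
from typing import Callable, Dict, Iterable


def run_with_retries(
    items: Iterable[str],
    load: Callable[[str], None],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Load each item, re-running only the ones that failed."""
    pending = list(items)
    results: Dict[str, str] = {}
    for _attempt in range(max_attempts):
        failed = []
        for item in pending:
            try:
                load(item)
                results[item] = "succeeded"
            except Exception as exc:  # record the reason for troubleshooting
                results[item] = f"failed: {exc}"
                failed.append(item)
        if not failed:
            break
        pending = failed  # the next attempt covers failures only
    return results


# Example: "orders" fails on its first attempt, then succeeds.
attempts: Dict[str, int] = {}


def flaky_load(name: str) -> None:
    attempts[name] = attempts.get(name, 0) + 1
    if name == "orders" and attempts[name] < 2:
        raise RuntimeError("timeout")


print(run_with_retries(["customers", "orders"], flaky_load))
print(attempts)
```

Because successful items are never re-run, a retry costs only as much as the failed portion of the pipeline.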

Modak Nabu™ also provides business stakeholders with a summarized view of key metrics related to data operations. The dashboard shows details of data connections crawled, pipelines run, and data profiling. The view presented on the dashboard can be customized based on user-defined tags. When a tag is applied, the numbers on the executive dashboard are updated to reflect metrics for that tag.
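Filtering summary metrics by tag can be pictured as a simple aggregation over tagged assets. The sketch below is illustrative only: the record fields (`connection`, `tags`, `pipeline_runs`, `profiled`) and the metric names are assumptions, not Modak Nabu™'s data model.

```python
from typing import Dict, List, Optional


def summarize(assets: List[dict], tag: Optional[str] = None) -> Dict[str, int]:
    """Aggregate metrics, restricted to assets carrying the tag if one is given."""
    selected = [a for a in assets if tag is None or tag in a["tags"]]
    return {
        "connections_crawled": len({a["connection"] for a in selected}),
        "pipelines_run": sum(a["pipeline_runs"] for a in selected),
        "tables_profiled": sum(1 for a in selected if a["profiled"]),
    }


# Hypothetical asset records with user-defined tags.
assets = [
    {"connection": "oracle_hr", "tags": {"finance"}, "pipeline_runs": 3, "profiled": True},
    {"connection": "oracle_hr", "tags": {"hr"}, "pipeline_runs": 1, "profiled": False},
    {"connection": "mysql_sales", "tags": {"finance", "sales"}, "pipeline_runs": 2, "profiled": True},
]
print(summarize(assets))             # metrics across all assets
print(summarize(assets, "finance"))  # metrics for one tag
```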

Customized views of the dashboard can be saved and shared with other stakeholders, giving them a common, real-time view of the progress of various data management activities.

Conclusion

With the certification of Modak Nabu™ with Cloudera CDE, customers can now deploy data operations at scale in a cloud-agnostic way, with control over cost and performance. With the security and governance of Cloudera’s enterprise data platform, the operational efficiencies provided by the CDE service, and the data ingestion, preparation, and curation engine of Modak Nabu™, customers can break down their data silos and unlock the value of their data to accelerate data-driven business decisions. Start your journey with a test drive and sign up for a 60-day trial to see how Cloudera CDP and Modak Nabu™ can help.

Shaun Ahmadian
Daniel Mantovani