When it comes to machine learning (ML) in the enterprise, there are many misconceptions about what it actually takes to employ machine learning models effectively and scale AI use cases. When businesses start their journey into ML and AI, they commonly put most of their energy and focus into the coding and the data science algorithms themselves. While it’s important to have in-house data science expertise and ML experts on hand to build and test models, the actual data science work, and the machine learning models themselves, are only one part of the broader enterprise machine learning puzzle. The rest of the puzzle centers on breaking down workflow silos and creating a seamless machine learning lifecycle: from the data preparation and pipelines that power the models, to managing the models’ move to production, and finally to how data science teams can enable those models to meaningfully impact the business through explainable and trustworthy predictive applications.
Accelerating the Full Machine Learning Lifecycle With Cloudera Data Platform
At Cloudera, we’re focused on making enterprise organizations successful with machine learning. First and foremost, we designed the Cloudera Data Platform (CDP) to optimize every step required to go from raw data to AI use cases. Part of this strategy is to focus our product development on creatively and proactively tackling the real enterprise problems our customers experience on a day-to-day basis. In 2020 alone, we released three entirely new services and product features designed to address significant lifecycle challenges and to innovate ahead of the competition on emerging problems associated with cutting-edge use cases, such as building predictive applications and powering ML models at the edge.
Using CDP Data Engineering For Automating Machine Learning Pipelines
Data engineering has become one of the fastest-growing data professions in the world, yet there has been a market gap in purpose-built tooling designed to enable better data engineering workflows. While many data science and machine learning tools offer basic data engineering functionality such as job scheduling, modern data engineering requires forward-thinking approaches to data curation as well as pipeline orchestration, automation, and optimization.
“Through 2023, data scientists and analysts will lose 60% to 70% of their productive time to activities like finding, preparing, integrating and sharing datasets, making data engineers a must-have persona on their teams.”
Laurence Goasduff, Gartner
In August 2020 we released CDP Data Engineering (DE), our answer to enabling fast, optimized, and automated data engineering for analytic workloads. Building on Apache Spark, which has become the de facto computing framework for modern data engineering, DE is an all-inclusive data engineering toolset that provides orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive pipeline management tools to streamline ETL processes across enterprise analytics teams. For data science and machine learning teams, this means the data used in machine learning models is optimized, always on, and kept accurate, enabling more AI use cases with lower risk for decision-makers.
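To make the orchestration idea concrete, here is a minimal sketch of dependency-ordered pipeline execution in plain Python. It is not the Apache Airflow or CDP Data Engineering API: the task names (`ingest`, `clean`, `build_features`, `publish`) and the `run_pipeline` helper are hypothetical, chosen only to illustrate how an orchestrator decides what runs when.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run each task once, only after all of its upstream tasks have run.

    tasks: dict mapping task name -> zero-argument callable (hypothetical steps)
    dependencies: dict mapping task name -> set of upstream task names
    Returns the list of task names in the order they were executed.
    """
    order = []
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


# Hypothetical ETL steps; each one only records that it ran.
log = []
tasks = {
    "ingest": lambda: log.append("ingest"),
    "clean": lambda: log.append("clean"),
    "build_features": lambda: log.append("build_features"),
    "publish": lambda: log.append("publish"),
}
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "build_features": {"clean"},
    "publish": {"build_features"},
}

executed = run_pipeline(tasks, dependencies)
print(executed)
```

A production orchestrator layers scheduling, retries, and monitoring on top of this ordering step; the sketch covers only the dependency resolution.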
Build, Deploy, and Operate ML Models Faster with CDP Machine Learning
CDP Data Engineering integrates seamlessly with CDP Machine Learning and automates the data pipelines that feed it. Data scientists can build models directly from this data without moving or transferring any workloads, then deploy models into production with just a few clicks. Getting to production and creating trust with decision-makers are among the biggest blockers to successful AI use cases. This is why, in addition to secure self-service access to ML data pipelines, libraries, runtimes, and IDEs, CDP Machine Learning provides best-in-class production ML capabilities and insight-sharing features that make it simple to deploy, monitor, govern, and deliver results across the business. Unique to CDP Machine Learning is the ability to monitor not just model performance, but also individual predictions down to the feature level. This is especially useful for deeply analyzing and ground-truthing models in production, then automating model retraining based on changes in accuracy resulting from continuous learning.
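The retraining trigger described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CDP Machine Learning API: the `AccuracyMonitor` class, its `window` and `threshold` parameters, and the sample predictions are all hypothetical, and the sketch tracks overall accuracy only, not feature-level detail.

```python
from collections import deque


class AccuracyMonitor:
    """Track recent predictions against ground truth and flag when to retrain."""

    def __init__(self, window=100, threshold=0.8):
        self.threshold = threshold
        # Keep only the most recent `window` outcomes (True = correct prediction).
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once windowed accuracy drops below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


# Hypothetical (predicted, actual) pairs arriving as ground truth becomes known.
monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy())
print(monitor.needs_retraining())
```

The fixed-size window is a deliberate choice: it lets the monitor react to recent drift instead of averaging it away against a long history of correct predictions.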
To further enable interpretability and trust across the ML lifecycle, we recently released Cloudera Shared Data Experience for models. This brings automatic model lineage tracking, audit trails, and model cataloging to CDP Machine Learning, enabling a greater degree of control, transparency, and trust when deploying and operating models in production.
Putting It All Together And Delivering Actionable AI Use Cases
So far we’ve explored how CDP integrates and simplifies end-to-end machine learning workflows, from data curation to model development to production ML. These innovations solve many of the problems enterprises have with getting ML models to production, but one of the biggest and most often overlooked challenges with ML adoption is how technical teams enable business users and decision-makers to trust and act on the resulting predictions. An all-too-common barrier to successful enterprise machine learning is the failure to create trustworthy and explainable results for business teams. To tackle this last-mile issue, we released CDP Data Visualization, an easy-to-use, self-service dashboarding and intelligent reporting tool included with CDP Machine Learning out of the box.
CDP Data Visualization enables everybody across the ML lifecycle to quickly and easily share insights and build complete predictive reporting applications in a drag-and-drop interface. ML models can be exposed and queried to make new predictions in an end-user application, effectively completing the ML lifecycle and delivering true end-to-end ML that makes it easy to adopt and scale AI use cases across the business.
To learn more about how Cloudera Data Platform accelerates the full machine learning lifecycle to deliver greater impact from your AI investment, watch the latest webinar: Enable The Full ML Lifecycle For Scaling AI Use Cases.