Gartner states that “By 2022, 75% of new end-user solutions leveraging machine learning (ML) and AI techniques will be built with commercial instead of open source platforms”¹. Spoiler alert: it’s not because data scientists will stop relying on open source for the latest innovation in ML algorithms and development environments. Rather, as businesses look to operationalize machine learning capabilities at scale, they’ll increasingly turn to commercial platforms with connectors to open source, because that is where investment in enterprise features like collaboration, reuse, transparency, model management and data platform integration has been focused.
Data management for ML/AI – what’s the big deal?
Most would maintain that the majority of data scientists’ time is still spent on collecting and preparing data for analysis. As open source and commercially available algorithms, and even pre-trained models, continue to evolve rapidly, slashing the time spent on data gathering and pre-processing only becomes more important. And once a model has been trained, tuned and optimized, data scientists want to put it to work for the business ASAP. Yet it can take months to deploy models to production, and we’ve met with more than a few organizations where even an experienced team’s models never make it to production at all.
What emerges is how critical a data strategy and a core data management competency, covering both data and model management, are to supporting enterprise ML initiatives. In recent technical advice on creating a data strategy for machine learning, Gartner concludes that “The data-preprocessing architecture that transports and integrates data for ML is the connective tissue of the data strategy. Without it, ML projects become disjointed and difficult to scale and maintain.”² While open source frameworks and standalone ML services can complement a data strategy for ML, they are not a substitute for one. On their own, they won’t deliver a scalable pre- and post-processing architecture for the complete ML lifecycle, one that accounts for the complexities of big data while ensuring security and data quality for data science pipelines.
Build a future-proof AI factory on your foundation – Today
Cloudera customers can start building enterprise AI on their data management competencies today with the Cloudera Data Science Workbench (CDSW). CDSW gives data scientists the freedom to use their favorite open source and other vendor tools and libraries for the end-to-end ML workflow in addition to secure, self-service access to corporate data and distributed computing power, all managed efficiently and securely by IT. Data scientists and engineers can collaborate on shared projects for tasks ranging from data ingest and preparation to model training and deployment in production, all from one cohesive experience accessible from anywhere through a web browser.
And as part of Cloudera’s data platform for unified, multi-function analytics on shared data anywhere, CDSW brings data science securely to your data and other analytics workflows. It capitalizes on your foundational enterprise data management capabilities rather than creating silos with their associated costs and security risks.
And we couldn’t mention future-proofing without the Cloudera Data Platform (CDP), Cloudera’s next-generation platform and the industry’s first Enterprise Data Cloud. CDP will deliver a new cloud-native machine learning service that provides all the benefits of CDSW as a serverless experience in the cloud, scaling seamlessly from simple R and Python analysis to distributed TensorFlow and Spark workloads. Stay tuned.
Learn more about the Cloudera Data Science Workbench for the end-to-end ML workflow at our bi-weekly webinar series featuring live expert demos and Q&A. Register today!
By Bethann Noble, Director Product Marketing Machine Learning at Cloudera
¹ Gartner, Top 10 Data and Analytics Technology Trends That Will Change Your Business, Rita Sallam, Donald Feinberg, Mark Beyer, W. Roy Schulte, Alexander Linden, Joseph Unsworth, Svetlana Sicular, Nick Heudecker, Ehtisham Zaidi, Adam Ronthal, Erick Brethenoux, Pieter den Hamer, Alys Woodward, 11 April 2019
² Gartner, Create a Data Strategy for Machine Learning in Advanced Analytics Initiatives, Carlton Sapp, 10 May 2019