As a member of Cloudera’s Partner Engineering team, I evaluate hardware and cloud computing platforms offered by commercial partners who want to certify their products for use with Cloudera software. One of my primary goals is to make sure that these platforms provide a stable and well-performing base upon which our products will run, a state of operation that a wide variety of customers performing an even wider variety of tasks can appreciate.
Cloudera Data Science Workbench (CDSW) provides data science teams with a self-service platform for quickly developing machine learning workloads in their preferred language, with secure access to enterprise data and simple provisioning of compute. Individuals can request schedulable resources (e.g. compute, memory, GPUs) on a shared cluster that is managed centrally.
While self-service provisioning of resources is critical to the rapid iteration cycle of data scientists, it can pose a challenge to administrators.
What is SDX?
Shared Data Experience (SDX) is Cloudera’s secret ingredient that makes it possible to deploy Cloudera’s four core functions (Data Engineering, Data Science, Analytic DB, Operational DB) on a single platform.
Why does that matter?
First, each of those core functions is essential to any modern enterprise business.
- Data Engineering enables the business to run batch or stream processes that speed ETL and train machine learning models.
- Data Science enables the business to do exploratory data science at big data scale with full data security and governance.
- Analytic DB delivers the fastest time-to-insight with the flexibility and agility to run in any environment and against any type of data.
At Cloudera, we’re always working to provide our customers and the Apache Spark community with the most robust, most reliable software possible. This article describes some recent engineering work on [SPARK-8425] that is available in CDH 5.10 and CDH 5.11, as well as in upstream Apache Spark starting with the 2.2 release.
The work pertains to the Blacklist Tracker mechanism in Spark’s scheduler. This was the subject of a recent Spark Summit talk,
Cloudera has announced support for the Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.
In July 2015, Cloudera reaffirmed the position it has held since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative,