Tag Archives: Data Science

Cloudera SDX: Under the Hood

Categories: CDH

What is SDX?

Shared Data Experience (SDX) is Cloudera’s secret ingredient: it makes it possible to deploy Cloudera’s four core functions (Data Engineering, Data Science, Analytic DB, and Operational DB) on a single platform.

Why does that matter?

First, each of those core functions is essential to any modern enterprise.

  • Data Engineering enables the business to run batch or stream processes that speed up ETL and train machine learning models.
  • Data Science enables the business to do exploratory data science at big data scale with full data security and governance.
  • Analytic DB delivers the fastest time-to-insight with the flexibility and agility to run in any environment and against any type of data.

Read more

Big Data Architecture Workshop

Categories: Training

Since the birth of big data, Cloudera University has been teaching developers, administrators, analysts, and data scientists how to use big data technologies. We have taught over 50,000 folks the details of using Apache technologies such as HDFS, MapReduce, Hive, Impala, Sqoop, Flume, Kafka, Core Spark, Spark SQL, Spark Streaming, and Spark MLlib.

We have taught administrators how to plan, install, monitor, and troubleshoot clusters, and we have shown analysts the power of SQL over large, diverse data sets.

Read more

How to Distribute your R code with sparklyr and Cloudera Data Science Workbench

Categories: CDH How-to Spark

sparklyr is a great opportunity for R users to leverage the distributed computation power of Apache Spark without a lot of additional learning. sparklyr provides a backend for dplyr, so R users can write almost the same code for both local and distributed computation over Spark SQL.

Since sparklyr v0.6, we can run R code across our Spark cluster with spark_apply().

Read more

How To Predict ICU Mortality with Digital Health Data, DL4J, Apache Spark and Cloudera

Categories: CDH Data Science Spark

Modeling EHR Data in Healthcare

In this case study, we take a look at modeling electronic health record (EHR) data with deep learning and Deeplearning4j (DL4J). We draw inspiration from recent research showing that carefully designed neural network architectures can learn effectively from the complex, messy data collected in EHRs. Specifically, we describe how to train a long short-term memory recurrent neural network (LSTM RNN) to predict in-hospital mortality among patients hospitalized in the intensive care unit (ICU).

Read more

Customizing Docker Images in Cloudera Data Science Workbench

Categories: Altus CDH Cloud Data Science How-to Tools

This article shows how to build and publish a customized Docker image for use as an engine in Cloudera Data Science Workbench. Customizing the engine image lets you work with your favorite toolchain inside the web-based application.

Motivation:

Cloudera Data Science Workbench (CDSW) enables data scientists to use their favorite tools, such as R-, Python-, or Scala-based libraries, out of the box in an isolated, secure sandbox environment.

Read more