Putting Machine Learning Models into Production

Categories: AI and Machine Learning Cloudera Data Science Workbench Spark

Once the data science is done (and you know where your data comes from, what it looks like, and what it can predict) comes the next big step: you now have to put your model into production and make it useful for the rest of the business. This is the start of the model operations life cycle. The key focus areas (detailed in the diagram below) are usually managed by machine learning engineers after the data scientists have done their work.

Read more

HDFS Erasure Coding in Production

Categories: CDH

HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by adding Cloudera’s first-class enterprise support.

While previous versions of HDFS achieved fault tolerance by replicating multiple copies of data (similar to RAID1 on traditional storage arrays),

Read more

Visual Model Interpretability for Telco Churn in Cloudera Data Science Workbench

Categories: CDH Cloudera Data Science Workbench Fast Forward Labs Spark

Disclaimer: the scenario below is hypothetical. Any similarity to any specific telecommunications company is purely coincidental.

Although we use the example of a telecommunications company, the following applies to every organization with customers or voluntary stakeholders.


Imagine that you are a Chief Data Officer at a major telecommunications provider and the CEO has asked you to overhaul the existing customer churn analytics. The current process relies on manual export of data from dozens of data sources including ERP,

Read more

CDH 6.2 Release: What’s new in HBase

Categories: CDH

Cloudera recently launched CDH 6.2, which includes two key new features in Apache HBase:

  1. Serial replication
  2. Bucket cache now supports Intel’s Optane memory

Serial replication

HBase has a sophisticated asynchronous replication mechanism that today supports complex topologies, including global round-robin, two-way, span-in, and span-out.

To date, this replication capability has provided eventual consistency, meaning that the order in which updates are replicated is not necessarily the same as the order in which they were applied to the database.
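The ordering issue can be illustrated with a toy model. The sketch below is not HBase code and uses no HBase API; the function `apply_at_sink`, the batch layout, and the update strings are all hypothetical, chosen only to show how batches shipped independently can be applied at the destination in a different order than the source applied them.

```python
def apply_at_sink(batches, arrival_order):
    """Apply batches in the order they arrive; return the applied sequence ids."""
    applied = []
    for batch_index in arrival_order:
        for seq_id, _update in batches[batch_index]:
            applied.append(seq_id)
    return applied


# Updates in the order the source applied them, split into shipping batches.
# (Hypothetical data: sequence id plus a human-readable description.)
batches = [
    [(1, "put row1 v=a"), (2, "put row1 v=b")],
    [(3, "delete row1"), (4, "put row1 v=c")],
]

# Without an ordering guarantee, the second batch may arrive first.
unordered = apply_at_sink(batches, arrival_order=[1, 0])

# With in-order delivery, batches arrive in source order.
in_order = apply_at_sink(batches, arrival_order=[0, 1])

print(unordered)  # [3, 4, 1, 2]
print(in_order)   # [1, 2, 3, 4]
```

Both runs apply the same set of updates, so this model says nothing about whether the final state converges; it only shows that the order seen at the destination can differ from the order at the source.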

Read more

CDH6.2 – Cloudera Search Attribute Based Access Control Part 2

Categories: CDH Search Sentry

Cloudera Search is a highly scalable and flexible search solution based on Apache Solr that enables exploration, discovery, and analytics over massive unstructured and semi-structured datasets (for example, logs, emails, DNA strings, claims forms, JPEGs, and XLS sheets). It has been adopted by a large number of Cloudera customers across a wide range of industries for high-ROI and SLA-bound workloads, and many of those customers have strict requirements around security and compliance.

In CDH 6.2, we introduce two new features to Cloudera Search relating to document-level security.

Read more