Once the data science is done (you know where your data comes from, what it looks like, and what it can predict), the next big step begins: putting your model into production and making it useful for the rest of the business. This is the start of the model operations life cycle. The key focus areas (detailed in the diagram below) are typically managed by machine learning engineers after the data scientists have finished their work.
HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by adding Cloudera’s first-class enterprise support.
While previous versions of HDFS achieved fault tolerance by replicating multiple copies of data (similar to RAID1 on traditional storage arrays), erasure coding provides comparable fault tolerance with significantly less storage overhead: the default three-way replication carries a 200% storage overhead, whereas a Reed-Solomon (6,3) policy stores parity equal to only 50% of the data.
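To see what this looks like from an application's point of view, here is a minimal sketch using the Hadoop 3 DistributedFileSystem API to apply the built-in RS-6-3-1024k policy to a directory. The NameNode address and directory path are hypothetical placeholders, and the policy must already be enabled on the cluster before it can be set.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EnableErasureCoding {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            Path coldData = new Path("/data/cold");   // hypothetical directory

            // Apply a built-in Reed-Solomon policy; files written under this directory
            // afterwards are striped into 6 data + 3 parity blocks instead of being
            // replicated three times.
            dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");

            System.out.println("Policy now in effect: "
                    + dfs.getErasureCodingPolicy(coldData).getName());
        }
    }
}
```

Files that already exist under the directory keep their current layout; only data written after the policy is set is erasure coded.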
Disclaimer: the scenario below is hypothetical. Any similarity to any specific telecommunications company is purely coincidental.
Although we use the example of a telecommunications company, the following applies to every organization with customers or voluntary stakeholders.
Imagine that you are the Chief Data Officer at a major telecommunications provider, and the CEO has asked you to overhaul the existing customer churn analytics. The current process relies on manual exports of data from dozens of data sources, including ERP,
Cloudera recently launched CDH 6.2 which includes two new key features in Apache HBase:
- Serial replication
- Bucket cache support for Intel's Optane memory
HBase has a sophisticated asynchronous replication mechanism that today supports complex topologies, including global round-robin, two-way, fan-in, and fan-out configurations.
To date, this replication capability has provided eventual consistency, meaning that the order in which updates are replicated is not necessarily the same as the order in which they were applied to the source database.
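Serial replication is configured per replication peer. The sketch below is a minimal example using the HBase 2.x Java Admin API (the client version shipped with CDH 6.2); the ZooKeeper quorum and peer id are placeholder values, not anything prescribed by the feature.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddSerialReplicationPeer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Cluster key of the peer: zk-quorum:port:/hbase-znode (placeholder values)
            ReplicationPeerConfig peerConfig = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase")
                    .setSerial(true)   // ship edits in the order they were applied locally
                    .build();

            // Register the peer; "1" is an arbitrary peer id chosen for this example
            admin.addReplicationPeer("1", peerConfig);
        }
    }
}
```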
Cloudera Search is a highly scalable and flexible search solution based on Apache Solr that enables exploration, discovery, and analytics over massive unstructured and semi-structured datasets (for example logs, emails, DNA strings, claims forms, JPEGs, and XLS spreadsheets). It has been adopted by a large number of Cloudera customers across a wide range of industries for high-ROI, SLA-bound workloads, many of them with strict security and compliance requirements.
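As a concrete illustration of that kind of exploration, the sketch below uses the standard SolrJ client to query a log collection. The Solr URL, collection name, and field names (message, level, timestamp, id) are hypothetical placeholders chosen for this example, not part of any shipped schema.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class LogSearchExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr endpoint and collection name
        String solrUrl = "http://search-node.example.com:8983/solr/logs";

        try (HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            // Full-text match on the message field, narrowed by filter queries
            SolrQuery query = new SolrQuery("message:\"connection timeout\"");
            query.addFilterQuery("level:ERROR");
            query.addFilterQuery("timestamp:[NOW-7DAY TO NOW]");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("message"));
            }
        }
    }
}
```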
In CDH 6.2 we introduce two new features in Cloudera Search relating to document-level security.