Category Archives: CDH

Kafka Replication: The case for MirrorMaker 2.0

Categories: CDH

Apache Kafka has become an essential component of enterprise data pipelines. It is used for tracking clickstream event data, collecting logs, gathering metrics, and serving as the enterprise data bus in microservices-based architectures. Kafka is essentially a highly available and highly scalable distributed log of all the messages flowing in an enterprise data pipeline. Kafka supports internal replication to provide data availability within a cluster. However, enterprises require that the data availability and durability guarantees also hold across entire-cluster and site failures.

Read more

What’s New in Cloudera Altus Director 6.2?

Categories: CDH Cloud Cloudera Director

Cloudera Altus Director helps you deploy, scale, and manage Cloudera clusters on AWS, Microsoft Azure, or Google Cloud Platform. Altus Director both enables and enforces the best practices of big data deployments and cloud infrastructure. Altus Director’s enterprise-grade features deliver a mechanism for establishing production-ready clusters in the cloud for big data workloads and applications in a simple, reliable, automated fashion. In this post, you will learn about new functionality and changes in release 6.2.

Read more

Transparent Hierarchical Storage Management with Apache Kudu and Impala

Categories: CDH Impala Kudu Parquet

When choosing storage for an application, it is common to pick the single storage option whose features best fit your use case. For mutability and real-time analytics workloads you may want to use Apache Kudu, but for massive scalability at a low cost you may want to use HDFS. For that reason, there is a need for a solution that allows you to leverage the best features of multiple storage options.

Read more

Using Native Math Libraries to Accelerate Spark Machine Learning Applications

Categories: AI and Machine Learning CDH Performance Spark

[Editor’s note: The original version of this article was published as part of our Guru How-To series for Data Science. Be sure to also check out the series for Cloudera Data Warehouse.]

Spark ML is one of the dominant frameworks for many major machine learning algorithms, such as the Alternating Least Squares (ALS) algorithm for recommendation systems, the Principal Component Analysis algorithm, and the Random Forest algorithm.

Read more

Integrating Machine Learning Models into Your Big Data Pipelines in Real-Time With No Coding

Categories: AI and Machine Learning CDH Cloudera Data Science Workbench How-to

[Editor’s note: This article was originally published on the Hortonworks Community Connection and is reproduced here because CDSW is now available on both Cloudera and Hortonworks platforms.]

Using Deployed Models as a Function as a Service

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), 

Read more