How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 2

Categories: CDH Cloud How-to Ops and DevOps Platform Security & Cybersecurity

In Part 1 of this series, we covered all the prerequisites needed to deploy a CDH cluster on the Microsoft Azure cloud platform. In Part 2, we cover the resources required on the Azure platform and deploy a cluster with Cloudera Director.

Cloudera Director Use Case

Cloudera Director simplifies cluster creation and shortens the time to an operational cluster in the cloud. It’s a great tool for running POCs in your organization.

Read More

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search

Categories: CDH How-to Search

In this guide, learn how to use Cloudera Search with Basis Technology’s Rosette® to perform fuzzy name searches in multiple languages and scripts.

Our thanks to the Basis Technology team (Jeanne Le Garrec, Hannah MacKenzie-Margulies, and Brian Sawyer) for their support in writing this how-to.

Cloudera Search, powered by Apache Solr, brings full-text, interactive search and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS, Apache HBase,

Read More

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 1

Categories: CDH Cloud Hadoop How-to Ops and DevOps Platform Security & Cybersecurity

Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), SAMBA, and SSSD to deploy a secure EDH cluster for workloads in the public cloud.

Authenticating users in Apache Hadoop is the first line of security we recommend. As in most, if not all, RDBMSs, each user is given a username and a password to validate their identity, and this validation is required to access any data managed by those systems.

Read More

HDFS DataNode Scanners and Disk Checker Explained

Categories: CDH Hadoop HDFS

As many of us know, data in HDFS is stored on DataNodes, and HDFS can tolerate DataNode failures by replicating the same data to multiple DataNodes. But what exactly happens if some DataNodes’ disks are failing? This blog post explains how some of the background work is done on the DataNodes to help HDFS manage its data across multiple DataNodes for fault tolerance. In particular, we will explain the block scanner, volume scanner,

Read More

How-to: Automate Your sparklyr Environment with Cloudera Director

Categories: Cloudera Manager Data Science Hadoop How-to Ops and DevOps Spark

Since the launch of sparklyr, working with Apache Spark in Apache Hadoop has become much easier for R users. sparklyr provides a dplyr interface to Spark and lets users leverage crucial machine learning algorithms from Spark MLlib and H2O Sparkling Water. This greatly lowers the barrier to entry for R users adopting Spark as a tool for big data and should go a long way toward enabling R workloads to migrate to Hadoop.

Read More