How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop

Categories: CDH Hadoop HDFS

HDFS now includes (shipping in CDH 5.8.2 and later) a comprehensive storage capacity-management approach for moving data across nodes.

In HDFS, the DataNode spreads data blocks across local filesystem directories, which can be specified using dfs.datanode.data.dir in hdfs-site.xml. In a typical installation, each directory, called a volume in HDFS terminology, is on a different device (for example, on a separate HDD or SSD).

When writing new blocks to HDFS,

Read More

How-to: Secure Apache Solr Collections and Access Them Programmatically

Categories: Platform Security & Cybersecurity Search Sentry

Learn how to secure your Solr data in a policy-based, fine-grained way.

Data security is more important than ever before. At the same time, risk is increasing due to the relentlessly growing number of device endpoints, the continual emergence of new types of threats, and the commercialization of cybercrime. And with Apache Hadoop already instrumental in supporting the growth of data volumes that fuel mission-critical enterprise workloads, mastering the available security mechanisms is vital for organizations participating in that paradigm shift.

Read More

How-to: Do Scalable Graph Analytics with Apache Spark

Categories: Data Science Graph Processing How-to Spark

Get started with scalable graph analysis via simple examples that utilize GraphFrames and Spark SQL on HDFS.

Graphs—also known as “networks”—are ubiquitous across web applications. As a refresher, a graph consists of nodes and edges. A node can be any object, such as a person or an airport, and an edge is a relation between two nodes, such as a friendship or an airline connection between two cities.
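
As a minimal sketch of the node-and-edge definition above, here is a tiny graph in plain Python rather than GraphFrames. The airport codes and the `degree` helper are made up for illustration and are not taken from the original post.

```python
from collections import defaultdict

# Hypothetical example data: airports are nodes, airline connections are edges.
nodes = ["SFO", "JFK", "ORD"]
edges = [("SFO", "JFK"), ("JFK", "ORD"), ("SFO", "ORD")]


def degree(edge_list):
    """Count how many edges touch each node, treating edges as undirected."""
    counts = defaultdict(int)
    for src, dst in edge_list:
        counts[src] += 1
        counts[dst] += 1
    return dict(counts)


degrees = degree(edges)
print(degrees)
```

A library such as GraphFrames would typically hold the same two collections as distributed tables (one of nodes, one of edges) instead of in-memory lists.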

Read More

Introducing sparklyr, an R Interface for Apache Spark

Categories: Data Science Guest Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.


Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. 

Read More

Apache Spark 2.0 Beta Now Available for CDH

Categories: Hadoop Spark

Today, Cloudera announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform.

Apache Spark 2.0 is tremendously exciting (read this post for more background) because (among other things):

  • The Dataset API further enhances Spark’s claim as the best tool for data engineering by providing compile-time type safety along with the benefits of a query-optimization engine.
  • The Structured Streaming API enables the modeling of streaming data as a continuous DataFrame and expresses operations on that data with a SQL-like API.

Read More