Category Archives: Hadoop

RecordService: For Fine-Grained Security Enforcement Across the Hadoop Ecosystem

Categories: Hadoop Impala Security Sentry

This new core security layer provides a unified data access path for all Hadoop ecosystem components, while improving performance.

We’re thrilled to announce the beta availability of RecordService, a distributed, scalable data access service for unified access control and enforcement in Apache Hadoop. RecordService is open source under the Apache License, and we intend to transition it to the Apache Software Foundation. In this post, we’ll explain the motivation, system architecture,


How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs

Categories: CDH Hadoop How-to Spark

Proper configuration of your Python environment is a critical pre-condition for using Apache Spark’s Python API.

One of the most enticing aspects of Apache Spark for data scientists is the API it provides for non-JVM languages: Python (via PySpark) and R (via SparkR). There are a few reasons that these language bindings have generated a lot of excitement: Most data scientists think writing Java or Scala is a drag,


Introduction to HDFS Erasure Coding in Apache Hadoop

Categories: Hadoop HDFS

Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. This post explains how it works.

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.
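The storage arithmetic behind that comparison can be sketched in a few lines. This is an illustration only: the excerpt does not name a coding scheme, so the Reed-Solomon layout of 6 data cells plus 3 parity cells used here is an assumption, and the function names are hypothetical rather than part of any HDFS API.

```python
def replication_storage(logical_size: float, replicas: int = 3) -> float:
    """Raw storage used when every block is stored `replicas` times."""
    return logical_size * replicas


def erasure_coded_storage(logical_size: float,
                          data_cells: int = 6,
                          parity_cells: int = 3) -> float:
    """Raw storage used when each group of `data_cells` data cells is
    stored alongside `parity_cells` parity cells.

    The 6 + 3 default is an assumed layout, not taken from the post.
    """
    return logical_size * (data_cells + parity_cells) / data_cells


def overhead(raw_size: float, logical_size: float) -> float:
    """Extra storage beyond the logical data, as a fraction of the data."""
    return (raw_size - logical_size) / logical_size


logical = 100.0  # arbitrary units of user data

replicated = replication_storage(logical)   # three full copies
encoded = erasure_coded_storage(logical)    # data plus parity

print(f"3x replication: raw={replicated}, overhead={overhead(replicated, logical):.0%}")
print(f"RS(6,3) coding: raw={encoded}, overhead={overhead(encoded, logical):.0%}")
print(f"raw storage saved: {1 - encoded / replicated:.0%}")
```

Under these assumptions, three-way replication stores 300 units for every 100 units of data, while the 6 + 3 layout stores 150, which is where a figure of roughly 50% comes from.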


Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics

Categories: Hadoop MapReduce YARN

In this multipart series, fully explore the tangled ball of thread that is YARN.

YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:

  • Provide a basic understanding of the components that make up YARN
  • Illustrate how a MapReduce job fits into the YARN model of computation
