Cloudera Developer Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


How-to: Implement Role-based Security in Impala using Apache Sentry

This quick demo illustrates how easy it is to implement role-based access control in Impala using Sentry.

Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.

Apache ZooKeeper Resilience at Pinterest

The guest post below was originally authored by Pinterest engineer Raghavendra Prabhu and published by the Pinterest Engineering blog. Being big ZooKeeper fans, we re-publish it here for your convenience. Thanks, Pinterest!

Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management, and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and when it is used for site-critical functionality, those failures can have a significant impact on site availability.

Inside Apache Oozie HA

Oozie’s new HA qualities help cluster operators sleep well at night. Here’s how they work.

One of the big new features in CDH 5 for Apache Oozie is High Availability (HA). In designing this feature, the Oozie team at Cloudera had two main goals: 1) don’t change the API or usage patterns, and 2) the user shouldn’t even have to know that HA is enabled. In other words, we wanted Oozie HA to be as easy and transparent as possible.

Apache Spark: A Delight for Developers

Sure, Spark is fast, but it also gives developers a positive experience they won’t soon forget.

Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit – the elegance of the development experience – gets less mainstream attention.

The Truth About MapReduce Performance on SSDs

Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.

In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.

HBaseCon 2014: Speakers, Keynotes, and Sessions Announced

Users of diverse, real-world HBase deployments around the world present at this year’s event.

This year’s agenda for HBaseCon, the conference for the Apache HBase community (developers, operators, contributors), looks “Stack-ed” with can’t-miss keynotes and breakouts. Program committee, you really came through (again).

Meet the Instructor: Bruce Martin

In this installment of “Meet the Instructor”, our interview subject is Bruce Martin.

What is your role at Cloudera?

This Month in the Ecosystem (February 2014)

Welcome to our sixth edition of “This Month in the Ecosystem,” a digest of highlights from February 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

February being a short month, the list is relatively short — but never confuse quantity with quality!

A Guide to Checkpointing in Hadoop

Understanding how checkpointing works in HDFS can make the difference between a healthy cluster and a failing one.

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.

Why Apache Spark is a Crossover Hit for Data Scientists

Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)
