Cloudera Developer Blog · How-to Posts

New in Cloudera Manager 5.1: Direct Active Directory Integration for Kerberos Authentication

With this new release, setting up a separate MIT KDC for cluster authentication services is no longer necessary.

Kerberos (initially developed by MIT in the 1980s) has been adopted by every major component of the Apache Hadoop ecosystem. Consequently, Kerberos has become an integral part of the security infrastructure for the enterprise data hub (EDH).

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

How-to: Create an IntelliJ IDEA Project for Apache Hadoop

Prefer IntelliJ IDEA over Eclipse? We’ve got you covered: learn how to get ready to contribute to Apache Hadoop via an IntelliJ project.

It’s generally useful to have an IDE at your disposal when you’re developing and debugging code. When I first started working on HDFS, I used Eclipse, but I’ve recently switched to JetBrains’ IntelliJ IDEA (specifically, version 13.1 Community Edition).

How-to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager

It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!

I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I rather wanted to run a virtual cluster, making use of the 16GB of RAM my shiny new 15″ Retina Macbook Pro has ;)

How-to: Easily Do Rolling Upgrades with Cloudera Manager

Unique across all options, Cloudera Manager makes it easy to do what would otherwise be a disruptive operation for operators and users.

For the increasing number of customers that rely on enterprise data hubs (EDHs) for business-critical applications, it is imperative to minimize or eliminate downtime — thus, Cloudera has focused intently on making software upgrades a routine, non-disruptive operation for EDH administrators and users.

How-to: Use Kite SDK to Easily Store and Configure Data in Apache Hadoop

Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.

Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).

How-to: Configure JDBC Connections in Secure Apache Hadoop Environments

Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.

In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:

How-to: Process Time-Series Data Using Apache Crunch

Did you know that using the Crunch API is a powerful option for doing time-series analysis?

Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) Developers can spend more time focused on their use case by using the Crunch API to handle common tasks such as joining data sets and chaining jobs together in a pipeline. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)

How-to: Use the ShareLib in Apache Oozie (CDH 5)

The internals of Oozie’s ShareLib have changed recently (reflected in CDH 5.0.0). Here’s what you need to know.

In a previous blog post about one year ago, I explained how to use the Apache Oozie ShareLib in CDH 4. Since that time, things have changed about the ShareLib in CDH 5 (particularly directory structure), so some of the previous information is now obsolete. (These changes went upstream under OOZIE-1619.) 

How-to: Run a Simple Apache Spark App in CDH 5

Getting started with Spark (now shipping inside CDH 5) is easy using this simple example.

(Editor’s note – this post has been updated to reflect CDH 5.1/Spark 1.0)

Older Posts