Cloudera Engineering Blog · How-to Posts

How-to: Get Started with Sentry in Hive

A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive

One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at We republish it here for your convenience.

How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows

The second how-to in a series about using the Apache HBase Thrift API

Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time. 

Working with Tables

How-to: Index and Search Data with Hue’s Search App

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Indexing Data in Cloudera Search

How-to: Shorten Your Oozie Workflow Definitions

While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also known for being very verbose. Unfortunately, that means that for workflows that have many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to help shorten your Oozie workflow definitions.

The Sub-Workflow Action

One of the more interesting action types that Oozie has is the Sub-Workflow Action; it allows you to run another workflow from your workflow. Suppose you have a workflow where you’d like to use the same action multiple times; this is not usually allowed because Oozie workflows are Direct Acyclic Graphs (DAG) and so actions cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can actually call it multiple times from within the same workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action to be able to use it multiple times (and taking up a lot of extra space), you can just use the Sub-Workflow Action, which could be shorter; it is also easier to maintain because if you ever want to change that action, you only have to do it in one place. You also get the advantage of being able to use that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.

We’re always looking for new ways to improve the usability of Oozie and of the workflow format.

How-to: Add Cloudera Search to Your Cluster using Cloudera Manager

Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).

In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.

Installing the SOLR Parcel

How-to: Use HBase Bulk Loading, and Why

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.

This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.

Overview of Bulk Loading

How-to: Use the HBase Thrift Interface, Part 1

There are various way to access and interact with Apache HBase. Most notably, the Java API provides the most functionality. But some people want to use HBase without Java.

Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)

How-to: Manage HBase Data via Hue

The following post was originally published by the Hue Team at the Hue blog in a slightly different form.

In this post, we’ll take a look at the new Apache HBase Browser App added in Hue 2.5 and which has improved significantly since then. To get the Hue HBase browser, grab Hue via CDH 4.4 packages, via Cloudera Manager, or build it directly from GitHub.

How-to: Write an EL Function in Apache Oozie

When building complex workflows in Apache Oozie, it is often useful to parameterize them so they can be reused or driven from a script, and more easily maintained. The most common method is via ${VAR} variables. For example, instead of specifying the same NameNode for all of your actions in a given workflow, you can specify something like ${myNameNode}, and then in your file, you would define it like myNameNode=hdfs://localhost:8020.

One of the advantages of that approach is that if you want to change the variable (the NameNode in this example), you only have to change it in one place and subsequently all the actions will use the new value. This can be particularly useful when testing in a dev or staging environment where you can simply change a few variables instead of editing the workflow itself.

How-to: Select the Right Hardware for Your New Hadoop Cluster

One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.

Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)

Newer Posts Older Posts