Cloudera Developer Blog · How-to Posts

How-to: Index and Search Data with Hue’s Search App

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Indexing Data in Cloudera Search

How-to: Shorten Your Oozie Workflow Definitions

While XML is good for standardizing the way Apache Oozie workflows are written, it is also notoriously verbose. Unfortunately, that means that for workflows with many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you'll get a quick look at some of the current features and tricks you can use to help shorten your Oozie workflow definitions.

The Sub-Workflow Action

One of the more interesting action types in Oozie is the Sub-Workflow Action, which lets you run another workflow from within your workflow. Suppose you want to use the same action multiple times in one workflow; that is not normally allowed because Oozie workflows are Directed Acyclic Graphs (DAGs), so an action cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can call it multiple times from the parent workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action (and taking up a lot of extra space), you can use the Sub-Workflow Action, which is usually shorter; it is also easier to maintain, because if you ever want to change that action, you only have to change it in one place. You also gain the ability to reuse that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.

We’re always looking for new ways to improve the usability of Oozie and of the workflow format.

How-to: Add Cloudera Search to Your Cluster using Cloudera Manager

Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. As a result, Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera's distribution of Apache Hadoop and related projects).

In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.

Installing the SOLR Parcel

How-to: Use HBase Bulk Loading, and Why

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
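For a concrete feel for the pieces involved, here is a minimal sketch of the driver side of a bulk load, assuming a reasonably recent HBase Java client (the exact classes have shifted between HBase versions). The table name and HFile directory are hypothetical, and the MapReduce job that actually writes the HFiles is only stubbed out in a comment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("my_table");   // hypothetical table
    Path hfileDir = new Path("/tmp/bulkload-output");      // hypothetical HFile directory

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {

      // Step 1: run a MapReduce job that writes HFiles partitioned to match
      // the table's current region boundaries (mapper and input are omitted here).
      Job job = Job.getInstance(conf, "bulk-load-prepare");
      job.setJarByClass(BulkLoadDriver.class);
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      // ... set the mapper, input format, and input path, then job.waitForCompletion(true) ...

      // Step 2: hand the finished HFiles to the RegionServers, which adopt
      // them directly instead of replaying every write through the write path.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, admin, table, locator);
    }
  }
}

The appeal of this pattern is that the second step moves complete files into place rather than pushing each row through the normal client write path, which is why it tends to be faster for large loads.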

This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.

Overview of Bulk Loading

How-to: Use the HBase Thrift Interface, Part 1

There are various ways to access and interact with Apache HBase. Notably, the Java API provides the most functionality, but some people want to use HBase without Java.

Those people have two main options: one is the Thrift interface (the more lightweight and hence faster of the two), and the other is the REST interface (aka Stargate). The REST interface uses HTTP verbs to perform actions; because it is built on HTTP, it can be accessed from a much wider array of languages and programs. (If you'd like more information about the REST interface, you can refer to my series of how-tos about it.)
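To make the HTTP-verb point concrete, here is a small, hedged sketch in plain Java that reads one cell through the REST interface. The host, port, table, row, and column are made-up placeholders, and it assumes the HBase REST server is already running on that port.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HBaseRestGet {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint: /<table>/<row>/<family:qualifier>
    URL url = new URL("http://rest-host.example.com:8080/users/row1/info:email");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");                                   // the HTTP verb carries the action
    conn.setRequestProperty("Accept", "application/octet-stream");  // ask for the raw cell value

    if (conn.getResponseCode() == 200) {
      try (InputStream in = conn.getInputStream()) {
        byte[] value = in.readAllBytes();                           // requires Java 9+
        System.out.println(new String(value, StandardCharsets.UTF_8));
      }
    } else {
      System.out.println("GET failed: HTTP " + conn.getResponseCode());
    }
    conn.disconnect();
  }
}

The same URL pattern with PUT or DELETE performs writes and deletes, which is the sense in which the HTTP verbs do the work; any language with an HTTP client can do the same.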

How-to: Manage HBase Data via Hue

The following post was originally published by the Hue Team at the Hue blog in a slightly different form.

In this post, we'll take a look at the new Apache HBase Browser app, which was added in Hue 2.5 and has improved significantly since then. To get the Hue HBase browser, grab Hue via the CDH 4.4 packages, via Cloudera Manager, or build it directly from GitHub.

How-to: Write an EL Function in Apache Oozie

When building complex workflows in Apache Oozie, it is often useful to parameterize them so they can be reused, driven from a script, and more easily maintained. The most common method is via ${VAR} variables. For example, instead of specifying the same NameNode for all of your actions in a given workflow, you can specify something like ${myNameNode}, and then define it in your job.properties file as myNameNode=hdfs://localhost:8020.

One of the advantages of that approach is that if you want to change the variable (the NameNode in this example), you only have to change it in one place and subsequently all the actions will use the new value. This can be particularly useful when testing in a dev or staging environment where you can simply change a few variables instead of editing the workflow itself.
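And as the post's title suggests, when ${VAR} substitution isn't enough, you can go a step further and write your own EL function, which is just a public static Java method that the Oozie server is configured to expose. Below is a minimal sketch with a hypothetical class and function name; the oozie-site.xml registration that maps a prefix and name to this method is only described in the comment.

package com.example.oozie;  // hypothetical package

/**
 * A custom Oozie EL function is simply a public static Java method.
 * After registering it in oozie-site.xml (an oozie.service.ELService.ext.functions.*
 * property mapping something like "myfuncs:toUpper" to this class and method)
 * and restarting Oozie, a workflow can call it as ${myfuncs:toUpper(wf:user())}.
 */
public class MyElFunctions {

  /** Upper-cases a string; returns an empty string for null input. */
  public static String toUpper(String value) {
    return value == null ? "" : value.toUpperCase();
  }
}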

How-to: Select the Right Hardware for Your New Hadoop Cluster

One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.

Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)

How-to: Achieve Higher Availability for Hue

Few projects within the Apache Hadoop umbrella have as much end-user visibility as Hue, the open source Web UI that makes Hadoop easier to use. Due to the great number of potential end users, it is useful to add a degree of fault tolerance to your deployment. This how-to describes how to achieve higher availability by placing several Hue instances behind a load balancer.

Tutorial

This tutorial demonstrates how to set up high availability by:

How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM

One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?

To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM. The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.
