Cloudera Developer Blog · How-to Posts
Few projects under the Apache Hadoop umbrella have as much end-user visibility as Hue, the open source Web UI that makes Hadoop easier to use. Because so many end users may depend on it, it is useful to add a degree of fault tolerance to your deployment. This how-to describes how to achieve higher availability by placing several Hue instances behind a load balancer.
This tutorial demonstrates how to set up high availability by:
One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?
To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM. The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.
In a prior blog post, Omar explained two important concepts introduced in Cloudera Manager 4.5: Role Groups and Host Templates. In this post, I’ll demonstrate how to use role groups and host templates to easily expand an existing CDH cluster onto heterogeneous hardware. If you haven’t already looked at Omar’s post, I’d recommend doing so before reading this one, as I’ll assume you are familiar with role groups and host templates.
Although these instructions and screenshots are based on Cloudera Manager 4.5, they apply to subsequent releases as well.
Initial State and Goal
This post describes enlarging a CDH4 cluster running HDFS and MapReduce from five nodes to 10. Initially, our cluster contains the hosts mikem-old-[1-5].ent.cloudera.com. Each host has a single physical drive storing HDFS data mounted at /data/1/. You can see the count of roles and services in the screenshot below:
This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.
Getting Rows with XML
With the GET verb, you can retrieve a single row or a group of rows based on their row keys. (You can read more about the multiple value URL format here.) Here we use the wildcard character, an asterisk (*), to get all rows whose keys start with a specific string. In this example, we can load every line of Shakespeare’s comedies with “shakespeare-comedies-*”. This requires that the row keys be laid out as “AUTHOR-WORK-LINENUMBER”.
Here is the code for getting and working with the XML output:
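A minimal sketch of working with that XML output, assuming the REST server returns a CellSet document in which row keys, column names, and cell values are base64-encoded. The `decode_cellset` helper, the `message:line` column, and the sample document are hypothetical and stand in for a real response to a GET request sent with an `Accept: text/xml` header.

```python
import base64
import xml.etree.ElementTree as ET


def b64(text):
    """Base64-encode a string, as used to build the sample document below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_cellset(xml_text):
    """Return a list of (row_key, column, value) tuples from CellSet XML."""
    cells = []
    root = ET.fromstring(xml_text)
    for row in root.iter("Row"):
        row_key = base64.b64decode(row.get("key")).decode("utf-8")
        for cell in row.iter("Cell"):
            column = base64.b64decode(cell.get("column")).decode("utf-8")
            value = base64.b64decode(cell.text or "").decode("utf-8")
            cells.append((row_key, column, value))
    return cells


# Hypothetical sample standing in for a real server response.
SAMPLE_XML = (
    "<CellSet>"
    '<Row key="{k}"><Cell column="{c}">{v}</Cell></Row>'
    "</CellSet>"
).format(
    k=b64("shakespeare-comedies-000001"),
    c=b64("message:line"),
    v=b64("A midsummer night's dream"),
)

for row_key, column, value in decode_cellset(SAMPLE_XML):
    print(row_key, column, value)
```

Keeping the decoding in one helper means the rest of the program can work with plain strings instead of base64 text.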
Apache Oozie has a Java client and a Java API for submitting and monitoring jobs, but what if you want to use Oozie from another language or a non-Java system? Oozie provides a Web Services API, which is an HTTP REST API. That is, you can do anything with Oozie simply by making requests to the Oozie server over HTTP. In fact, this is how the Oozie client and Oozie Java API themselves talk to the Oozie server.
In this how-to, I’ll explain how the REST API works.
What is REST?
REST (Representational State Transfer) is a stateless architectural style for a client and server to communicate over HTTP. The client typically makes HTTP requests and the server sends back an HTTP response. The Oozie server accepts GET, PUT, and POST requests depending on the command. GET is typically used for commands that are querying the server for information and don’t have any side-effects (e.g. asking for a list of jobs). PUT is typically used for commands that are changing an already existing job (e.g. suspending a job). And POST is used for submitting a job.
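The mapping of verbs to commands can be sketched in Python. This is an illustration under stated assumptions, not the Oozie client's own code: the server address, the job ID, and the three helper functions are hypothetical, and the URL paths should be checked against the Web Services API documentation for your Oozie version. The requests are only constructed here, not sent.

```python
import urllib.request

# Assumption: a hypothetical Oozie server address.
OOZIE_URL = "http://localhost:11000/oozie"


def list_jobs_request(base_url=OOZIE_URL):
    """GET: query the server for job information, with no side effects."""
    return urllib.request.Request(base_url + "/v1/jobs", method="GET")


def suspend_job_request(job_id, base_url=OOZIE_URL):
    """PUT: change an already existing job, here by suspending it."""
    url = "{}/v1/job/{}?action=suspend".format(base_url, job_id)
    return urllib.request.Request(url, method="PUT")


def submit_job_request(config_xml, base_url=OOZIE_URL):
    """POST: submit a new job described by an XML configuration."""
    return urllib.request.Request(
        base_url + "/v1/jobs",
        data=config_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )


# Hypothetical job ID, for illustration only.
requests = [
    list_jobs_request(),
    suspend_job_request("0000001-130101000000000-oozie-oozi-W"),
    submit_job_request("<configuration></configuration>"),
]
for req in requests:
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send the request to a running server.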
Helping users manage hundreds of configurations for the growing family of Apache Hadoop services has always been one of Cloudera Manager’s main goals. Prior to version 4.5, it was possible to set configurations at the service (e.g. hdfs), role type (e.g. all datanodes), or individual role level (e.g. the datanode on machine17). An individual role would inherit the configurations set at the service and role-type levels. Configurations made at the role level would override those from the role-type level. While this approach offered flexibility when configuring clusters, it made configuring subsets of roles in the same way tedious.
In Cloudera Manager 4.5, this issue is addressed with the introduction of role groups. For each role type, you can create role groups and assign configurations to them. The members of those groups then inherit those configurations. For example, in a cluster with heterogeneous hardware, a datanode role group can be created for each host type and the datanodes running on those hosts can be assigned to their corresponding role group. That makes it possible to tweak the configurations for all the datanodes running on the same hardware by modifying the configurations of one role group.
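The inheritance described above can be pictured with a toy model. This sketch is not Cloudera Manager's API; the `effective_config` function and the example values are hypothetical, and it assumes that a more specific level overrides a more general one.

```python
from collections import ChainMap


def effective_config(service, role_group, role):
    """Merge three configuration levels; the most specific level wins."""
    # ChainMap looks up keys left to right, so role overrides role group,
    # and role group overrides service.
    return dict(ChainMap(role, role_group, service))


# Hypothetical values for illustration.
service = {"dfs.replication": 3, "dfs.datanode.handler.count": 3}
big_disk_group = {"dfs.datanode.handler.count": 10}
one_role = {"dfs.replication": 2}

print(effective_config(service, big_disk_group, one_role))
```

Changing a value in `big_disk_group` changes it for every role that uses that group, which is the convenience role groups are meant to provide.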
In addition to making it easy to manage configurations of subsets of roles, role groups also make it possible to maintain different configurations for experimentation or managing shared clusters for different users and/or workloads.
Viewing and Editing Role Group Configurations
Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.
This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)
This post covers the following main flavors:
One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.
At Cloudera engineering, we have a big support matrix. We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), CDH runs on a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and there are complex configuration combinations: highly available HDFS or simple HDFS, Kerberized or non-secure, YARN or MR1 as the execution framework, and so on. Clearly, we need an easy way to spin up a new cluster with the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.
This concept is not new; there are several other examples of Hadoop cluster automation solutions. For example, Yahoo! has its own infrastructure tools, and you can find publicly available Puppet recipes, with various degrees of completeness and maintenance. Furthermore, there are tools that work only with a particular virtualization environment. However, we needed a solution that is more powerful and easier to maintain.
Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:
Say you have a stream of items of large and unknown length that you can iterate over only once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.
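One well-known way to answer this kind of question is reservoir sampling with a reservoir of size one. The sketch below is an illustration rather than the interviewer's expected answer, and the `choose_one` name is hypothetical. It keeps the i-th item with probability 1/i, so after n items each one has survived with probability 1/n.

```python
import random


def choose_one(stream, rng=random):
    """Return one item chosen uniformly at random in a single pass.

    Returns None if the stream is empty.
    """
    chosen = None
    for count, item in enumerate(stream, start=1):
        # randrange(count) is 0 with probability 1/count, so the item seen
        # at position `count` replaces the current choice that often.
        if rng.randrange(count) == 0:
            chosen = item
    return chosen


# The stream is consumed once and its length is never needed up front.
print(choose_one(iter(range(1000))))
```

Only the current choice and a counter are held in memory, which is why the approach works for streams too large to store.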
This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.
Adding Rows With XML
The REST interface would be useless without the ability to add and update row values. The interface gives us this ability with the POST verb. By posting new rows, we can add new rows or update existing rows using the same row key.
First, let’s step through how to do this using the XML and JSON data formats. Let’s start with XML.
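As a rough sketch of the XML side, assuming the REST server expects a CellSet document with base64-encoded row keys, column names, and values, a request body could be built as follows. The `build_cellset` helper, the `message:line` column, and the sample rows are hypothetical; sending the body (with a `Content-Type: text/xml` header) is left out.

```python
import base64
import xml.etree.ElementTree as ET


def b64(text):
    """Base64-encode a string for use in the XML payload."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_cellset(rows):
    """Build a CellSet XML string from (row_key, column, value) tuples."""
    cellset = ET.Element("CellSet")
    for row_key, column, value in rows:
        # One Row element per tuple, holding a single Cell.
        row = ET.SubElement(cellset, "Row", key=b64(row_key))
        cell = ET.SubElement(row, "Cell", column=b64(column))
        cell.text = b64(value)
    return ET.tostring(cellset, encoding="unicode")


# Hypothetical rows for illustration.
payload = build_cellset([
    ("shakespeare-comedies-000001", "message:line", "first line"),
    ("shakespeare-comedies-000002", "message:line", "second line"),
])
print(payload)
```

Because one document carries several Row elements, many rows can travel in a single POST.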