Cloudera Blog · How-to Posts
How-to: Configure Eclipse for Hadoop Contributions
- by Karthik Kambatla
- May 15, 2013
- no comments
Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.
This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)
This post covers the following main flavors:
How-to: Automate Your Hadoop Cluster from Java
One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.
Taming Complexity
At Cloudera engineering, we have a big support matrix: We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), and CDH works across a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and complex configuration combinations — highly available HDFS or simple HDFS, Kerberized or non-secure, using YARN or MR1 as the execution framework, etc. Clearly, we need an easy way to spin-up a new cluster that has the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.
This concept is not new; there are several other examples of Hadoop cluster automation solutions. For example, Yahoo! has its own infrastructure tools, and you can find publicly available Puppet recipes, with various degrees of completeness and maintenance. Furthermore, there are tools that work only with a particular virtualization environment. However, we needed a solution that is more powerful and easier to maintain.
Algorithms Every Data Scientist Should Know: Reservoir Sampling
Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:
Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.
How-to: Use the Apache HBase REST Interface, Part 2
This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.
Adding Rows With XML
The REST interface would be useless without the ability to add and update row values. The interface gives us this ability with the POST verb. By posting new rows, we can add new rows or update existing rows using the same row key.
First, let’s step through how to do this using the XML and JSON data formats. Let’s start with XML.
How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager
- by Emanuel Buzek
- March 26, 2013
- 18 comments
Cloudera Manager 4.5 includes a new express installation wizard for Amazon Web Services (AWS) EC2. (This feature is also available in Cloudera Manager Free Edition.) Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the new open source distributed query engine for Apache Hadoop) on EC2 as easily as possible - and thus is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
The new distinguishing feature is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to worry about launching the instances, authorizing SSH keys, and configuring a firewall. All this can now be done from within Cloudera Manager!
Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.
How-to: Analyze Twitter Data with Hue
Hue 2.2 , the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.
Collecting Data
The first step is to create the “flume” user and his home on the HDFS where the data will be stored. This can be done via the User Admin application.
How-to: Import a Pre-existing Oozie Workflow into Hue
Hue is an open-source web interface for Apache Hadoop packaged with CDH that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface to build workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards with operations such as killing, suspending, or resuming a job.
Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows within Hue that were created outside of Hue. As of Hue 2.2, importing a pre-existing Oozie workflow by its XML definition is now possible.
How to import a workflow
Importing a workflow is pretty straightforward. All it requires is the workflow definition file and access to the Oozie application in Hue. Follow these steps to import a workflow:
- Go to Oozie Editor/Dashboard > Workflows and click the “Import” button.
How To: Use Oozie Shell and Java Actions
- by Robert Kanter
- March 18, 2013
- no comments
Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and Distcp jobs; it also has a Shell action and a Java action. These last two actions allow us to execute any arbitrary shell command or Java code, respectively.
In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.
Example Use Case
Suppose we’d like to design a workflow that determines which earthquakes from the last 30 days have a magnitude greater than or equal to that of the largest earthquake in the last hour; also, we’d like to run this workflow every hour. One last requirement for our workflow is that in order to save bandwidth and time, we’d like to be able to skip downloading and processing the 30 days of earthquake data if there were no “large” earthquakes within the last hour; because “large” is subjective, we’ll just go with 3.2 for this example but we should make this easy to configure.
How-to: Use the Apache HBase REST Interface, Part 1
There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.
There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.
This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiples rows using XML and JSON. The full code samples can be found on my GitHub account.
HBase REST Basics
How-to: Set Up Cloudera Manager 4.5 for Apache Hive
- by Darren Lo
- March 06, 2013
- no comments
Last week Cloudera released the 4.5 release of Cloudera Manager, the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5.
For details about other new features in this release, please see the full release notes: