Cloudera Developer Blog · How-to Posts
Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to let Cloudera Manager users provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible, which makes it currently the fastest way to provision a Cloudera Manager-managed cluster on EC2. Note that such clusters are intended for testing and development purposes only; they are not supported for production workloads.
The distinguishing new feature in version 4.5 is that Cloudera Manager can now launch and configure the EC2 instances for you, so you no longer have to worry about launching instances, authorizing SSH keys, or configuring a firewall yourself. All of this can now be done from within Cloudera Manager!
Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.
Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples from the previous series have been updated for the recent CDH4.2 release.
The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.
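If you prefer to script this step instead of using the User Admin application, the same result can be achieved through WebHDFS. The sketch below is illustrative, not from the original post; it assumes WebHDFS is enabled on the NameNode and uses a hypothetical NameNode address.

```python
# Sketch: create /user/flume and hand it to the flume user via WebHDFS.
# The NameNode host/port and the use of the "hdfs" superuser are assumptions.
import requests

NAMENODE = "http://namenode.example.com:50070"  # hypothetical host
PATH = "/webhdfs/v1/user/flume"

# Create the home directory as the HDFS superuser.
r = requests.put(NAMENODE + PATH, params={"op": "MKDIRS", "user.name": "hdfs"})
r.raise_for_status()

# Change ownership of the directory to the flume user.
r = requests.put(NAMENODE + PATH,
                 params={"op": "SETOWNER", "owner": "flume", "user.name": "hdfs"})
r.raise_for_status()
print("created and chowned /user/flume")
```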
Hue is an open-source web interface for Apache Hadoop packaged with CDH that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface to build workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards with operations such as killing, suspending, or resuming a job.
Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows within Hue that had been created outside of Hue. As of Hue 2.2, you can import a pre-existing Oozie workflow via its XML definition.
How to import a workflow
Importing a workflow is pretty straightforward. All it requires is the workflow definition file and access to the Oozie application in Hue. Follow these steps to import a workflow (a minimal example definition appears after the steps):
- Go to Oozie Editor/Dashboard > Workflows and click the “Import” button.
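For reference, the definition you import is an ordinary Oozie workflow XML file. Here is a minimal, hypothetical example; the action, names, and paths are illustrative and not from the original post:

```xml
<workflow-app name="imported-example" xmlns="uri:oozie:workflow:0.4">
  <start to="cleanup"/>
  <!-- A simple fs action that deletes an output directory. -->
  <action name="cleanup">
    <fs>
      <delete path="${nameNode}/user/${wf:user()}/output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Cleanup failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```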
Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and DistCp jobs; it also has a Shell action and a Java action. These last two actions allow us to execute any shell command or Java code, respectively.
In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.
Example Use Case
Suppose we’d like to design a workflow that determines which earthquakes from the last 30 days had a magnitude greater than or equal to that of the largest earthquake in the last hour, and we’d like to run this workflow every hour. One last requirement: in order to save bandwidth and time, we’d like to be able to skip downloading and processing the 30 days of earthquake data if there were no “large” earthquakes within the last hour. Because “large” is subjective, we’ll just go with a magnitude of 3.2 for this example, but we should make this easy to configure.
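To make the skip logic concrete, here is a rough Python sketch of the hourly check (in the actual workflow, this is the kind of job a Shell or Java action would do, with its exit status feeding an Oozie decision node). The feed URL and column name are assumptions for illustration; USGS publishes several hourly summary feeds.

```python
# Sketch: decide whether the last hour contained a "large" earthquake.
import csv
import io
import sys
import urllib.request

# Hypothetical feed URL; adjust to whichever USGS summary feed you use.
FEED_URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv"
THRESHOLD = 3.2  # the configurable "large" cutoff

with urllib.request.urlopen(FEED_URL) as resp:
    rows = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    mags = [float(r["mag"]) for r in rows if r.get("mag")]

largest = max(mags, default=0.0)
print(f"largest quake in the last hour: {largest}")

# Exit status tells the workflow whether to continue (0) or skip (1).
sys.exit(0 if largest >= THRESHOLD else 1)
```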
There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.
There are two main approaches for doing that: one is the Thrift interface, which is the faster and more lightweight of the two options; the other is the REST interface, which uses HTTP verbs to perform actions and gives developers a wide choice of languages and programs to use.
This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.
HBase REST Basics
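As a hedged preview of the table-administration material, here is a minimal Python sketch against the REST server using the requests library. The host, port, table, and column family names are assumptions; the REST server listens on port 8080 by default.

```python
# Sketch: create and then drop a table via the HBase REST interface.
import requests

BASE = "http://localhost:8080"  # hypothetical REST server address
TABLE = "messages"

# Creating (or updating) a table is a PUT of its schema document.
schema = ('<?xml version="1.0" encoding="UTF-8"?>'
          '<TableSchema name="%s">'
          '<ColumnSchema name="cf"/>'
          '</TableSchema>' % TABLE)
r = requests.put("%s/%s/schema" % (BASE, TABLE), data=schema,
                 headers={"Content-Type": "text/xml"})
print(r.status_code)  # 201 Created on success

# List all tables as JSON to confirm the new table exists.
r = requests.get(BASE + "/", headers={"Accept": "application/json"})
print(r.json())

# Dropping the table is a DELETE against the same schema resource.
r = requests.delete("%s/%s/schema" % (BASE, TABLE))
print(r.status_code)  # 200 OK on success
```

Note how the interface is driven entirely by HTTP verbs and content-type negotiation; the same endpoints speak XML, JSON, or protobuf depending on the headers you send.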
Last week Cloudera released version 4.5 of Cloudera Manager, the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5 (and later).
For details about other new features in this release, please see the full release notes:
Hadoop network encryption is a feature introduced in Apache Hadoop 2.0.2-alpha and in CDH4.1.
In this blog post, we’ll first cover Hadoop’s pre-existing security capabilities. Then, we’ll explain why network encryption may be required and provide some details on how it has been implemented. At the end, you’ll find step-by-step instructions to help you set up a Hadoop cluster with network encryption.
A Bit of History on Hadoop Security
Starting with Apache Hadoop 0.20.20x, and available in the Hadoop 1 and Hadoop 2 releases (as well as the CDH3 and CDH4 releases), Hadoop supports Kerberos-based authentication, commonly referred to as Hadoop Security. When Hadoop Security is enabled, users must authenticate (using Kerberos) in order to read and write data in HDFS or to submit and manage MapReduce jobs. In addition, all Hadoop services authenticate with each other using Kerberos.
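Network encryption builds on that foundation. As a hedged preview of the setup covered later, the relevant knobs live in the standard Hadoop configuration files; the property names below are real Hadoop settings, while the values shown are illustrative:

```xml
<!-- core-site.xml: encrypt Hadoop RPC (requires Hadoop Security/Kerberos). -->
<property>
  <name>hadoop.rpc.protection</name>
  <!-- one of: authentication | integrity | privacy -->
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the HDFS block data transfer protocol. -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```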
UPDATED 20130424: The new RHadoop treats output to Streaming a bit differently, so do.trace=FALSE must be set in the call to randomForest.
UPDATED 20130408: Antonio Piccolboni, the author of RHadoop, has improved the code somewhat using his substantially greater experience with R. The most material change is that the latest version of RHadoop can bind multiple calls to keyval correctly.
Internet-scale data sets present a unique challenge to traditional machine-learning techniques, such as fitting random forests or “bagging”. In order to fit a classifier to a large data set, it’s common to generate many smaller data sets derived from the initial large data set (i.e., resampling). There are two reasons for this:
In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:
It’s widely accepted that you should never design or implement your own cryptographic algorithms but rather use well-tested, peer-reviewed libraries instead. The same can be said of distributed systems: Making up your own protocols for coordinating a cluster will almost certainly result in frustration and failure.
Architecting a distributed system is not a trivial problem; it is very prone to race conditions, deadlocks, and inconsistency. Making cluster coordination fast and scalable is just as hard as making it reliable. That’s where Apache ZooKeeper, a coordination service that gives you the tools you need to write correct distributed applications, comes in handy.
With ZooKeeper, these difficult problems are solved once, allowing you to build your application without trying to reinvent the wheel. ZooKeeper is already used by Apache HBase, HDFS, and other Apache Hadoop projects to provide highly available services and, in general, to make distributed programming easier. In this blog post you’ll learn how you can use ZooKeeper to easily and safely implement important features in your distributed software.
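As a small taste of one such feature, here is a sketch of a distributed lock using the kazoo Python client for ZooKeeper. This is an illustrative alternative to the Java API the post may use; the ensemble address, lock path, and identifier are assumptions.

```python
# Sketch: a distributed lock with ZooKeeper via the kazoo client.
# Only one process cluster-wide holds the lock at any given time.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
zk.start()

# All contenders must agree on the same znode path.
lock = zk.Lock("/app/lock", identifier="worker-1")

with lock:  # blocks until this process acquires the lock
    print("lock held; doing work no other contender can do concurrently")

zk.stop()
```

Under the hood, each contender creates an ephemeral sequential znode and waits on the one just ahead of it, which is exactly the kind of subtle coordination protocol you want a well-tested library to handle for you.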