We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.
Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.
Hue is an open-source web interface for Apache Hadoop, packaged with CDH, that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface for building workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards, with operations such as killing, suspending, or resuming a job.
Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows within Hue that had been created outside of Hue. As of Hue 2.2, you can import a pre-existing Oozie workflow from its XML definition.
How to import a workflow
Importing a workflow is pretty straightforward. All it requires is the workflow definition file and access to the Oozie application in Hue. Follow these steps to import a workflow:
- Go to Oozie Editor/Dashboard > Workflows and click the “Import” button.
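The file you supply here is an ordinary Oozie workflow definition (workflow.xml). As a hedged sketch of what such a file looks like, with the workflow name, node names, and mapper class purely hypothetical:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="mr-node"/>
      <action name="mr-node">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <!-- hypothetical mapper class for illustration -->
              <name>mapred.mapper.class</name>
              <value>com.example.MyMapper</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </kill>
      <end name="end"/>
    </workflow-app>

Once imported, the workflow can be edited and managed like any workflow created in Hue.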
Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and DistCp jobs; it also has a Shell action and a Java action. These last two actions allow us to execute arbitrary shell commands or Java code, respectively.
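In a workflow definition, a Shell action looks something like the following hedged sketch (the script name, the minMagnitude argument, and the node names are hypothetical; a Java action is similar but takes a <main-class> instead of an <exec>):

    <action name="check-last-hour">
      <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>check-last-hour.sh</exec>
        <argument>${minMagnitude}</argument>
        <!-- ship the script with the workflow -->
        <file>check-last-hour.sh</file>
        <!-- make the script's stdout available to later nodes -->
        <capture-output/>
      </shell>
      <ok to="should-process"/>
      <error to="fail"/>
    </action>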
In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.
Example Use Case
Suppose we’d like to design a workflow that determines which earthquakes from the last 30 days had a magnitude greater than or equal to that of the largest earthquake in the last hour, and we’d like to run this workflow every hour. One last requirement: to save bandwidth and time, we’d like to skip downloading and processing the 30 days of earthquake data if there were no “large” earthquakes within the last hour. Because “large” is subjective, we’ll go with a magnitude of 3.2 for this example, but we should make the threshold easy to configure.
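That skip requirement maps naturally onto an Oozie decision node. A hedged sketch, assuming a preceding shell action named check-last-hour that used <capture-output/> to emit a hypothetical largeQuakeFound flag:

    <decision name="should-process">
      <switch>
        <!-- only download the 30 days of data if the last hour had a large quake -->
        <case to="download-30-days">
          ${wf:actionData('check-last-hour')['largeQuakeFound'] eq 'true'}
        </case>
        <default to="end"/>
      </switch>
    </decision>

The configurable threshold can then be passed in as a workflow property (minMagnitude in the earlier sketch) rather than hardcoded in the script.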
Hue lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features a file browser for HDFS, an Apache Oozie application for creating workflows of data processing jobs, a job designer and browser for MapReduce, query editors for Apache Hive and Cloudera Impala, a shell, and a collection of Hadoop APIs.
The goal of this release was to add a set of new features and improve the user experience. Read on for a list of the major changes (from 304 commits).
Our thanks to guest author Jon Natkins (@nattyice) of WibiData for the following post!
Today, many (if not most) companies have ETL or data enrichment jobs that are executed on a regular basis as data becomes available. In this scenario it is important to minimize the lag time between data being created and being ready for analysis.
CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects, includes a framework called Apache Oozie that can be used to design complex job workflows and coordinate them to occur at regular intervals. In this how-to, you’ll review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop. The example involves adding new data to a Hive table every hour, using Oozie to schedule the execution of recurring Hive scripts. (For the full context of the example, see the “Analyzing Twitter Data with Apache Hadoop” series.)
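At its core, such a coordinator is a small XML file that points at a workflow and gives it a frequency. A minimal hedged sketch (the dates, app name, and workflowRoot variable are placeholders):

    <coordinator-app name="hourly-hive" frequency="${coord:hours(1)}"
                     start="2012-09-25T00:00Z" end="2013-09-25T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
      <action>
        <workflow>
          <!-- the workflow at this path runs the recurring Hive script -->
          <app-path>${workflowRoot}/hive-workflow</app-path>
        </workflow>
      </action>
    </coordinator-app>

Every hour between the start and end times, Oozie materializes one coordinator action that submits the referenced workflow.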
Adding Data to Hive Tables
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
Hue is a web interface for Apache Hadoop that makes common Hadoop tasks, such as running MapReduce jobs, browsing HDFS, and creating Apache Oozie workflows, easier. In this post, we’re going to focus on the dynamic workflow builder for Oozie that will be released in Hue 2.2.0. (For a high-level description of Oozie integration in Hue, see this blog post.)
Basic Operations on Actions
Basic operations on actions, such as creating, updating, and deleting a node, have been simplified.
As Apache Oozie, the workflow engine for Apache Hadoop, continues to receive wider adoption from our customers and the community, we’re seeing patterns in users’ biggest challenges. One such point of difficulty is setting up and using Oozie’s ShareLib, which allows JARs to be shared by different workflows. This blog post is intended to help you with those tasks.
A missing or improperly installed ShareLib will cause some action types (DistCp, Streaming, Pig, Sqoop, and Hive) to fail. In this case, you’ll typically see one of the following exceptions in the Oozie and JobTracker logs:
    java.lang.ClassNotFoundException: org.apache.hadoop.tools.DistCp
    java.lang.NoClassDefFoundError: org/apache/pig/Main
    java.lang.ClassNotFoundException: org.apache.pig.Main
    java.lang.NoClassDefFoundError: org/apache/sqoop/Sqoop
    java.lang.ClassNotFoundException: org.apache.sqoop.Sqoop
    java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
    java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
    java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
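The usual remedies are to install the ShareLib into HDFS (commonly under /user/oozie/share/lib) and to tell each job to pick it up by setting oozie.use.system.libpath=true. A hedged job.properties sketch, with host names and paths as placeholders:

    # job.properties (sketch; hosts and paths are placeholders)
    nameNode=hdfs://namenode.example.com:8020
    jobTracker=jobtracker.example.com:8021
    oozie.wf.application.path=${nameNode}/user/me/my-pig-workflow
    # pull Pig, Hive, Sqoop, etc. JARs from the ShareLib at runtime
    oozie.use.system.libpath=true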
Hue is a web interface for Apache Hadoop that makes common Hadoop tasks, such as running MapReduce jobs, browsing HDFS, and creating Apache Oozie workflows, easier. (To learn more about the integration of Oozie and Hue, see this blog post.) In this post, we’re going to focus on how one of the fundamental components in Hue, Useradmin, has matured.
New User and Permission Features
User and permission management in Hue has changed drastically over the past year. Oozie workflows, Apache Hive queries, and MapReduce jobs can be shared with other users or kept private. Permissions exist at the app level: access to particular apps can be restricted, as can certain sections within them. For instance, access to the shell app can be restricted, as can access to the Apache HBase, Apache Pig, and Apache Flume shells themselves. Access privileges are defined for groups, and users can be members of one or more groups.
Changes to Users, Groups, and Permissions
Hue now supports authentication against PAM, SPNEGO, and LDAP servers. Users and groups can be imported from LDAP and treated like their non-external counterparts; the import is manual, on a per-user or per-group basis. Using the LDAP authentication backend allows users to log in with their LDAP password. This is configured in /etc/hue/hue.ini by changing the ‘desktop.auth.backend’ setting to ‘desktop.auth.backend.LdapBackend’; the LDAP server to authenticate against is configured through the settings under ‘desktop.ldap’.
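As a hedged sketch of the relevant hue.ini sections (the server URL and base DN are placeholders for your own directory):

    [desktop]
      [[auth]]
      # switch from the default backend to LDAP authentication
      backend=desktop.auth.backend.LdapBackend

      [[ldap]]
      # placeholder server details; point these at your directory
      ldap_url=ldap://ldap.example.com
      base_dn="dc=example,dc=com"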
This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.
Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.
Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels, and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.
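As a hedged illustration of how those three structures are wired together, here is a minimal agent configuration; the agent name, source command, and HDFS path are placeholders, and a real pipeline would substitute a source appropriate to its data:

    # flume.conf (sketch; names, command, and paths are placeholders)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # source: tail a log file and turn lines into events
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app.log
    agent1.sources.src1.channels = ch1

    # channel: buffer events in memory between source and sink
    agent1.channels.ch1.type = memory

    # sink: write events to their resting location in HDFS
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode.example.com:8020/flume/events
    agent1.sinks.sink1.channel = ch1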