Cloudera Developer Blog · Tools Posts
Our thanks to Concurrent Inc. for the how-to below about using Cascading Pattern with CDH. Cloudera recently tested CDH 4.4 with the Cascading Compatibility Test Suite, verifying its compatibility with Cascading 2.2.
Cascading Pattern is a machine-learning project within the Cascading development framework used to build enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other computing topologies that allows enterprises to leverage existing skills and resources to build data processing applications on Hadoop, without the need for specialized Hadoop skills.
Pattern, in particular, builds on an industry standard called Predictive Model Markup Language (PMML), which lets data scientists use their favorite statistical and analytics tools (such as R, SAS, Oracle, and so on) to export predictive models and quickly run them on data sets stored in Hadoop. Pattern’s benefits include reduced development costs, time savings, and fewer licensing issues at scale – all while making use of Hadoop clusters, the core competencies of analytics staff, and the existing intellectual property in the predictive models.
Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.
This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)
This post covers the following main flavors:
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.
When the data science team sat down with our training team to begin planning our next data science course, we quickly discovered that there weren’t any open-source tools in the Hadoop ecosystem that would allow students to perform the data preparation and model evaluation techniques that we wanted them to learn. For example, it wasn’t possible to quickly summarize a CSV file of numerical and categorical variables via a single MapReduce job, and then use that summary to convert the CSV file into the distributed matrix format that is used as input to many of Mahout’s algorithms. We were also concerned that there wasn’t a lot of guidance as to how to choose values for many of the parameters that Mahout’s algorithms require, and that this might discourage new data scientists from using these models effectively.
API access was a new feature introduced in Cloudera Manager 4.0 (download the free edition here). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
Cloudera Manager API Basics
The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.
You can read the full API documentation here.
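As a rough illustration of the basics above, the sketch below builds (but does not send) an authenticated GET request using only the Python standard library. The host name, the admin/admin credentials, port 7180, and the /api/v1/clusters path are assumptions chosen for illustration, and build_cm_request is a hypothetical helper rather than part of any Cloudera library; check the API documentation for the endpoints your CM version actually exposes.

```python
import base64
import urllib.request


def build_cm_request(host, path, user, password, port=7180):
    """Build (without sending) a GET request that carries HTTP Basic Auth.

    Hypothetical helper: the default port and any path passed in are
    assumptions about a particular CM deployment.
    """
    url = f"http://{host}:{port}{path}"
    # HTTP Basic Authentication: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    # The API uses JSON serialization, so ask for JSON back.
    req.add_header("Accept", "application/json")
    return req


# Assumed host, endpoint, and credentials, for illustration only.
req = build_cm_request("cm-host.example.com", "/api/v1/clusters", "admin", "admin")
print(req.full_url)
```

Against a live server, the request could then be passed to urllib.request.urlopen and the response body parsed with json.load. Because the API accepts the same users as the web UI, the account used here would have the same privileges it has in the UI.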
Interacting with the API
For those new to it, Cloudera Manager is the first and market-leading management platform for CDH (Cloudera’s Distribution Including Apache Hadoop). Enterprise customers are coming to expect an end-to-end tool that manages the entire lifecycle of their Hadoop operations. In fact, in a recent Cloudera customer survey, an overwhelming 95% emphasized the need for this approach.
Cloudera Manager sets the standard for enterprise deployment by delivering granular visibility into and control over every part of CDH – empowering operators to improve cluster performance, enhance quality of service, increase compliance, and reduce administrative costs. We also have a FREE edition to get you started, so try it out today! (BTW, for more information on this subject, you can attend a free Webinar on Wednesday, Sept. 19, on the topic “How CBS Interactive Uses Cloudera Manager to Effectively Manage Their Hadoop Cluster”.)
Learn how to configure a basic Maven project that will be able to build applications against CDH
Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.
Maven projects are defined using an XML file called pom.xml, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. A complete example of the pom.xml described below, which can be used with CDH, is available on Github. (To use the example, you’ll need at least Maven 2.0 installed.) If you’ve never set up a Maven project before, you can get a jumpstart by using Maven’s quickstart archetype, which generates a small initial project layout. Choose a group ID (typically a top-level package name) and an artifact ID (the name of the project), and execute the following command with the groupId and artifactId arguments filled in: