Cloudera Developer Blog · MapReduce Posts
Hue is a web interface for Apache Hadoop that simplifies common Hadoop tasks such as running MapReduce jobs, browsing HDFS, and creating Apache Oozie workflows. (To learn more about the integration of Oozie and Hue, see this blog post.) In this post, we’re going to focus on how one of the fundamental components in Hue, Useradmin, has matured.
New User and Permission Features
User and permission management in Hue has changed drastically over the past year. Oozie workflows, Apache Hive queries, and MapReduce jobs can be shared with other users or kept private. Permissions exist at the app level: access to particular apps can be restricted, as can access to certain sections within an app. For instance, access to the shell app can be restricted, as can access to the Apache HBase, Apache Pig, and Apache Flume shells themselves. Access privileges are defined for groups, and users can be members of one or more groups.
Changes to Users, Groups, and Permissions
Hue now supports authentication against PAM, SPNEGO, and an LDAP server. Users and groups can be imported from LDAP and are then treated like their non-external counterparts. The import is manual and is performed on a per-user or per-group basis. Users can authenticate through different backends; with the LDAP authentication backend, they log in using their LDAP password. This can be configured in /etc/hue/hue.ini by changing the ‘desktop.auth.backend’ setting to ‘desktop.auth.backend.LdapBackend’. The LDAP server to authenticate against can be configured through the settings under ‘desktop.ldap’.
Announcing the Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase
The following is a guest post from Aaron Kimball, who was Cloudera’s first engineer and the creator of the Apache Sqoop project. He is the Founder and CTO at WibiData, a San Francisco-based company building big data applications.
Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.
While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious and error-prone, and it makes application interoperability more challenging in the long run.
Today we bring you a brief interview with Alex Holmes, author of the new book, Hadoop in Practice (Manning). You can learn more about the book and download a free sample chapter here.
There are a few good Hadoop books on the market right now. Why did you decide to write this book, and how is it complementary to them?
When I started working with Hadoop I leaned heavily on Tom White’s excellent book, Hadoop: The Definitive Guide (O’Reilly Media), to learn about MapReduce and how the internals of Hadoop worked. As my experience grew and I started working with Hadoop in production environments I had to figure out how to solve problems such as moving data in and out of Hadoop, using compression without destroying data locality, performing advanced joining techniques and so on. These items didn’t have a lot of coverage in existing Hadoop books, and that’s really the idea behind Hadoop in Practice – it’s a collection of real-world recipes that I learned the hard way over the years.
Hadoop in Practice covers more advanced aspects of working with Hadoop such as MapReduce and HDFS patterns, performance tuning and debugging. The book also looks at how Hadoop can be used as a platform for data science and for data warehousing by studying R integration techniques, and intermediary Pig and Hive recipes. Data mining is another important topic today, and a book on Hadoop isn’t complete without a look at how Mahout lets you run your favorite algorithms at scale.
Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.
Some of the more noteworthy changes in this release include:
With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?
This post aims to clarify these two terms.
What is YARN?
YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications.
API access was a new feature introduced in Cloudera Manager 4.0 (download the free edition here). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
Cloudera Manager API Basics
The CM API is an HTTP REST API that uses JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the web UI. API users have the same privileges as they do in the web UI.
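As a rough illustration of these basics, the sketch below builds (but does not send) an HTTP Basic Authentication request using only the Python standard library. The host name, port 7180, the /api/v1/clusters path, and the admin/admin credentials are assumptions chosen for illustration, and build_cm_request is a hypothetical helper, not part of any Cloudera client library; substitute the values for your own CM deployment.

```python
import base64
import urllib.request


def build_cm_request(host, port, path, username, password):
    """Build (without sending) a GET request carrying HTTP Basic Auth.

    The API is served on the same host and port as the CM web UI, so the
    URL is simply that address plus an API path.
    """
    url = f"http://{host}:{port}{path}"
    # HTTP Basic Authentication: base64-encode "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    # The API serializes responses as JSON.
    request.add_header("Accept", "application/json")
    return request


# Assumed values for illustration only: host, port, path, and credentials.
req = build_cm_request("cm-host.example.com", 7180, "/api/v1/clusters", "admin", "admin")
print(req.get_method(), req.full_url)
```

To actually issue the call, pass the request to urllib.request.urlopen and decode the response body with json.load. Because the API accepts the same users as the web UI, the account used here has exactly the privileges it has in the UI.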
You can read the full API documentation here.
Interacting with the API
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Table 1. Summary of Hadoop workloads analyzed
The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
I recently found an interesting dataset, called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, location (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs, and each song is represented as one separate line of text. The dataset is publicly available, and you can find it at Infochimps or Amazon S3. Since the total size of this data is around 218 GB, processing it on a single machine may take a very long time.
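The figures quoted above can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 GB = 1,000 MB) and treats "about 3,000 songs" as exactly 3,000 per file, so the results are rough estimates rather than measured values.

```python
# Figures taken from the dataset description above.
NUM_FILES = 339
SONGS_PER_FILE = 3_000   # "about 3,000" treated as exact (assumption)
TOTAL_SIZE_GB = 218      # "around 218 GB", decimal units assumed

# Total number of songs: close to the "million" in the dataset's name.
total_songs = NUM_FILES * SONGS_PER_FILE

# Average size of one file in MB, and of one song (one line of text) in KB.
avg_file_mb = TOTAL_SIZE_GB * 1_000 / NUM_FILES
avg_song_kb = TOTAL_SIZE_GB * 1_000_000 / total_songs

print(f"songs: {total_songs:,}")                 # 1,017,000
print(f"average file size: {avg_file_mb:.0f} MB")  # about 643 MB
print(f"average song record: {avg_song_kb:.0f} KB")  # about 214 KB
```

Each line of text therefore carries roughly 200 KB of data, which is why the dataset is so large even though it contains no audio.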
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following: