Cloudera Engineering Blog · MapReduce Posts
Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what’s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our CDH documentation.
This is a guest repost contributed by Matteo Bertozzi, a Developer at Develer S.r.l.
Apache Hadoop’s SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key to edit, add, or remove it. The file is append-only.
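To illustrate what append-only means in practice, here is a minimal sketch in Python. It is not the SequenceFile on-disk format or Hadoop's API; the length-prefixed record layout and the function names are assumptions made for illustration. Records can only be added at the end, and finding a key requires scanning from the start.

```python
import io
import struct

def append_record(stream, key, value):
    """Append one length-prefixed key/value record at the end of the stream."""
    stream.seek(0, io.SEEK_END)
    # Hypothetical layout: two big-endian 32-bit lengths, then key, then value.
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)

def scan(stream):
    """Yield every (key, value) record in the order it was written."""
    stream.seek(0)
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)

def find_latest(stream, wanted):
    """Return the last value written for a key; there is no index, so scan it all."""
    found = None
    for key, value in scan(stream):
        if key == wanted:
            found = value
    return found
```

An "update" in this model is simply a newer record for the same key appended later, which is why `find_latest` has to read to the end.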
“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem this blog is for you.
Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The hadoop wrapper shell script does exactly this for you by building the classpath from the core libraries located in the /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce your job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?
Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC.
Here I demonstrate, with repeatable steps, how to fire up a Hadoop cluster on Amazon EC2, load data onto HDFS (the Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Previously, we described how we measure performance in Hadoop MapReduce applications in an effort to better understand their computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
We were asked by one of our customers to investigate Hadoop MapReduce for solving distributed computing problems. We were particularly interested in how effectively MapReduce applications utilize computing resources. Computing efficiency is important not only for speed-up and scale-out performance but also for power consumption. Consider a hypothetical High-Performance Computing (HPC) system of 10,000 nodes running 50% idle at 50 watts per idle node, and assume a price of 10 cents per kilowatt hour. It would cost $219,000 per year to power just the idle time. Keeping a large HPC system busy is difficult and requires huge datasets and efficient parallel algorithms. We wanted to analyze Hadoop applications to determine their computing efficiency and gain insight into tuning and optimizing these applications. We installed CDH3 onto a number of different clusters as part of our comparative study. CDH3 was preferred over the standard Hadoop installation for its recent patches and the support offered by Cloudera. In the first part of this two-part article, we’ll more formally define computing efficiency as it relates to evaluating Hadoop MapReduce applications and describe the performance metrics we gathered for our assessment. The second part will describe our results and conclude with suggestions for improvements, and will hopefully instigate further study in Hadoop MapReduce performance analysis.
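The idle-power figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year (8,760 hours); the function and parameter names are illustrative and do not come from the original post.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year

def idle_power_cost(nodes, idle_fraction, idle_watts, dollars_per_kwh):
    """Yearly cost, in dollars, of powering the idle share of a cluster."""
    idle_kilowatts = nodes * idle_fraction * idle_watts / 1000.0
    idle_kwh_per_year = idle_kilowatts * HOURS_PER_YEAR
    return idle_kwh_per_year * dollars_per_kwh

cost = idle_power_cost(nodes=10_000, idle_fraction=0.5,
                       idle_watts=50, dollars_per_kwh=0.10)
print(f"${cost:,.0f} per year")
```

The intermediate values are 250 kW of idle draw and 2,190,000 kWh per year, which at 10 cents per kilowatt hour gives the $219,000 in the text.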
Fraud has multiple meanings and the term can be easily abused. The definition of fraud has undergone multiple changes throughout the years and is as elusive as fraud itself. The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state or country. For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage. A more general definition may contain up to 9 elements.
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.
Cloudera’s Apache Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Apache Hive, Apache Pig, Apache HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.
Enter the code “london_10pct” when registering and receive a 10% discount!
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects, we have the training you are looking for.
Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and the tremendous work that has been done to simplify their installation and use. This blog post provides guidance on sizing your first Hadoop/HBase cluster. First, there are significant differences in Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum). HBase is much better for real-time read/write/modify access to tabular data. Both applications are designed for high concurrency and large data sizes. For a general discussion of Hadoop/HBase architecture and differences, please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George’s blog [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We expect a new edition of Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.
With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “What does it take to migrate?”
If you’re not familiar with CDH3b2, here’s what you need to know.
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3 we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.
A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.
While the vast majority of the Hadoop development discussion takes place on the Apache Jira and various project mailing lists, it’s often useful to meet face to face for high-bandwidth discussion. To that end, Facebook hosted the first Apache Hadoop contributors meeting yesterday at their campus in Palo Alto. Cloudera, Facebook, Yahoo! and the Apache HBase team were well-represented. It was great to see a broad cross section of Hadoop developers in one room. Contributor meetings will be held on a monthly basis, at a rotating location. While any Hadoop project contributor is welcome to attend, the current focus of the meetings is HDFS and MapReduce. The goal of the discussion is to surface and flesh out ideas rather than to make decisions; decision-making happens on the development lists. If you’ve got ideas to add, check out the meeting notes and continue the discussion.
Sanjay Radia kicked off the meeting with a discussion of development priorities. Hadoop has become a platform and industry standard for data storage and analytics. What advances are most important to users? How do we continue to innovate without disrupting the installed base? Development must maintain and improve the quality that has allowed companies to adopt Hadoop in their production environments. Fortunately there is broad agreement among contributors on development priorities: availability, compatibility, security, scalability and performance.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs, whether invoked by developers before they commit a change or by a continuous integration server such as Hudson, an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.
MR jobs are particularly difficult to test thoroughly because they run in a distributed environment. This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.
Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?
The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.
The distributed nature of MapReduce programs makes debugging a challenge. Attaching a debugger to a remote process is cumbersome, and the lack of a single console makes it difficult to inspect what is occurring when several distributed copies of a mapper or reducer are running concurrently. Furthermore, operations that work on small amounts of input (e.g., saving the inputs to a reducer in an array) fail when running at scale, causing out-of-memory exceptions or other unintended effects.
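To make the scaling hazard concrete, here is a minimal sketch in Python, written in the style of a Hadoop Streaming reducer rather than the Java API. The function names and sample data are hypothetical. The first version copies every value for a key into a list, so its memory use grows with the input; the second keeps only a running total.

```python
from itertools import groupby
from operator import itemgetter

def reduce_buffered(pairs):
    """Sum values per key after copying them into a list (memory grows with input)."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]  # whole group held in memory
        yield key, sum(values)

def reduce_streaming(pairs):
    """Sum values per key with a running total (constant memory per key)."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        total = 0
        for _, value in group:
            total += value
        yield key, total

# Input must already be sorted by key, as reducer input is after the shuffle.
sample = [("a", 1), ("a", 2), ("b", 5)]
```

Both functions return the same totals on small inputs, which is exactly why the buffered version can pass local tests and still fail on a key with millions of values.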
A full discussion of how to debug MapReduce programs is beyond the scope of a single blog post, but I’d like to introduce you to a tool we designed at Cloudera to assist you with MapReduce debugging: MRUnit.
This piece is based on the talk “Practical MapReduce” that I gave at Hadoop User Group UK on April 14.
1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it’s worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
(guest blog post by Matei Zaharia)
As Hadoop clusters grow in size and data volume, it becomes more and more useful to share them between multiple users and to isolate these users. If User 1 is running a ten-hour machine learning job, for example, this should not prevent User 2 from running a 2-minute Hive query. In November, I blogged about how Hadoop 0.19 supports pluggable job schedulers, and how we worked with Facebook to implement a Fair Scheduler for Hadoop using this new functionality. The Fair Scheduler gives each user a configurable share of the cluster when he/she has running jobs, but assigns these resources to other users when the user is inactive. Since last fall, the Fair Scheduler has been picked up by Hadoop users outside Facebook, including the Google/IBM academic Hadoop cluster. It has also received extensive testing and patches from Yahoo!. Furthermore, we’ve included the Fair Scheduler in Cloudera’s Distribution for Hadoop, where it is integrated right into the JobTracker management UI. Through production experience, testing, and feedback from users, we’ve made a lot of improvements to the Fair Scheduler, some of which are available now and others of which will come out in the next major version, which I’m calling “Fair Scheduler 2.0”. Here is a summary of the upcoming functionality:
- Fair sharing has changed from giving equal shares to each job to giving equal shares to each user. This means that users that submitted many jobs don’t get an advantage over users running a few jobs. It’s also possible to give different weights to different users.
- The fair scheduler now supports killing tasks from other users’ jobs if those jobs do not give up their slots. For each pool (by default there is one pool per user, but one can also have specially named pools), there’s a configurable timeout after which the pool can kill other jobs’ tasks in order to start running its own. This means that it’s possible to provide “service guarantees” for production jobs that are sharing a cluster with experimental queries.
- The scheduler can now assign multiple tasks per heartbeat, which is important for maintaining high utilization in large clusters.
- A technique called delay scheduling increases data locality for small jobs, improving performance in a data warehouse workload with many small jobs such as Facebook’s.
- The internal logic has been simplified so that the scheduler can support different scheduling policies within each pool, and in particular we plan to support FIFO pools. Many users have requested FIFO pools because they want to be able to queue up batch workflows on the same cluster that’s running more interactive jobs.
- Many bug fixes and performance improvements were contributed or suggested by a team stress-testing the scheduler at Yahoo!.
- The same team has also contributed Forrest web-based documentation for the fair scheduler (to be available in Hadoop 0.20).
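The per-user, weighted sharing described in the first bullet can be sketched in a few lines of Python. This is a simplified illustration and not the Fair Scheduler's actual algorithm, which also has to account for things such as pools, timeouts, and task preemption; the function name, the default weight of 1.0, and the user names are assumptions.

```python
def fair_shares(total_slots, weights, active_users):
    """Split slots among active users in proportion to their weights.

    Users with no running jobs get nothing, so their share is
    redistributed to the users who are active.
    """
    # Assumption: users without an explicit weight default to 1.0.
    active = {user: weights.get(user, 1.0) for user in active_users}
    total_weight = sum(active.values())
    if total_weight == 0:
        return {}
    return {user: total_slots * weight / total_weight
            for user, weight in active.items()}

# Equal weights: each active user gets half of a 100-slot cluster.
print(fair_shares(100, {"alice": 1, "bob": 1}, ["alice", "bob"]))
# When bob is inactive, alice receives the whole cluster.
print(fair_shares(100, {"alice": 1, "bob": 1}, ["alice"]))
```

Because the split is computed per user rather than per job, submitting more jobs does not change a user's share in this model.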
Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11”?
At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.
Editor’s note (added Nov. 9, 2013): Valuable data in an organization is often stored in relational database systems. To access that data, you could use external APIs as detailed in the blog post below, or you could use Apache Sqoop, an open source tool (packaged inside CDH) that allows users to import data from a relational database into Apache Hadoop for further processing. Sqoop can also export those results back to the database for consumption by other clients.
Apache Hadoop’s strength is that it enables ad-hoc analysis of unstructured or semi-structured data. Relational databases, by contrast, allow for fast queries of very structured data sources. A point of frustration has been the inability to easily query both of these sources at the same time. The DBInputFormat component provided in Hadoop 0.19 finally allows easy import and export of data between Hadoop and many relational databases, allowing relational data to be more easily incorporated into your data processing pipeline.
(guest blog post by Matei Zaharia)
When Apache Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users. The benefits of sharing are tremendous: with all the data in one place, users can run queries that they may never have been able to execute otherwise, and costs go down because system utilization is higher than building a separate Hadoop cluster for each group. However, sharing requires support from the Hadoop job scheduler to provide guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users.
It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
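The lookup-table scenario can be sketched as follows. This is a minimal Python illustration of the pattern (parse the table once per task, then consult it for every record), not the DistributedCache API itself. The tab-separated file layout, the function names, and the sample data are assumptions, and `io.StringIO` stands in for the local copy of the file that the cache would place on the task node.

```python
import io

def load_lookup(table_file):
    """Parse tab-separated 'code<TAB>name' lines into a dict, once per task."""
    lookup = {}
    for line in table_file:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        code, name = line.split("\t", 1)
        lookup[code] = name
    return lookup

def map_records(records, lookup):
    """Replace each record's code with its name; unknown codes become 'UNKNOWN'."""
    for record in records:
        code, value = record.rstrip("\n").split("\t", 1)
        yield lookup.get(code, "UNKNOWN"), value

# io.StringIO stands in for the locally cached copy of the lookup file.
cached = io.StringIO("us\tUnited States\nca\tCanada\n")
lookup = load_lookup(cached)
results = list(map_records(["us\t10", "ca\t7", "xx\t1"], lookup))
```

Loading the table before the record loop, rather than inside it, is the point of the pattern: the file is read once per task instead of once per record.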
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back.