Tag Archives: Cloudera’s Distribution for Hadoop

Some News Related to the Apache Hadoop Project

Categories: General

In a recent post on its blog, Yahoo! announced plans to stop distributing its own version of Hadoop and instead re-focus on improving Apache’s Hadoop releases. This is great news. Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others but are not yet collectively available in a single Apache release. Different teams working on enhancements have made their changes to distinct branches off of old releases.

Read more

Configuring Security Features in CDH3

Categories: General, Hadoop, Platform Security & Cybersecurity

Post written by Cloudera Software Engineer Aaron T. Myers.

Apache Hadoop has provided user-authorization mechanisms for some time. The Hadoop Distributed File System (HDFS) has a Unix-like permissions model to control file and directory access, and MapReduce has access control lists (ACLs) per job queue to control which users may submit jobs. These authorization schemes allow Hadoop users and administrators to specify exactly who may access Hadoop’s resources.
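As a rough illustration of the queue ACLs mentioned above, a CDH3-era cluster configures them in `conf/mapred-queue-acls.xml` (with `mapred.acls.enabled` set to `true` in `mapred-site.xml`). The user and group names below are purely illustrative; the value format is a comma-separated user list, a space, then a comma-separated group list:

```xml
<configuration>
  <!-- Who may submit jobs to the "default" queue:
       users alice and bob, plus everyone in group datascience.
       Names here are hypothetical examples. -->
  <property>
    <name>mapred.queue.default.acl-submit-job</name>
    <value>alice,bob datascience</value>
  </property>
  <!-- Who may administer (e.g. kill) jobs in that queue. -->
  <property>
    <name>mapred.queue.default.acl-administer-jobs</name>
    <value>hadoopadmin ops</value>
  </property>
</configuration>
```

HDFS permissions work like their Unix counterparts, e.g. `hadoop fs -chmod 750 /user/alice` restricts a home directory to its owner and group.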

Read more

Map-Reduce With Ruby Using Apache Hadoop

Categories: Hadoop MapReduce

Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC.

Here I demonstrate, with repeatable steps, how to fire up a Hadoop cluster on Amazon EC2, load data into HDFS (the Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to SSH into the cluster, as all tasks are run from your local machine.
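A minimal sketch of the Ruby map-reduce style the post describes, in the Hadoop Streaming model: the mapper turns each input line into `word<TAB>1` pairs, Hadoop sorts them by key, and the reducer sums the counts per word. This is a local simulation for illustration, not the post's actual scripts; names and sample input are made up.

```ruby
# Mapper step: one input line -> ["word\t1", ...] pairs.
def map_line(line)
  line.split.map { |word| "#{word.downcase}\t1" }
end

# Reducer step: sum the counts for each word.
def reduce(pairs)
  counts = Hash.new(0)
  pairs.each do |pair|
    word, n = pair.split("\t")
    counts[word] += n.to_i
  end
  counts
end

# Hadoop Streaming feeds the mapper's sorted output to the reducer;
# here we simulate that shuffle locally on two lines of input.
pairs = ["hello hadoop", "hello ruby"].flat_map { |l| map_line(l) }.sort
puts reduce(pairs).sort.map { |w, c| "#{w}\t#{c}" }
```

On a real cluster the same two steps run as standalone scripts reading STDIN and writing STDOUT, launched via the `hadoop-streaming` jar with `-mapper` and `-reducer` pointing at the Ruby files.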

Read more

A profile of Apache Hadoop MapReduce computing efficiency (continued)

Categories: Community, Guest, Hadoop, MapReduce

Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc., where he develops large-scale, distributed computing solutions.

Part II

In Part I we proposed a way to measure the performance of Hadoop MapReduce applications in an effort to better understand their computing efficiency. In this part, we describe some results and highlight both good and bad characteristics.

We selected our SIFT-M MapReduce application, described in our presentation at Hadoop World 2010 [3],

Read more

A profile of Apache Hadoop MapReduce computing efficiency

Categories: Community, Guest, Hadoop, MapReduce

Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc., where he develops large-scale, distributed computing solutions.

Part I

We were asked by one of our customers to investigate Hadoop MapReduce for solving distributed computing problems. We were particularly interested in how effectively MapReduce applications utilize computing resources. Computing efficiency is important not only for speed-up and scale-out performance but also for power consumption. Consider a hypothetical High-Performance Computing (HPC) system of 10,000 nodes running 50% idle at 50 watts per idle node,
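Working the arithmetic of that hypothetical (a quick sketch using only the figures in the excerpt; Ruby here purely for illustration):

```ruby
# Figures from the excerpt: 10,000 nodes, 50% idle, 50 W per idle node.
nodes = 10_000
idle_fraction = 0.5
watts_per_idle_node = 50.0

# Power drawn by idle nodes alone, in kilowatts.
idle_power_kw = nodes * idle_fraction * watts_per_idle_node / 1000.0
puts "#{idle_power_kw} kW drawn by idle nodes"  # -> "250.0 kW drawn by idle nodes"
```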

Read more