Category Archives: CDH

High Availability for the Hadoop Distributed File System (HDFS)

Categories: CDH Community General Hadoop HDFS

Background

Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is Hadoop's primary storage system and is responsible for storing and serving all of its data. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.

HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009,

Read more

Thoughts on Cloudera and Cisco UCS reference architecture for Apache Hadoop

Categories: CDH Cloudera Manager

Cloudera and Cisco jointly announced a reference architecture for running Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager on Cisco’s Unified Computing System (UCS) last November. It was the first Apache Hadoop reference architecture assembled by Cisco, and is proudly certified by Cloudera.

I bring a different perspective on the Cloudera-Cisco relationship, as I worked for over five years at Cisco on the software powering the Nexus 5000 series switches and the Cisco Virtual Interface Card.

Read more

Indexing Files via Solr and Java MapReduce

Categories: CDH Cloudera Manager

Several weeks ago, I set out to demonstrate the ease with which Solr and MapReduce can be integrated. I was unable to find a simple yet comprehensive primer on integrating the two technologies, so I decided to write one.

What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to Sematext for looking over the Solr bits and making sure they are sane.

Read more

MapReduce 2.0 in Apache Hadoop 0.23

Categories: CDH General Hadoop MapReduce

In "Building and Deploying MR2," we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. This post provides developers with architectural details of the new MapReduce design.

Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope for this post.

Read more

Introducing CDH4

Categories: CDH General

I’m pleased to inform our users and customers that Cloudera has released the fourth version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines input from our enterprise customers, partners, and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3.

Read more