The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.
Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances,
A few months ago, my colleague Charles Zedlewski wrote a great piece explaining Apache Hadoop version numbering. The post can be summed up with the following diagram:
While Charles’s post does a great job of explaining the history of Apache Hadoop version numbering, it doesn’t help users understand where Hadoop version numbers are headed.
Hadoop has of late been frequently referred to as “an operating system for the cloud.” Without disputing the accuracy of this description,
Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.
HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009,