Cloudera Engineering Blog · HDFS Posts
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).
HttpFs is a proxy so, unlike WebHDFS, it does not require clients be able to access every machine in the cluster. This allows clients to to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
Warning: The procedure described below can cause data loss. Contact Cloudera Support before attempting it.
Most system administrators have had to deal with a bad hard disk at some point. One moment, the hard disk is a mechanical marvel; the next, it is an expensive paperweight.
This was originally posted on the Hadoop Summit 2012 blog.
Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on HDFS NameNode High Availability, one of the hottest topics in the Apache Hadoop space today.
Question: Tell us about your current role and how you interact with Apache Hadoop?
Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.
HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Apache Lucene and Apache Solr
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy:
What is Hoop?
Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.
Hoop can be used to:
A common question on the Apache Hadoop mailing lists is what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.
When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.