Cloudera Developer Blog · HDFS Posts
The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.
Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances, including (but not limited to):
One of the key principles behind Apache Hadoop is that moving computation is cheaper than moving data — whenever possible, we move the computation to the data rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads”: reads where the reader is on the same node as the data.
Initially, local reads in HDFS were handled the same way as remote reads: the client connected to the DataNode over a TCP socket and transferred the data via DataTransferProtocol. This approach was simple, but it had some downsides. For example, the DataNode had to keep a thread and a TCP socket open for each client reading a block, and there was the overhead of the TCP protocol in the kernel as well as the overhead of DataTransferProtocol itself. There was room to optimize.
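The optimization that grew out of this is short-circuit local reads, which let a client on the same node bypass the DataNode’s TCP path and read the block file directly over a Unix domain socket. A minimal hdfs-site.xml sketch of how the feature is enabled (the socket path below is an example value):

```xml
<!-- hdfs-site.xml: enabling short-circuit local reads.
     The domain socket path is an example value; it must live in a
     directory only root and the hdfs user can write to. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```

With these set, a local DFSClient asks the DataNode for an open file descriptor over the domain socket and then reads the block directly, avoiding the per-client thread and the DataTransferProtocol round trips described above.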
Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.
The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.
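For comparison with the File Browser GUI, the same basic HDFS file operations can be performed from a gateway host with the `hadoop fs` command line (the paths below are illustrative, not from the demo):

```shell
# Illustrative command-line equivalents of common File Browser operations
hadoop fs -mkdir /user/demo/reports          # create a directory
hadoop fs -put sales.csv /user/demo/reports  # upload a local file
hadoop fs -ls /user/demo/reports             # list directory contents
hadoop fs -cat /user/demo/reports/sales.csv  # view a file's contents
hadoop fs -rm /user/demo/reports/sales.csv   # delete a file
```

Hue’s File Browser exposes these same operations (browse, upload, view, rename, delete) without requiring shell access to the cluster.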
Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce, and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements, the more noteworthy changes include:
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
At Cloudera, we take great pride in drinking our own champagne. That pride extends to our support team in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots arrive, they land in a CDH cluster at Cloudera, where various forms of data processing and aggregation are performed.
Today, the system powers real-time support via an application we call Cloudera Support Interface (CSI). When a support employee looks at a ticket, they can use CSI to examine the customer’s latest snapshot and see cluster stats such as version information, number of nodes in service, which services are used, and so on. CSI also visualizes different aggregations and groupings, such as by version, which allows us to detect misconfigured clusters or issues introduced during upgrade or installation.
A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.
Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.
Limitations of NameNode HA in Previous Versions
As described in the March blog post, NameNode High Availability relies on shared storage: in particular, it requires a place to store the HDFS edit log that can be written by the Active NameNode and simultaneously read by the Standby NameNode. In addition, the shared storage must itself be highly available; if it becomes inaccessible, the Active NameNode can no longer accept namespace edits.
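In CDH 4.1, that shared storage is provided by a quorum of JournalNode daemons rather than an NFS filer. A minimal hdfs-site.xml sketch of the relevant properties (the hostnames and the “mycluster” nameservice ID are placeholder values):

```xml
<!-- hdfs-site.xml: both NameNodes write edits to a quorum of JournalNodes.
     Hostnames and the "mycluster" nameservice ID are placeholders. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <!-- local directory where each JournalNode stores its copy of the edits -->
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/1/dfs/jn</value>
</property>
```

Because an edit is considered durable once a majority of JournalNodes have acknowledged it, the system tolerates the failure of any single JournalNode with no special hardware or external software.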
Update time! As a reminder, Cloudera releases a major version of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually, and then updates CDH every three months. Updates primarily comprise bug fixes, but we also add enhancements. We only include fixes or enhancements that maintain compatibility, improve system stability, and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June, and a number of exciting use cases have moved to production. CDH4.1 is an update that includes a number of fixes as well as several useful enhancements. Among them:
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), so it’s time to plan your schedule. You may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned.
| Session | Presenter(s) | Day | Time |
| --- | --- | --- | --- |
| An Introduction to Hadoop | Mark Fei | Tues., Oct. 23 | 9am |
| Using HBase | Amandeep Khurana, Matteo Bertozzi | Tues., Oct. 23 | 9am |
| Testing Hadoop Applications | Tom Wheeler | Tues., Oct. 23 | 9am |
| Building a Large-scale Data Collection System Using Flume NG | Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy | Tues., Oct. 23 | 1:30pm |
| Given Enough Monkeys – Some Thoughts on Randomness | Jesse Anderson | Tues., Oct. 23 | 3:20pm |
| Keynote: Big Answers | Mike Olson | Weds., Oct. 24 | 8:55am |
| Large Scale ETL with Hadoop | Eric Sammer | Weds., Oct. 24 | 11:40am |
| HDFS – What is New and Future | Todd Lipcon (co-presenter) | Weds., Oct. 24 | 4:10pm |
| High Availability for the HDFS NameNode: Phase 2 | Aaron Myers, Todd Lipcon | Weds., Oct. 24 | 5pm |
| Plenary Session: Beyond Batch | Doug Cutting | Thurs., Oct. 25 | 9:20am |
| Upcoming Enterprise Features in Apache HBase 0.96 | Jon Hsieh | Thurs., Oct. 25 | 11:40am |
| Data Science on Hadoop: What’s There and What’s Missing | Justin Erickson | Thurs., Oct. 25 | 1:40pm |
| Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing | Bala Venkatrao, Aparna Ramani (with others) | Thurs., Oct. 25 | 4:10pm |
| Knitting Boar | Josh Patterson, Michael Katzenellenbogen | Thurs., Oct. 25 | 4:10pm |
What do you do at Cloudera, and in which Apache project are you involved?
For the last year and a half, I’ve been an engineer on the Enterprise team. We’re the guys who build Cloudera Manager, and all the goodies that make it easy to manage and administer Apache Hadoop clusters. Specifically, I’ve worked on a number of things across the product, like scale and performance for the databases underlying the various monitoring tools available in the Enterprise edition of Cloudera Manager. I’ve also worked extensively on our operational reporting and HDFS file search capabilities. While I don’t work full-time on any of the Apache projects, I have been known to contribute to Apache Hive and Hadoop on rainy days.