Cloudera Engineering Blog · HDFS Posts
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
At Cloudera, we put great pride into drinking our own champagne. That pride extends to our support team, in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.
Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.
Limitations of NameNode HA in Previous Versions
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
What do you do at Cloudera, and in which Apache project are you involved?
This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).
Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
As I mentioned in my inaugural post last week, it’s important to shine a spotlight on the Cloudera engineers who have a hand in making the Hadoop projects run. It’s an obvious point, and yet an overlooked one, that a community is an aggregation of individual personalities who have diverse backgrounds and interests yet a shared passion for the group and its goals. As Jono Bacon puts it in his seminal 2009 book The Art of Community, “The building blocks of a community are its teams, and the material that makes these blocks are people.”
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.