Cloudera Developer Blog · HDFS Posts
With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “What does it take to migrate?”
If you’re not familiar with CDH3b2, here’s what you need to know.
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s Distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3, we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.
A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.
While the vast majority of the Hadoop development discussion takes place on the Apache Jira and various project mailing lists, it’s often useful to meet face to face for high-bandwidth discussion. To that end, Facebook hosted the first Apache Hadoop contributors meeting yesterday at their campus in Palo Alto. Cloudera, Facebook, Yahoo! and the Apache HBase team were well represented. It was great to see a broad cross section of Hadoop developers in one room. Contributor meetings will be held on a monthly basis, at a rotating location. While any Hadoop project contributor is welcome to attend, the current focus of the meetings is HDFS and MapReduce. The goal of the discussion is to surface and flesh out ideas rather than to make decisions; decisions are made on the development lists. If you’ve got ideas to add, check out the meeting notes and continue the discussion.
Sanjay Radia kicked off the meeting with a discussion of development priorities. Hadoop has become a platform and industry standard for data storage and analytics. What advances are most important to users? How do we continue to innovate without disrupting the installed base? Development must maintain and improve the quality that has allowed companies to adopt Hadoop in their production environments. Fortunately there is broad agreement among contributors on development priorities: availability, compatibility, security, scalability and performance.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. As it happens, it has been about 3 weeks since we announced the first testing release, so it must be time for a new one. Here are some of the highlights:
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep their NameNode up at ContextWeb. – Christophe
Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?
The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.
There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS support for file appends.
Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.
Administrators of HDFS clusters understand that the HDFS metadata is some of the most precious bits they have. While you might have hundreds of terabytes of information stored in HDFS, the NameNode’s metadata is the key that allows this information, spread across several million “blocks”, to be reassembled into coherent, ordered files.
The techniques to preserve HDFS NameNode metadata are well established. You should store several copies across many separate local hard drives, as well as at least one remote hard drive mounted via NFS. (To do this, list multiple directories, on separate mount points, in your dfs.name.dir configuration variable.) You should also run the SecondaryNameNode on a separate machine, which will result in further off-machine backups of “checkpointed” HDFS state made on an hourly basis.
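As a rough illustration of the dfs.name.dir advice above, here is a minimal Python sketch that renders the property as it might appear in the HDFS configuration file. The three directory paths (two local disks and one NFS mount) and the helper name are hypothetical, chosen only for the example; the property name and its comma-separated value format are what the configuration expects.

```python
import xml.etree.ElementTree as ET


def name_dir_property(directories):
    """Build a <property> element that sets dfs.name.dir to the given directories.

    The NameNode writes its metadata to every directory in the
    comma-separated list, so each entry should sit on a separate mount point.
    """
    if len(directories) < 2:
        raise ValueError("list at least two directories so the metadata is duplicated")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "dfs.name.dir"
    ET.SubElement(prop, "value").text = ",".join(directories)
    return prop


# Hypothetical layout: two local disks plus one NFS-mounted remote disk.
NAME_DIRS = ["/data/1/dfs/nn", "/data/2/dfs/nn", "/mnt/nfs/dfs/nn"]

snippet = ET.tostring(name_dir_property(NAME_DIRS), encoding="unicode")
print(snippet)
```

The printed element would be pasted inside the `<configuration>` block of your site configuration file. Which file that is depends on your Hadoop version, so treat the exact location as something to confirm against your own installation.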