Some News Related to the Apache Hadoop Project

Yahoo! recently announced on its blog that it plans to stop distributing its own version of Hadoop and instead refocus on improving Apache’s Hadoop releases. This is great news. Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others, features that may not yet be collectively available in a single Apache release. Different teams working on enhancements have made their changes on distinct branches off old releases. Collecting that work into a single source code package and building a system with the best quality and feature set has been hard work.

New users of Hadoop have generally found this assembly work to be too much trouble. To solve that problem, Cloudera currently distributes a patched version of Apache Hadoop, assembling work from Yahoo!, Cloudera, Facebook and others that has been committed to the Apache project, but not necessarily collectively available in one Apache release.

The Apache Hadoop project contains MapReduce, HDFS and Common. Cloudera packages these along with a number of complementary open-source projects (Apache HBase, Apache Pig, Apache Hive, Apache ZooKeeper, Oozie, Flume, Hue, and others) that provide useful services for data management, access and use. Right now, the Apache Hadoop packages themselves (HDFS, MapReduce and Common) are the only packages that we have to ship with a large collection of patches.

You can think of Apache Hadoop as similar to the Linux kernel: the heart of a larger system. In that analogy, Cloudera acts like Red Hat or Canonical, providing a complete platform that includes both the kernel and the most popular higher-level packages. We assemble and test the combined components, package them for easy installation, certify the integration of complementary systems and provide a predictable release schedule so users can plan upgrades and updates. Cloudera’s Distribution for Apache Hadoop is this larger package. It exists to make the power of Hadoop easily available to a larger audience of users.

We thank Yahoo! for its renewed efforts to make Apache Hadoop releases the very best versions of Hadoop. A more robust and powerful kernel makes the entire ecosystem stronger. One of the strengths of the Apache Hadoop ecosystem has been the collective contributions of many organizations and individuals, which have added up to hundreds of person-years of engineering investment. That investment dwarfs what any single organization or proprietary vendor could muster, and it explains the strength and sophistication of the overall system.

Yahoo!’s commitment to open source development of Hadoop dates to the creation of the project. By concentrating its efforts on the Apache repository, Yahoo! makes a meaningful contribution to everyone in the Apache Hadoop community. We very much hope that the larger Hadoop community will continue in the same way, working together to create excellent Apache releases that everyone can use. Certainly, Cloudera and our customers will benefit from high-quality releases from Apache that require minimal patching for production deployment. We believe that everyone else will, too.

1 Response
  • Praveen / May 23, 2011 / 2:41 AM

    How does this work in other open source software like Linux? There, too, many organizations might be contributing patches to different branches that might not be available in a single release of Linux. How is this handled? Do companies like Red Hat and Canonical also collect all the patches from different branches into a single release, similar to Cloudera? Considering the size of Linux, the consolidation task is even more difficult.

    Thanks,
    Praveen
