More on Cloudera’s Distribution including Apache Hadoop 3

A week ago we announced two significant product updates: a substantial functional update (doubling the number of components) to Cloudera’s Distribution including Apache Hadoop (CDH) and the launch of Cloudera Enterprise. I want to delve a bit deeper into the first announcement, regarding Cloudera’s Distribution including Apache Hadoop version 3 (CDH3). This post kicks off a series that will go into progressively more detail about different aspects of CDH3.

Cloudera has been in the Hadoop business for nearly two years now, which doesn’t sound like a long time until you put it in context and realize Hadoop has only existed for twice that long. We did a tally recently and figured our team has more than 60 person-years of collective Hadoop experience. Coupled with more than 40 paying customers and tens of thousands of downloads of our distribution each month, that has given us a good vantage point from which to see how Hadoop gets used in the real world. What we learned in this time informed our latest update to CDH.

We saw: an ever-diversifying set of Hadoop applications.

Every week we see an expansion in the number and type of applications that are well suited to Hadoop: from simple query and analysis to machine learning to clickstream analysis to data transformation. Hadoop has already proven itself relevant in industries spanning high tech, financial services, web, telecommunications, manufacturing, pharmaceuticals, utilities and media. We feel we have only scratched the surface of what is possible.

We concluded: Hadoop is a multi-application platform for data-intensive applications.

More and more of Cloudera’s customers are moving from single-use-case deployments to running Hadoop as general infrastructure for multiple applications. Hadoop is not a tool or add-on so much as it is a platform unto itself.

We saw: a rapidly expanding ecosystem of Hadoop technologies and frameworks.

Those who are familiar with Hadoop are familiar with popular job authoring frameworks like Hive and Pig, or high-speed clients like HBase. In fact, such frameworks generate the overwhelming volume of Hadoop workloads in most production situations. These components also provide the most common interfaces by which other technologies integrate with Hadoop. For example, business intelligence tools interface with Hadoop via Hive drivers, and data warehouses and databases interface with Hadoop via the Sqoop framework.

We concluded: Hadoop “core” (MapReduce and HDFS) is the kernel of the Hadoop data platform.

Like any kernel, it has a central role and is responsible for the most vital functions of the platform. But users and technologies rarely interact directly with the kernel. It is too low-level for most users to have a productive experience with it, and it is too central to allow dozens of adjacent technologies to access it directly. The community has been fleshing out a Hadoop-based data management platform around that kernel, and this latest update to CDH formalizes what had already been the case.

We saw: piecing together the platform for each deployment was costly and distracting.

The Hadoop community is innovative, fast-moving and, by design, decentralized. These are all very positive traits, but one byproduct has been a significant amount of complexity imposed on organizations that want to use Hadoop. Every Hadoop component has its own release schedule, with some components releasing 5 times as often as others. Every component also has its own dependencies, which add up to a non-trivial set of installation and upgrade combinations when you are dealing with nearly a dozen components. Several speakers at the Hadoop Summit noted that there are often 3 of any one component. For example, there are three job authoring clients, three database integration frameworks, three streaming data collectors, three RPC frameworks, etc. This is well optimized for users who want to create technology but far from optimal for users who want to adopt technology. A kit car makes for a fun weekend project, but the next Monday, most people will go to work in a car they drove off a dealer’s lot, supported by an explicit warranty and maintained by a qualified mechanic.

We concluded: CDH3 could represent a step forward for organizations that want to harness the Hadoop platform.

By bringing together the Hadoop components already in prevalent use, and packaging and testing them in an integrated manner, we’re able to give users a platform that has the functionality to satisfy mainstream use cases and the interfaces to tie Hadoop into mainstream enterprise technologies. Most importantly, we’re able to take a huge amount of complexity out of the life of the typical Hadoop adopter.

This update does not in any way detract from Cloudera’s commitment to open source and an open platform. CDH3 is 100% Apache licensed, in beta, and available for download here. We hope you get the opportunity to try this latest version, and we look forward to your feedback.