Why we build our platform on HDFS

Categories: General

It’s not often that I have a chance to concur with my colleague E14 over at Hortonworks, but his recent blog post gave me the perfect opportunity.  I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS.  It actually missed at least 4 others.  For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi).  Appistry continues to market its HDFS alternative.  I’m not sure whether it has shipped yet, but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well.  HP Ibrix has also supported the HDFS API for some years now.

The GigaOm article implies that the presence of a dozen vendors promoting alternatives must speak to some deficiency in HDFS, for what else would motivate so many offerings?  This draws exactly the wrong conclusion.  I would ask this instead:

What can we conclude from the fact that there are:

  • Twelve different filesystems promoting themselves as HDFS alternatives,
  • most of the twelve are 6-14 years older than HDFS,
  • yet HDFS today stores overwhelmingly more enterprise data – hundreds of petabytes industry-wide – than any alternative,
  • and HDFS has the broadest base of large vendor support (Cisco, Dell, HP, IBM, NetApp, Oracle, SAP, SGI, SuperMicro)?

We (Cloudera) conclude that HDFS is in the process of overrunning these legacy filesystems as the industry standard for data management at scale.

In fact we have seen this story before.  If we go back 20 years we can recall a similar situation.  In that market:

  • There were more than a dozen alternatives.  They went by names like AIX, HP-UX, Solaris, Sequent, Darwin, BSD, SCO and UnixWare.
  • Every alternative had long ago reached feature saturation, a reality that enterprise marketers labored to conceal.  Trivia question: does anyone remember the functional difference between SCO and OpenBSD?
  • They were often tightly coupled to expensive proprietary hardware.
  • Their fragmentation made it a nightmare for application developers and hardware manufacturers to target a broad swath of the market with one R&D cycle.

This environment was the tinderbox waiting for the Linux fire.  As Linux grew in maturity and popularity, many Unix vendors tried to fight the trend.  Many marketing and PR dollars were spent to create fear, uncertainty and doubt about this newcomer of an operating system.  But this was futile, and over time Linux has gone on to have an outsized impact on the IT industry.  It has led to lower hardware costs by creating a level playing field for all hardware vendors.  It has led to broader application adoption thanks to less platform fragmentation.  It has also led to a system of shared industry R&D in which software, hardware and device manufacturers contribute back to Linux to ensure compatibility.

HDFS is poised to play this role in a market where customers are likewise tired of fragmentation, excessive margins and breathless marketing of marginal features of dubious utility.  Today proprietary Unix operating systems are still in widespread use.  No doubt the same will hold true for proprietary filesystems.  Old products never die; they just become less relevant.

Eric highlighted HDFS’s economics, data processing bandwidth and reliability.  On a functional level I’ll add that it has excellent security, resiliency and high availability (that’s right folks, drop the SPOF claims, you can download CDH4 here!).  Perhaps more important than features, for enterprise customers HDFS offers:

  • Choice – Customers get to work with any leading hardware vendor and let the best price/performance win the decision, not whatever the vendor decided to bundle in.
  • Portability – Customers running Hadoop distributions based on HDFS can move between those distributions without having to reformat the cluster or copy massive amounts of data.  When you’re talking about petabytes of data, this kind of portability is vital.  Without it, your vendor has incredible leverage when it comes time to negotiate the next purchase.
  • Shared industry R&D – We at Cloudera are proud of our employees’ own contributions to HDFS, and they collaborate with their colleagues at Hortonworks.  But today you will find that IBM, Microsoft and VMware are also contributing to HDFS to make it work better with their products.  In the future I predict you’ll find hard drive, networking and server manufacturers contributing patches to HDFS as well, to ensure their technologies run optimally with it.

It’s rare that you get to see history repeat itself as completely as it is doing with HDFS.  Today HDFS may not be the best filesystem for content-addressable storage or nearline archive.  But then 15 years ago who would have thought Linux would find its way into laptops, routers, mobile phones and airport kiosks?

Linux drew us the map.  The smart money is already following it.
