The Community Effect
Owen O’Malley recently collected and analyzed information from the Apache Hadoop project’s commit logs and its JIRA repository. That data describes the history of development for Hadoop and the contributions of the individuals who have worked on it.
In the wake of his analysis, Owen wrote a blog post called The Yahoo! Effect. In it, he highlighted the huge amount of work that has gone into Hadoop since the project’s inception, and showed clearly how an early commitment to the project by Yahoo! had contributed to the growth of the platform and of the community.
The crux of Owen’s analysis is captured in Figure 1, which comes from his post:
Figure 1: Cumulative JIRA tickets closed for HADOOP, HDFS and MAPREDUCE
The horizontal axis is a little bit off — Doug Cutting read the MapReduce paper from Google in 2004. Beginning that year, organizations including Overture, the Internet Archive and Yahoo! paid Doug as an independent contractor to add MapReduce and distributed storage to the Apache Nutch web crawler. It is fair to draw an arbitrary line, though: Yahoo! hired Doug as a full-time employee in early 2006 expressly to work on Hadoop, and assembled a team to collaborate with him on the project. That early investment was critical to building the platform that’s become the system of choice for analytical data management.
With no disrespect to Yahoo!, however, the monolithic wall of green in Figure 1 tells a misleading story about the past, present and future of Apache Hadoop.
It’s absolutely correct to note that Yahoo! covered the salaries of contributors in the early years. Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.
Figure 2: Cumulative patches contributed to core Hadoop: community members by current employer
Figure 2 tells an encouraging story about the current state of Hadoop. Healthy open source projects bring together a diverse group of developers, looking at different problems and concentrating on a wide range of new features. Those developers create a community that collaborates to produce great software. The workloads that interest Microsoft, for example, will be different from those that interest Facebook. The fact that developers can draw on such a wide range of current requirements, and can make Hadoop better broadly — not for a single company, but for an entire industry — is critical. No single vendor can keep up with the large developer base and the broad adoption of a healthy open source project.
That fact is critical. It separates Hadoop and Linux — projects with contributors from across the industry and around the world — from single-vendor open source projects like MySQL, JBoss or Berkeley DB (the open source embedded database that my last company built and distributed). Each of these last three was the wholly owned property of a single vendor. Each had a robust user community, who downloaded and used the software under an open source license. None, however, had a meaningful developer community creating new features and contributing them to the project. The reason is simple: When a company owns a project, a developer is forced to donate the future commercial value of his or her work to the company in order to make a contribution. Unless you’re an employee of the company, that’s a huge disincentive to participation.
Community projects like Hadoop, by contrast, have no single corporate owner. Individuals and companies are willing to share the cost of developing the software, since they can share in the commercial benefit that the project creates. The talent pool that contributes to Hadoop is both larger and deeper than any single organization could provide.
There’s another important property of robust open source: It spawns complementary projects. In the early days, if you wanted to use Hadoop, you loaded data into the system by hand and coded up Java routines to run in the MapReduce framework. The broad community recognized these problems and invented new projects to address them — Apache Hive and Apache Pig for queries, Apache Flume and Apache Sqoop (both incubating) for data loading, Apache HBase for high-performance record storage and more.
The contributions that Yahoo! made, as shown in Figure 1, represent a legacy of work on the core Apache Hadoop project, but not on the broader ecosystem of projects. That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That’s not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.
Figure 3 shows new patches committed to the Core Hadoop project specifically, as a percentage of the contributions to the entire Hadoop ecosystem. Early on, there was a single project, and it was the locus of all the work. Over time, the balance has shifted:
Figure 3: Contributions to HADOOP, HDFS and MAPREDUCE as a percentage of total ecosystem contributions
Of course, this work is also done by a diverse group of engineers working for different companies around the world. We can break down total lifetime contributions to the entire ecosystem by current employer:
Figure 4: Lifetime patches contributed for all Hadoop-related projects: community members by current employer
Figure 4 shares some shortcomings with Figure 1. It describes cumulative historical work, not necessarily recent or future contributions. People, not companies, do the work in open source. Over time those people move to new places to take on new challenges. They carry their expertise with them. Over time, too, individual companies increase or decrease their investment in open source projects as their requirements and business interests change.
How are companies in the industry participating in sponsoring the development of the Hadoop ecosystem today? Figure 5 provides a snapshot of new development that’s happened so far in 2011, by the current employer of the developer doing the work:
Figure 5: 2011 patches contributed to Hadoop and ecosystem projects: community member by current employer
Clearly, the pace of innovation, and the breadth and depth of expertise that have grown up around Hadoop, are excellent news for the core project. No one company sponsors more than a quarter of the new innovation in the Hadoop ecosystem, and nearly half of all new patches are sponsored by a long tail of corporate benefactors and freelancers. In fact, I expect this picture to get more interesting over time. Just since the beginning of 2011, established companies like IBM, EMC, Informatica, Oracle and Dell have announced plans to invest in the Hadoop ecosystem in various ways. A great deal of new money and talent will be directed toward Hadoop in the coming years.
The community owes a deep debt of gratitude to Yahoo! for its early investment in Apache Hadoop. Certainly Cloudera does. Our employees and our customers benefit every day from Yahoo!’s decision to fund Doug’s early work and from its ongoing contributions to the platform. It is critical to remember, though, that the Hadoop community is much bigger than any single company.
Bill Joy, founder of Sun Microsystems, has famously said, “Wherever you work, most of the smart people are somewhere else.” The genius of community-based open source is its ability to harness the insight, energy and enthusiasm of smart people across borders and company boundaries.
As Cloudera’s CEO, I’m proud that we participate in a meaningful way in the work that the Apache Software Foundation oversees. The developers on my payroll do tremendous work on Hadoop and kindred projects. I am equally grateful, though, to the engineers at Hortonworks, Yahoo!, Facebook, StumbleUpon, Twitter, LinkedIn and the long list of other organizations that contribute to the ecosystem. It is a remarkable group and a vibrant community.
We collected data for this post from https://issues.apache.org/jira/secure/IssueNavigator.jspa. Figures 2 through 5 depict patches committed as indicated by JIRAs in status “closed” or “resolved” as of the dates shown in the graph. All dates are based on the date when the patch was committed, not contributed. We chose to use commit dates because patches are often changed several times before they are eventually committed. The use of “date committed” creates spikes in Figure 2, due to unusual JIRA handling at the time of the Hadoop project split. The actual JIRA count is accurate, but the high commit activity is an artifact of that split.
Associations between contributors and current employer are based on direct knowledge of the contributor, or on data from LinkedIn or elsewhere on the web. Contributors whose employers could not be determined are grouped under “everyone else.” Charts that refer to “Core Hadoop” are referring to patches committed to MapReduce, HDFS or Hadoop Common. Charts that refer to the Hadoop Ecosystem refer to patches committed to Core Hadoop as well as Pig, Hive, HBase, Whirr, HCatalog, Zookeeper, Mahout, Avro, Sqoop, Flume, Oozie, Bigtop and Hue.
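For readers who want to reproduce this kind of analysis, the aggregation described above can be sketched in a few lines. This is a minimal illustration only — the function name, the data, and the contributor-to-employer map are hypothetical, not the actual scripts used for the figures:

```python
from collections import Counter

def patches_by_employer(patches, employer_of):
    """Count committed patches per current employer.

    patches: iterable of (project, contributor) pairs for committed JIRAs.
    employer_of: map from contributor to current employer; contributors
    whose employer is unknown are grouped under "everyone else".
    """
    counts = Counter()
    for project, contributor in patches:
        counts[employer_of.get(contributor, "everyone else")] += 1
    return counts

# Made-up example data:
patches = [
    ("HDFS", "alice"), ("MAPREDUCE", "alice"),
    ("HBASE", "bob"), ("PIG", "carol"),
]
employers = {"alice": "Yahoo!", "bob": "Cloudera"}
print(patches_by_employer(patches, employers))
# Counter({'Yahoo!': 2, 'Cloudera': 1, 'everyone else': 1})
```

The same grouping, bucketed by commit date rather than summed over all time, yields the cumulative and per-year views shown in Figures 2, 4 and 5.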