Meet the Project Founder: Doug Cutting (First in a Series)

At Cloudera, there is a long and proud tradition of employees creating new open source projects intended to help fill gaps in platform functionality (in addition to hiring new employees who have done so in the past). In fact, more than a dozen ecosystem projects, including Apache Hadoop itself, were founded by Clouderans, more than can be attributed to employees of any other single company. Cloudera was also the first vendor to ship most of those projects as enterprise-ready bits inside its platform.

We thought you might be interested in meeting some of these founders over the next few months, in a new “Meet the Project Founder” series. It’s only appropriate that we begin with Doug Cutting himself – Cloudera’s chief architect and the quadruple-threat founder of Apache Lucene, Apache Nutch, Apache Hadoop, and Apache Avro.

What led you to your project idea(s)?

I wrote Lucene initially in 1997 when I had an idea for a different way to implement a text indexing and search library. (I’d written three of them previously.) My day job didn’t need it, so I wrote it on my own time and open-sourced it a few years later.

In 2002, I started Nutch at the instigation of Overture, which thought that an open source web search engine would be good for the world. It funded me part-time to start that project and gave me little other direction, which was awesome!

In 2006, I formed Hadoop by pulling the MapReduce and distributed filesystem code out of Nutch at the request of Yahoo!, which wanted to enhance the distributed computing framework, but already had its own web crawler and search systems.

In 2009, I created the Avro data serialization framework at the suggestion of Raymie Stata, then CTO of Yahoo!, to provide the “glue” that could connect efforts across different parts of Yahoo!. Today, of course, it’s a component in Cloudera’s Distribution for Apache Hadoop (CDH).

Aside from doing the initial commit, what is your definition of the project founder’s role across the lifespan of the project — benevolent dictator, referee, silent partner?

A founder should be like an old man hanging around – but hopefully more wise than cranky. 

Since founders have been there from the start, they understand the motivations underlying the code. When someone proposes a change, a founder can often better see potential avenues for the project that are opened or closed by that change. But founders are rarely the future of open source projects; rather, recruiting new contributors is the lifeblood of those projects. Contributions by founders typically decrease over time, so a founder needs to gently remind the new kids about the project’s past and help them make the right choices for its future.

What has surprised you the most about how your projects have evolved/matured?

My biggest surprise about open source software generally is just how many folks use it. When you create proprietary software, you have to work hard to get each customer. But with open source, folks just start using it. More than 90% of its users are probably people you will never hear from and who never get involved in the project.

Some contributors resent such users since they’re not giving back. But if you demand something in return, then you shouldn’t be contributing to open source — it doesn’t make sense to be selfish about something you’re giving away. Furthermore, “silent” users are also a mark of a project’s success. If folks are able to download the software and use it without reporting bugs or submitting patches, that means the code works and the documentation is sufficient.

That said, you need some people to get involved to create a community that develops the code. That’s usually not too hard if your software is useful.

What is the major work yet to be done, from your perspective as a project founder?

The direction an Apache project takes is determined by those who contribute. As one developer, my ability to determine the future of these projects is thus quite limited.

That said, there are areas where I hope the projects will grow. For Hadoop, I hope it will gain fine-grained scheduling, so that batch and interactive loads can more efficiently share resources. For Avro, I hope it will better integrate with high-level tools, so that folks can peek at Avro data files as naturally as they can text files.

What is your philosophy, if you have one, for balancing quality versus quantity with respect to contributions?

You need enough sustained contribution from someone to get a sense of how they work. If I’ve seen around five patches get committed in a few months without a lot of fuss, I usually feel someone is ready to become a committer.

With patches, folks demonstrate not only their ability but also their self-knowledge and judgment. If someone new to the community proposes big, fundamental changes that are not well thought out, that shows poor judgment. If they try to do something that’s beyond their level of competence, it shows poor self-knowledge.

On the other hand, if they provide well-considered changes in areas they clearly understand, they’ve proven to be someone who will probably be collaborative. When they’ve repeated that process a few times, the pattern is clear.

Some people confuse patch quality with patch depth. If someone makes five high-quality improvements to a trivial part of the system, they deserve to be a committer every bit as much as someone who makes five high-quality contributions to its kernel. What matters is that they know their limitations and are able to peacefully collaborate.

Do you have any other advice for potential project founders?

Write something that solves a problem well enough for folks to start using it. It doesn’t need to be fully optimized but it needs to be fast enough to be useful. It doesn’t need to integrate with every other system in the world, but it does need to integrate enough so that some folks can try it out.

Make it easy for people to get started. APIs should be simple and intuitively named. Documentation and examples should be sufficient so that one can get started in minutes.

You also need to recruit new contributors and users. Users who get helpful responses and contributors who get constructive feedback will hang around and get more involved in the project. If you act like you don’t want their input, then you won’t get their help.

Read other “Meet the Project Founders” installments:

–  Roman Shaposhnik (Apache Bigtop)