In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.
What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.
My first project, analyzing adverse drug events, involved lots of Apache Pig programming. I liked Pig’s data flow programming model, but I didn’t enjoy writing user-defined functions: I had to step out of vim, switch to an IDE, read a lot of Javadoc about Pig’s internal data model, code for a while, compile something, switch back to vim, find bugs, go back to the IDE, and so on and so on. To this day, I am a big fan of Apache Hive and Pig right up to the point where I have to write a UDF, and then I die a little bit inside. (That said, the StreamingQuantile function that I adapted from Sawzall and contributed to DataFu is arguably the most useful thing I’ve done at Cloudera.)
My next project was in a field called reflection seismology, which is the art and science of how to determine where oil and natural gas are located underground. The problems involve transforming and processing time-series data, and I wanted to develop a way for geophysicists to execute MapReduce pipelines over this data using familiar tools and create something that was less like a language and more like an application.
I felt that all the pipeline development tools on the Hadoop stack were designed for people who thought about analyzing data using relational techniques, which weren’t really appropriate for time-series analysis. I personally find relational thinking to be limiting when it comes to large-scale data analysis and model building; I like to think in MapReduce and take advantage of its flexibility to help me solve problems more efficiently.
What I really wanted was a library that I had used at Google to develop MapReduce pipelines called FlumeJava, which is much closer to bare-metal MapReduce but supports common design patterns like joins and aggregations. Since I didn’t work at Google anymore, I set out to recreate enough of FlumeJava to help me build my time-series application. Much like Goldilocks, it took me three tries to get it right: The first attempt was too simplistic, the second one was over-engineered, and the third was good enough that I wasn’t completely embarrassed to release it publicly as Crunch (now an Apache project).
Aside from doing the initial commit, what is your definition of the project founder’s role across the lifespan of the project? Benevolent dictator, referee, silent partner?
I initially posted the code on GitHub, and for the first few months, I did most of the work on it with help from Tom White and Brock Noland of Cloudera. Those were the “benevolent dictator” days of moving fast and fixing things, and I really enjoyed them.
The thing that changed that for me was when Gabriel Reid and Christian Tzolov, who were both working at TomTom at the time, started using Crunch in their own projects and making major contributions. Reading through Gabriel’s first pull request, which fixed a subtle bug in the job planner, was one of the most sublime moments of my life. It’s such a wonderful feeling seeing someone you’ve never met and never talked to explore something you created, understand it, and then improve upon it. I realize that sounds corny, but I hope it’s the sort of thing that every project founder experiences, because it’s pretty awesome.
We continued to work together for a number of months, and were joined by some of the folks at WibiData, who did a lot of the work on the Scala code and Apache HBase support. I felt like the project had reached a point where the code was as much Gabriel’s and Christian’s and the Wibis’ as it was mine, and that they should have just as much say in the future of the project as I did. And that was when we got together and decided to take the project to the Apache Software Foundation (ASF).
Within the ASF, I think that the founder’s role is one of servant-leader, just as it would be for any other leader on the project. The VP of an Apache project isn’t a technical leadership role; it’s designed for someone who cares enough about the project to do a disproportionate amount of the tedious, bureaucratic work that has to get done in order to ensure that everyone else has the time and space to work on the pieces of the project that they enjoy most. My role model here is Bryan Duxbury, who was VP of Apache Thrift when we were first bringing Crunch to the Apache Incubator.
What has surprised you the most about how your project has evolved/matured? What is the most important work TBD?
Before I started working on Crunch, I came up with this theory about the best way to develop software. My thought was that the person who wanted to use the software, the eventual end-user, should create a minimally functional version of whatever they wanted to solve their problem. Then, this user should find a bunch of really good engineers and demo the software for them, to show them how it solves the problem. Finally, the user should show these really good engineers the ugly, poorly commented, and obviously sub-optimal source code that the user wrote in order to solve the problem. My hypothesis was that the really good engineers would be so offended by the ugly, poorly commented, and obviously sub-optimal source code that they would immediately set to work fixing it and making it awesome.
The biggest surprise for me was that this theory turned out to actually kind of work, at least in the case of Crunch.
As far as what’s TBD, I think the most interesting new frontier to explore is finding the right metaphors to extend Crunch’s abstractions to streaming data systems like Apache Storm (newly incubating) or Apache Samza (also newly incubating), which would make it easier to implement systems based on the lambda architecture. I have some ideas around the best way to do this, and have been trying to convince some really good engineers to implement them for me, but I suspect it would be better for me to whip up another ugly, poorly commented, and obviously sub-optimal prototype.
What is your philosophy, if you have one, for balancing quality versus quantity with respect to contributions?
I am strongly biased in favor of quantity. Anyone who makes the investment in time and has the courage to submit a patch is a hero in my book, and I always try to accept it or work with them to find an even better way to solve the use case that their patch is trying to solve.
Do you have any other advice for potential project founders?
The key to a successful open source project is recruiting developers, and I think there are two very basic things you should do to help you succeed.
First, you should be the end-user of the software you’re creating: build something for yourself and people like you. You probably know other people like you, and if you build something for yourself, you’re building it for them as well.
Second, be humble: You aren’t the best software engineer in the world, and acting like it isn’t going to make people want to work with you. If you’re not a naturally humble person, self-identify as something other than a software engineer (a data scientist, for example), so that your ego can take it when other people make really obvious improvements to your code.