What is your definition of a “data scientist”?
To put it in boring terms, data scientists are people who find that the bulk of the work for testing their hypotheses lies in manipulating quantities of information – AKA, “San Franciscans with calculators.”
What parts of the field are you particularly excited about?
The proliferation (or “Cambrian Explosion”, as Josh Wills would say) of tools and frameworks for munging and analyzing massive data. A few years ago, we really only had MapReduce for dealing with data at a large scale. Since then, a bunch of players have written engines that can handle more of the data transformation patterns required for complex analytics. I think eventually the market will need to standardize on two or three, and I’m curious to see how it plays out.
When it comes to datasets, I’m always trying to find time to play around with traffic data. Caltrans releases the measurements it collects every 30 seconds from loop detectors on freeways all over California, so that’s my go-to whenever I want to try out a new tool or technique.
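As a minimal sketch of the kind of munging involved (the column names and values here are hypothetical, not the actual Caltrans schema), 30-second loop-detector readings might be aggregated to an average flow per station like this:

```python
import csv
import io
from collections import defaultdict

# Hypothetical 30-second loop-detector readings: station id, timestamp,
# and per-interval vehicle count. The real Caltrans feed has a different
# schema; this is only an illustration.
sample = """station,timestamp,flow
401,2014-06-30T08:00:00,12
401,2014-06-30T08:00:30,15
402,2014-06-30T08:00:00,7
402,2014-06-30T08:00:30,9
"""

def mean_flow_by_station(rows):
    """Average the per-interval vehicle counts for each station."""
    totals = defaultdict(lambda: [0, 0])  # station -> [sum, count]
    for row in rows:
        acc = totals[row["station"]]
        acc[0] += int(row["flow"])
        acc[1] += 1
    return {station: s / n for station, (s, n) in totals.items()}

rows = csv.DictReader(io.StringIO(sample))
print(mean_flow_by_station(rows))  # {'401': 13.5, '402': 8.0}
```

The same group-and-average pattern scales up directly to a Spark job once the data no longer fits on one machine.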
You previously worked as a software engineer (and are an Apache Hadoop committer). How are the two roles different?
At Cloudera, “data scientist” is a much more external-facing role. We don’t have massive data problems of our own, but our customers have a huge variety. I get to spend time understanding these problems, helping out with them, and thinking about how to make our platform a good place to solve them.
That said, my job is still very heavy on the engineering. I spend half of my time contributing to Apache Spark and some more of it writing code for customer engagements. It’s hard to get away from distributed systems over here.
Josh Wills says data scientists would be more accurately called “data janitors”. Do you agree?
At this point, thanks to the insistence of Josh and others, I think the statement that data science boils down to a bunch of data munging is mostly an accepted truism. But this fact is probably just a stronger endorsement of the term. Anybody who’s done “real” science knows that it’s not exactly about flying around, splashing nature with elegant statistical sprinkles, and “deriving insights”. (Not that I’ve done any real science.) If data science is 20% statistics and 80% engineering, and science is 1% inspiration and 99% perspiration, I’m just happy I get to save on my deodorant bill.
You’re presenting at Spark Summit 2014. What are your topics, and why should people attend?
I’ll be giving a couple of talks. One is about Spark-on-YARN: a deep dive into how the Spark and YARN resource management models intersect, along with some practical advice on how to operate them on your own cluster. The other is about Oryx, MLlib, and machine learning infrastructure. I’ll be talking about what it means to put a machine learning model into production, and about our plan to base Oryx, an open source project that combines model serving and model building, on Spark’s streaming and ML libraries.
What is your advice for aspiring data scientists?
This might apply to any job, but I think a good data scientist needs to work on being a good translator and a good teacher. Besides memorizing obscure R syntax and trying to convince their superiors that implementing a Dynamic Bayesian net is a good idea, a data scientist spends most of their time striving to understand problems in the language of the people who have them. For a statistical model or data pipeline to be worth anything, its creator needs to be able to articulate its limitations and the problem it solves.