To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”
Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.
It seems like all I need is an analytics platform like Apache Hadoop and a big pile of data, and actionable insights will just leap out at me, right? Well… not quite. Hadoop makes the difficult easy and the impossible merely difficult. However, we still have to know what we’re looking for and, once we’ve found it, understand what the results mean.
The volume, velocity, and variety of Big Data make it hard to know where to focus and even harder to represent insights in a way that is consumable without sacrificing detail. Finding meaningful patterns and converting them into actionable insights requires plenty of computers, sophisticated software, and experts who can use these tools to coax answers from all our information. That is the realm of data science.
Data Science Defined
Like other scientists, a data scientist produces a hypothesis, runs an experiment, and looks at the results to determine whether the hypothesis holds true. In the Big Data space, though, the underlying processes are not quite so straightforward for three main reasons:
- Gathering enough perspective on a massive data set to generate a hypothesis can be a significant endeavor itself.
- Data science is most often analytical, not experimental: the data has already been gathered, as the very first step, before any hypothesis exists. That makes a controlled experiment impossible. Instead, data scientists have to do a form of experimental reverse engineering through careful modeling.
- The real work only begins after a data scientist has proven a hypothesis and discovered a useful pattern in the data. The true challenge lies in turning that pattern into a data product that can be used to analyze new data or perform ongoing predictive analysis.
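As a rough illustration of testing a hypothesis against data that has already been gathered, here is a minimal sketch of a permutation test in Python. The session-length figures, the group names, and the `permutation_test` helper are all hypothetical, invented for this example. Because the data is observational, a small p-value suggests a pattern worth modeling further; it does not establish cause the way a controlled experiment would.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return observed, hits / n_permutations


# Hypothetical, already-collected data: session lengths in minutes.
returning_visitors = [12.1, 9.8, 14.3, 11.0, 13.5, 10.9, 12.7, 15.2]
new_visitors = [8.4, 7.9, 10.1, 9.3, 6.8, 8.8, 9.9, 7.5]

# Hypothesis: returning visitors have longer sessions than new visitors.
observed, p_value = permutation_test(returning_visitors, new_visitors)
print(f"observed difference in means: {observed:.2f} minutes")
print(f"estimated p-value: {p_value:.4f}")
```

On a real Big Data set the same idea would run over samples or aggregates rather than small in-memory lists, and it is only the first of the three steps above: the pattern still has to be turned into a data product.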
To be successful, an aspiring data scientist needs a highly sought but difficult-to-attain combination of skills: statistics, programming, machine learning, and multiple technologies (such as Hadoop, R, and visualization tools). Moreover, the best data scientists distinguish themselves and create value for their companies by applying softer skills like domain expertise (life sciences, behavior classification, climate science), storytelling, and personal qualities like curiosity, resourcefulness, persistence, and mental dexterity. It’s a lot to ask for, and that’s why the likes of the McKinsey Global Institute, Harvard Business Review, and Gartner Group project a shortage in the hundreds of thousands of individuals with data science skills over the next few years.
Signal to Noise and Wheat from Chaff
Further complicating the supply/demand imbalance for data scientists is the absence of professional accreditations that verify a data scientist's capabilities. A handful of universities have begun to offer degrees in advanced analytics and data science, but these programs are works-in-progress, have yet to graduate substantial numbers, and do not certify the mix of skills and experience required of a working data scientist beyond the classroom. There is no “International Board of Data Science” or “Data Science Institute,” and the vast majority of managers responsible for hiring data scientists have no data science experience themselves, so a résumé and interview alone will prove little. The dual problems of talent gap and talent non-verifiability will only become more pronounced as smaller businesses begin to accumulate Big Data and seek firepower in building sophisticated tools for it.
One part of the solution is a formalized data science curriculum built by actual data scientists. Cloudera offers an excellent three-day Introduction to Data Science course that teaches the fundamentals and trains participants to build their own recommender systems based on insights from data science stars like Jeff Hammerbacher and Josh Wills. Another part of the solution is public data science competitions, through which individuals build experience and demonstrate their chops in a realistic setting.
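For readers who have not seen a recommender system, the following is a minimal sketch of user-based collaborative filtering in plain Python. It is not the course's material or any Cloudera implementation; the ratings, user names, item names, and the `cosine` and `recommend` helpers are all hypothetical. The sketch assumes explicit 1-to-5 ratings and measures similarity only over items two users have both rated.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"hadoop_book": 5, "r_book": 3, "viz_book": 4},
    "ben": {"hadoop_book": 4, "r_book": 2, "viz_book": 5, "stats_book": 4},
    "cara": {"hadoop_book": 1, "r_book": 5, "stats_book": 2},
    "dev": {"r_book": 4, "viz_book": 2, "stats_book": 1},
}


def cosine(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by the similarity-weighted
    average of the ratings other users gave them."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, round(score, 2)) for score, item in ranked[:top_n]]


recommendations = recommend("ana", ratings)
print(recommendations)
```

A production recommender would need to handle sparse data, rating bias, and scale, which is where platforms like Hadoop come in; the sketch only shows the core idea of borrowing opinions from similar users.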
A Challenge to Shape the Industry
But how much education and practice is enough when it comes to a job whose starting salary is regularly reported around $300,000 per year? This is where a formal industry certification would be most valuable, giving businesses a known yardstick by which to measure practitioners of the trade. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address this gap. We recently introduced a two-part Cloudera Certified Professional: Data Scientist (CCP:DS) program, consisting of a traditional multiple-choice exam and a Web Analytics Challenge, focusing on classification, clustering, and collaborative filtering. The challenge runs for three months and will be followed by a new challenge one quarter later. Beyond the shadow of a doubt, the participants who successfully complete any CCP:DS Challenge will be verifiably among the world’s most employable (and extremely sexy) data scientists.
The first CCP:DS Challenge is open to participants until the end of September 2013, following which we will offer a variety of learning tools to understand the solutions and prepare for future challenges and exams. We’re even planning a Data Scientist Takeover of the Cloudera booth at Strata + Hadoop World 2013 on the evening of Mon., Oct. 28, to celebrate the inaugural CCP:DS Challenge, so please plan to join us in New York City for drinks, announcements, and a toast by Josh Wills.
Sarah Sproehnle is a vice president at Cloudera, responsible for training and certification programs.