Cloudera Developer Blog · Data Science Posts
[Updated Nov. 26, 2012: Sorry, this event has reached capacity and is now closed.]
Please join us in New York on Nov. 29, 2012, for a unique opportunity to hear from industry icons Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah) and Josh Wills (@josh_wills) as they discuss their approach to Data Science and how it transformed business for companies like Facebook, Yahoo! and Google. You will also hear more about Cloudera Enterprise: The Platform for Big Data powered by Cloudera Impala, which takes Hadoop “beyond batch” and into the world of real-time interactivity.
Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.
Why is Cloudera offering data science training?
The primary bottleneck to the success of Hadoop is the number of people who can use it effectively to solve business problems. Addressing that bottleneck through training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.
We at Cloudera are tremendously excited by the power of data to effect large-scale change in the healthcare industry. Many of the projects that our data science team worked on in the past year originated as data-intensive problems in healthcare, such as analyzing adverse drug events and constructing case-control studies. Last summer, we announced that our Chief Scientist Jeff Hammerbacher would be collaborating with the Mt. Sinai School of Medicine to leverage large-scale data analysis with Apache Hadoop for the treatment and prevention of disease. And next week, it will be my great pleasure to host a panel of data scientists and researchers at the Strata Rx Conference (register with discount code SHARON for 25% off) to discuss the meaningful use of natural language processing in clinical care.
Of course, the cost-effective storage and analysis of massive quantities of text is one of Hadoop’s strengths, and Jimmy Lin’s book on text processing is an excellent way to learn how to think in MapReduce. But a close study of how the applications of natural language processing technology in healthcare have evolved over the last few years is instructive for anyone who wants to understand how to use data science in order to tackle seemingly intractable problems.
Lesson 1: Choose the Right Problem
You may have noticed that Harvard Business Review is calling data science “the sexiest job of the 21st century.” So our answer to the question is: Hot. Definitely hot.
If you need an explanation, watch the “Definition of a Data Scientist” talk embedded below from Cloudera data science director Josh Wills, which Cloudera partner Lilien LLC recently hosted in Portland, Ore. The key takeaway: you don’t literally have to be a “scientist,” just someone with the curiosity of one.