Cloudera Developer Blog · Data Science Posts
Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.
Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.
UPDATED 20130424: The new RHadoop treats output to Streaming a bit differently, so
do.trace=FALSE must be set in the
UPDATED 20130408: Antonio Piccolboni, the author of RHadoop, has improved the code somewhat using his substantially greater experience with R. The most material change is that the latest version of RHadoop can bind multiple calls to keyval correctly.
Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.
But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.
I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.
In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:
Data science has been a ubiquitous topic of conversation in the IT and business worlds across the month of November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs in the past 4 weeks:
Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field.
Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining etc.).
[Updated Nov. 26, 2012: Sorry, this event has reached capacity and is now closed.]
Please join us in New York on Nov. 29, 2012, for a unique opportunity to hear from industry icons Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah) and Josh Wills (@josh_wills) as they discuss their approach to Data Science and how it transformed business for companies like Facebook, Yahoo! and Google. You will also hear more about Cloudera Enterprise: The Platform for Big Data powered by Cloudera Impala, which takes Hadoop “beyond batch” and into the world of real-time interactivity.
Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.
Why is Cloudera offering data science training?
The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.
We at Cloudera are tremendously excited by the power of data to effect large-scale change in the healthcare industry. Many of the projects that our data science team worked on in the past year originated as data-intensive problems in healthcare, such as analyzing adverse drug events and constructing case-control studies. Last summer, we announced that our Chief Scientist Jeff Hammerbacher would be collaborating with the Mt. Sinai School of Medicine to leverage large-scale data analysis with Apache Hadoop for the treatment and prevention of disease. And next week, it will be my great pleasure to host a panel of data scientists and researchers at the Strata Rx Conference (register with discount code SHARON for 25% off) to discuss the meaningful use of natural language processing in clinical care.
Of course, the cost-effective storage and analysis of massive quantities of text is one of Hadoop’s strengths, and Jimmy Lin’s book on text processing is an excellent way to learn how to think in MapReduce. But a close study of how the applications of natural language processing technology in healthcare have evolved over the last few years is instructive for anyone who wants to understand how to use data science in order to tackle seemingly intractable problems.
Lesson 1: Choose the Right Problem
You may have noticed that Harvard Business Review is calling data science “the sexiest job of the 21st century.” So our answer to the question is: Hot. Definitely hot.
If you need an explanation, watch the “Definition of a Data Scientist” talk embedded below from Cloudera data science director Josh Wills, which was hosted by Cloudera partner Lilien LLC recently in Portland, Ore. The key take-away is, you don’t literally have to be a “scientist,” just someone with the curiosity of one.