Category Archives: Data Science

A Guide to Python Frameworks for Hadoop

Categories: Data Science Hadoop

I recently joined Cloudera after working in computational biology and genomics for close to a decade. I do most of my analytical work in Python, using its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in, and for, Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

Read more

This Month in Data Science

Categories: Careers Data Science Training

Data science has been a ubiquitous topic of conversation in the IT and business worlds throughout November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs from the past four weeks:

  • As part of its annual “Best Jobs 2012” feature, CNNMoney called data science one of the “best new jobs in America” – right up there with “video game designer”

Read more

What’s New in CDH4.1 Mahout

Categories: CDH Data Science Mahout

Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among its components, Apache Mahout is a relatively recent addition to CDH (first added in CDH3u2 in 2011), but it is already attracting increasing interest in the field.

Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering,

Read more

See You at Data Science Day (Nov. 29, New York)!

Categories: Data Science Impala

[Updated Nov. 26, 2012: Sorry, this event has reached capacity and is now closed.]

Please join us in New York on Nov. 29, 2012, for a unique opportunity to hear from industry icons Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah) and Josh Wills (@josh_wills) as they discuss their approach to Data Science and how it transformed business for companies like Facebook, Yahoo! and Google. You will also hear more about Cloudera Enterprise: The Platform for Big Data powered by Cloudera Impala,

Read more

Training a New Generation of Data Scientists

Categories: Data Science General Training

Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.

Why is Cloudera offering data science training?

The primary bottleneck to the success of Hadoop is the number of people who are capable of using it effectively to solve business problems.

Read more