Category Archives: Data Science

How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)

Categories: Data Science How-to

UPDATED 20130424: The new RHadoop treats output to Streaming a bit differently, so do.trace=FALSE must be set in the randomForest call.

UPDATED 20130408: Antonio Piccolboni, the author of RHadoop, has improved the code somewhat using his substantially greater experience with R. The most material change is that the latest version of RHadoop can bind multiple calls to keyval correctly.

Internet-scale data sets present a unique challenge to traditional machine-learning techniques,

Read more

Save 15% on Multi-Course Public Training Enrollments in January and February

Categories: Data Science Hadoop HBase Hive Pig Training

Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.

Read more

A Guide to Python Frameworks for Hadoop

Categories: Data Science Hadoop

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

Read more

This Month in Data Science

Categories: Careers Data Science Training

Data science has been a ubiquitous topic of conversation in the IT and business worlds throughout November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs over the past four weeks:

  • As part of its annual “Best Jobs 2012” feature, CNNMoney called data science one of the “best new jobs in America” – right up there with “video game designer”

Read more

What’s New in CDH4.1 Mahout

Categories: CDH Data Science Mahout

Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among its components, Apache Mahout is a relatively recent addition to CDH (first added in CDH3u2 in 2011), but it is already attracting increasing interest in the field.

Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering,

Read more