Author Archives: Josh Wills (@josh_wills)

New in Cloudera Labs: Google Cloud Dataflow on Apache Spark

Categories: Cloudera Labs, Spark

Cloudera and Google are collaborating to bring Google Cloud Dataflow to Apache Spark users (and vice versa). This new project is now incubating in Cloudera Labs!

“The future is already here—it’s just not evenly distributed.” —William Gibson

For the past decade, a lot of the future has been concentrated at Google’s headquarters in Mountain View. Because of the scale of its operations, Google usually bumped up against the limitations of the current state-of-the-art before anyone else,

Read More

How-to: Count Events Like a Data Scientist

Categories: Data Science, How-to, Use Case

The ability to quickly and accurately count complex events is a legitimate business advantage.

In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:

Understand that what big data really means is to be able to count things in data sets of any size,

Read More

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Categories: Data Science, How-to

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years: 

Say you have a stream of items of large and unknown length that we can only iterate over once.
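The technique named in the post title can be sketched in a few lines of Python. This is a minimal sketch, assuming the task is to draw k items uniformly at random from such a stream in a single pass; the function name `reservoir_sample` and its parameters are illustrative and do not come from the post.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from a one-pass stream.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 items from a generator whose length the function never sees.
    print(reservoir_sample((x * x for x in range(10_000)), 5))
```

The sketch keeps only k items in memory at any time, which is what makes it usable when the stream is too long to store. If the stream holds fewer than k items, it simply returns all of them.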

Read More

Cloudera ML: New Open Source Libraries and Tools for Data Scientists

Categories: Community, Data Science, General, Mahout, MapReduce, Tools

Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid, though.

Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s GitHub repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP,

Read More

Training a New Generation of Data Scientists

Categories: Data Science, General, Training

Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.

Why is Cloudera offering data science training?

The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems.

Read More