Category Archives: Data Science

Get Hired as a Certified Data Scientist

Categories: Data Science Training

To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”

Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.

It seems like all I need is an analytics platform like Apache Hadoop and a big pile of data,

Read more

Myrrix Joins Cloudera to Bring "Big Learning" to Hadoop

Categories: Data Science Hadoop Mahout

What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.

And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London,

Read more

How the SAS and Cloudera Platforms Work Together

Categories: CDH Data Science Hadoop Impala

On Monday, April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.

Over the past few months, Cloudera has been hard at work with the SAS team to integrate a number of SAS products with Apache Hadoop, so that our customers can use these tools to work with data on the Cloudera platform.

Read more

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Categories: Data Science How-to

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years: 

Say you have a stream of items of large and unknown length that we can only iterate over once.
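
The excerpt stops before the task is stated, so the following is only a minimal sketch of the technique named in the title. It assumes the usual goal: keep a uniform random sample of `k` items from a stream that can be read only once. The function name `reservoir_sample` and its parameters are illustrative and do not come from the original post.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from a single pass over stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # The stream length is never needed up front; a generator works just as well.
    print(reservoir_sample((x * x for x in range(10_000)), k=5))
```

Only the `k` retained items are held in memory, which is why the approach suits streams of unknown length.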

Read more

How-to: Analyze Twitter Data with Hue

Categories: Data Science Flume Hive How-to Hue

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way.

Read more