Cloudera Developer Blog · Data Science Posts
Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)
The new RImpala package brings the speed and interactivity of Impala to queries from R.
Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
Our thanks to Concurrent Inc. for the how-to below about using Cascading Pattern with CDH. Cloudera recently tested CDH 4.4 with the Cascading Compatibility Test Suite verifying compatibility with Cascading 2.2.
Cascading Pattern is a machine-learning project within the Cascading development framework used to build enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other computing topologies that allows enterprises to leverage existing skills and resources to build data processing applications on Hadoop, without the need for specialized Hadoop skills.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.
It’s common to hear people describe themselves as being “left-brained” or “right-brained” based on their tendency to be more logical and mathematically driven (left-brained), or, conversely, to be intuitive and creatively driven (right-brained). For example, people who prefer math over art are often considered left-brained. People who get a higher verbal score on their SATs than for math are often considered right-brained.
In general, language and creative writing are considered right-brained exercises. Many people also associate marketing and advertising as a right-brained function, whereas engineering is considered very left-brained.
In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.
What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.
To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”
Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.
What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.
And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.
What is Old is New Again
Data-savvy companies of all sizes can now accomplish many viable machine learning projects.
On Monday April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.
Over the past few months, Cloudera has been hard at work along with the SAS team to integrate a number of SAS products with Apache Hadoop, delivering the ability for our customers to use these tools in their interaction with data on the Cloudera platform. In this post, we will delve into the major mechanisms that are available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.
SAS/ACCESS to Hadoop
Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:
Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.