Cloudera Developer Blog · Data Science Posts
Our thanks to Concurrent Inc. for the how-to below about using Cascading Pattern with CDH. Cloudera recently tested CDH 4.4 with the Cascading Compatibility Test Suite, verifying compatibility with Cascading 2.2.
Cascading Pattern is a machine-learning project within the Cascading development framework used to build enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other computing topologies that allows enterprises to leverage existing skills and resources to build data processing applications on Hadoop, without the need for specialized Hadoop skills.
Pattern, in particular, builds on an industry standard called Predictive Model Markup Language (PMML), which allows data scientists to use their favorite statistical and analytics tools (such as R, SAS, Oracle, and so on) to export predictive models and quickly run them on data sets stored in Hadoop. Pattern’s benefits include reduced development costs, time savings, and reduced licensing issues at scale – all while making use of Hadoop clusters, the core competencies of analytics staff, and existing intellectual property in the predictive models.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared-nothing parallelization of analytics with the new user-defined aggregations (UDA) framework in Impala 1.2 to achieve big data scalability. This initial release supports logistic regression, support vector machines (SVMs), and linear regression.
Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.
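To make the shared-nothing idea concrete, here is a minimal Python sketch of the init/update/merge/finalize life cycle that an aggregation of this kind follows. It is an illustration only, not the released Impala code: the `RegressionState` class and the function names are hypothetical, and it fits a one-variable linear regression from running sums, which may differ from the technique the actual algorithms use.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RegressionState:
    """Running sums that are sufficient to fit y = intercept + slope * x."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0


def init() -> RegressionState:
    # Each worker starts from an empty state; nothing is shared.
    return RegressionState()


def update(state: RegressionState, x: float, y: float) -> RegressionState:
    # Fold one row into the worker's local state.
    state.n += 1
    state.sum_x += x
    state.sum_y += y
    state.sum_xx += x * x
    state.sum_xy += x * y
    return state


def merge(a: RegressionState, b: RegressionState) -> RegressionState:
    # Combine two partial states; the sums simply add.
    return RegressionState(
        a.n + b.n,
        a.sum_x + b.sum_x,
        a.sum_y + b.sum_y,
        a.sum_xx + b.sum_xx,
        a.sum_xy + b.sum_xy,
    )


def finalize(state: RegressionState) -> Tuple[float, float]:
    # Solve the least-squares normal equations from the merged sums.
    denom = state.n * state.sum_xx - state.sum_x ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return intercept, slope
```

Because `merge` only adds sums, the rows can be split across any number of workers in any order and the final answer is the same as a single pass over all the data.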
It’s common to hear people describe themselves as “left-brained” or “right-brained” based on their tendency to be more logical and mathematically driven (left-brained) or, conversely, more intuitive and creatively driven (right-brained). For example, people who prefer math to art are often considered left-brained. People who score higher on the verbal section of the SAT than on the math section are often considered right-brained.
In general, language and creative writing are considered right-brained exercises. Many people also regard marketing and advertising as right-brained functions, whereas engineering is considered very left-brained.
But Big Data is changing this. Many companies are applying math and engineering to creative writing and marketing in order to optimize marketing campaigns’ results. Persado has actually built its business around this idea.
In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.
What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.
My first project, analyzing adverse drug events, involved lots of Apache Pig programming. I liked Pig’s data flow programming model, but I didn’t enjoy writing user-defined functions: I had to step out of vim, switch to an IDE, read a lot of Javadoc about Pig’s internal data model, code for a while, compile something, switch back to vim, find bugs, go back to the IDE, and so on and so on. To this day, I am a big fan of Apache Hive and Pig right up to the point where I have to write a UDF, and then I die a little bit inside. (That said, the StreamingQuantile function that I adapted from Sawzall and contributed to DataFu is arguably the most useful thing I’ve done at Cloudera.)
To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”
Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.
It seems like all I need is an analytics platform like Apache Hadoop and a big pile of data, and actionable insights will just leap out at me, right? Well… not quite. Hadoop makes the difficult easy and the impossible merely difficult. However, we still have to know what we’re looking for and, once we’ve found it, understand what the results mean.
What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.
And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.
What is Old is New Again
Data-savvy companies of all sizes can now accomplish many viable machine learning projects.
Why the fuss now about machine learning, a decades-old field? I started working on recommender systems relatively late, in 2005, as the open-source project Taste. In 2008, this was merged into the open source machine learning project Apache Mahout, and rebuilt on top of a nascent Hadoop project. Yet as a committer and part of the Mahout PMC, I have watched interest in machine learning suddenly reignite, and skyrocket, as interest in this new Hadoop thing did.
On Monday April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.
Over the past few months, Cloudera has been hard at work along with the SAS team to integrate a number of SAS products with Apache Hadoop, delivering the ability for our customers to use these tools in their interaction with data on the Cloudera platform. In this post, we will delve into the major mechanisms that are available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.
SAS/ACCESS to Hadoop
SAS/ACCESS lets SAS natively access data sets stored in Hadoop. With SAS/ACCESS to Hadoop:
Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:
Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.
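For readers who want to check their own answer, one well-known approach to this kind of question is reservoir sampling with a reservoir of size one. The sketch below is a minimal Python version under that assumption; the function name `pick_uniform` and the optional `rng` parameter are illustrative choices, not something taken from the post.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def pick_uniform(stream: Iterable[T],
                 rng: Optional[random.Random] = None) -> Optional[T]:
    """Return one item chosen uniformly at random in a single pass.

    Returns None if the stream is empty.
    """
    rng = rng or random.Random()
    chosen: Optional[T] = None
    for count, item in enumerate(stream, start=1):
        # Keep the new item with probability 1/count; otherwise keep
        # the current choice. The first item is always kept.
        if rng.randrange(count) == 0:
            chosen = item
    return chosen
```

The reasoning is short: item i is picked with probability 1/i when it arrives and then survives each later item j with probability (j - 1)/j, so after n items its overall chance is 1/n regardless of where it appeared in the stream.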
Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.
The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.
The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.
Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.
Our hosts helpfully suggested some datasets, including the DEBS 2013 Grand Challenge data. This dataset contains the positions of all the players and the ball during a football match; our project was to plot the positions of a given player over a given span of time onto a map of the field, creating a heatmap of how much time that player spent at each position.
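The heatmap idea boils down to binning position samples into a grid over the field. Here is a minimal Python sketch of that binning step, assuming the positions for one player and time span have already been queried and are sampled at a fixed rate, so that the count in a cell is proportional to time spent there. The function name, the grid size, and the coordinate ranges in the usage example are hypothetical, not taken from the dataset or from our project code.

```python
from typing import Iterable, List, Tuple


def position_heatmap(positions: Iterable[Tuple[float, float]],
                     x_range: Tuple[float, float],
                     y_range: Tuple[float, float],
                     nx: int,
                     ny: int) -> List[List[int]]:
    """Count position samples per cell of an ny-by-nx grid over the field."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in positions:
        # Ignore samples that fall outside the field boundaries.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Scale to a cell index; clamp so the far edge lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[row][col] += 1
    return grid


# Usage with made-up coordinates on a 10-by-5 field split into 2-by-2 cells.
example = position_heatmap(
    [(0, 0), (0.5, 0.5), (9.9, 4.9), (10, 5), (20, 20)],
    x_range=(0, 10), y_range=(0, 5), nx=2, ny=2,
)
```

The resulting grid of counts can then be handed to any plotting tool to color the field.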