Cloudera Developer Blog · Data Science Posts
The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.
Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.
Our hosts helpfully suggested some datasets, including the DEBS 2013 Grand Challenge data. This dataset contains the positions of all the players and the ball during a football match; our project was to plot the data for a given player and span of time onto a map of the field, creating a heatmap of how much time that player spent at different positions.
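To give a rough idea of the heatmap computation, here is a minimal sketch in plain Python. It is not the code from Data Hacking Day (that project queried the data through Impala). It assumes the position data for one player has already been loaded as `(timestamp, x, y)` tuples and that samples arrive at a roughly constant rate, so the number of samples in a grid cell is proportional to the time spent there. The function name `build_heatmap`, the field dimensions, and the sample values are all hypothetical.

```python
from collections import Counter


def build_heatmap(samples, x_range, y_range, nx, ny, t_start, t_end):
    """Count position samples per grid cell within [t_start, t_end).

    samples: iterable of (timestamp, x, y) tuples for one player
             (assumed format, not the actual DEBS schema).
    x_range, y_range: (min, max) extents of the field.
    nx, ny: number of grid columns and rows.
    Returns a Counter keyed by (row, col).
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    cell_w = (x_max - x_min) / nx
    cell_h = (y_max - y_min) / ny

    counts = Counter()
    for t, x, y in samples:
        # Keep only samples inside the requested time span.
        if not (t_start <= t < t_end):
            continue
        # Skip readings that fall outside the field.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Clamp so that points exactly on the far edge land in the last cell.
        col = min(int((x - x_min) / cell_w), nx - 1)
        row = min(int((y - y_min) / cell_h), ny - 1)
        counts[(row, col)] += 1
    return counts


# Made-up samples on a hypothetical 10 x 5 field split into two columns.
samples = [
    (0, 1.0, 1.0),
    (1, 1.5, 1.2),
    (2, 9.0, 4.0),
    (3, 50.0, 4.0),   # outside the field, ignored
    (10, 1.0, 1.0),   # outside the time span, ignored
]
heatmap = build_heatmap(samples, (0.0, 10.0), (0.0, 5.0),
                        nx=2, ny=1, t_start=0, t_end=5)
```

Each count can then be mapped to a color intensity and drawn over an image of the field. With an SQL engine, the same binning could be expressed as a `GROUP BY` over the computed row and column.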
In this Charlie Rose interview that aired on March 22, 2013, Cloudera’s Chief Scientist Jeff Hammerbacher (@hackingdata) offers fascinating insights into the origins of Big Data and data science techniques at Google and their re-implementation as open source software used by consumer Web companies. He also goes into great detail about their positive application across healthcare diagnostics and delivery – as well as the overall need for a better balance between “numerical imagination” and “narrative imagination” in everything we do (in order to “ask bigger questions”, as some would say).
It’s an incredibly valuable look into where Big Data came from, where it’s going, and how Cloudera is helping it get there.
Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid, though.
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s GitHub repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.
Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.
Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.
UPDATED 20130424: The new RHadoop treats output to Streaming a bit differently, so do.trace=FALSE must be set in the
UPDATED 20130408: Antonio Piccolboni, the author of RHadoop, has improved the code somewhat using his substantially greater experience with R. The most material change is that the latest version of RHadoop can bind multiple calls to keyval correctly.
Internet-scale data sets present a unique challenge to traditional machine-learning techniques, such as fitting random forests or “bagging”. In order to fit a classifier to a large data set, it’s common to generate many smaller data sets derived from the initial large data set (i.e., resampling). There are two reasons for this:
Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.
But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.
For a limited time, Cloudera University is offering a 15% discount when you register for two or more Hadoop training courses to help you build out and realize your Big Data plan. Cover the basics with Developer or Administrator training, move beyond the HDFS and MapReduce core by pairing Developer and HBase training, work towards machine learning with Hive and Pig training and Introduction to Data Science, or customize your own learning path. Just use discount code 15off2 when you register for multiple public training classes from Cloudera University. This offer is only available for new enrollments and is only valid for classes delivered by Cloudera and scheduled to begin before March 1, 2013.
I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.
In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:
Data science has been a ubiquitous topic of conversation in the IT and business worlds across the month of November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs in the past 4 weeks:
Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field.
Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining, etc.).
In CDH4.1, Mahout is upgraded to upstream version 0.7. Several new changes are included in this release, and this post will briefly go over some of the interesting ones.
Outlier Removal Capability
[Updated Nov. 26, 2012: Sorry, this event has reached capacity and is now closed.]
Please join us in New York on Nov. 29, 2012, for a unique opportunity to hear from industry icons Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah) and Josh Wills (@josh_wills) as they discuss their approach to Data Science and how it transformed business for companies like Facebook, Yahoo! and Google. You will also hear more about Cloudera Enterprise: The Platform for Big Data powered by Cloudera Impala, which takes Hadoop “beyond batch” and into the world of real-time interactivity.
All are welcome – however, quantitative analysts, Hadoop users/developers, business management or those involved in business intelligence and enterprise data warehousing projects would benefit greatly from attending.