Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
History teaches us that ecosystem growth is fueled by enthusiasm, tools (including frameworks and APIs), and knowledge in roughly equal measures. To this point, the Apache Hadoop ecosystem has been blessed with the first two ingredients – thanks to the magic of open source – but in the third category, there is still plenty of work to be done.
For Cloudera, our Academic Partnership program is a major part of that effort. Through that program, accredited nonprofit universities around the world get access to Cloudera’s own Hadoop curriculum for their computer science departments, in addition to discounted training and certification for students and instructors. Thus far, Cloudera Academic Partners include (but are not limited to) San Jose State University, DePaul University, Fordham University, Vanderbilt University, Technische Universität Berlin, and Rensselaer Polytechnic Institute.
Welcome to our third edition of “This Month in the Ecosystem,” a digest of highlights from September 2013 (never intended to be comprehensive; for completeness, see Hadoop Weekly).
Note: there were a few other interesting developments this week, but out of respect for the calendar, I’ll address them next month.
It’s common to hear people describe themselves as being “left-brained” or “right-brained” based on their tendency to be more logical and mathematically driven (left-brained) or, conversely, more intuitive and creatively driven (right-brained). For example, people who prefer math over art are often considered left-brained, while people who score higher on the verbal section of the SAT than on the math section are often considered right-brained.
In general, language and creative writing are considered right-brained exercises. Many people also think of marketing and advertising as right-brained functions, whereas engineering is considered decidedly left-brained.
But Big Data is changing this. Many companies are applying math and engineering to creative writing and marketing in order to optimize marketing campaigns’ results. Persado has actually built its business around this idea.
In this installment of “Meet the Project Founder,” we speak with Josh Wills (@josh_wills), Cloudera’s Senior Director of Data Science and founder of Apache Crunch and Cloudera ML.
What led you to your project idea(s)?
When I first started at Cloudera in 2011, I had a fairly vague job description, no real responsibilities, and wasn’t all that familiar with the Apache Hadoop stack, so I started working on various pet projects in order to learn more about the tools and the use cases in domains like healthcare and energy.
My first project, analyzing adverse drug events, involved lots of Apache Pig programming. I liked Pig’s data flow programming model, but I didn’t enjoy writing user-defined functions: I had to step out of vim, switch to an IDE, read a lot of Javadoc about Pig’s internal data model, code for a while, compile something, switch back to vim, find bugs, go back to the IDE, and so on and so on. To this day, I am a big fan of Apache Hive and Pig right up to the point where I have to write a UDF, and then I die a little bit inside. (That said, the StreamingQuantile function that I adapted from Sawzall and contributed to DataFu is arguably the most useful thing I’ve done at Cloudera.)
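To make that friction concrete: even a trivial Pig UDF means writing and compiling a separate Java class against Pig’s data model. The sketch below shows the canonical shape of an eval function (the class name and behavior are illustrative, not from the post):

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A minimal Pig eval UDF: upper-cases its single chararray argument.
// Pig hands every call's arguments to exec() packed in a Tuple, so the
// UDF author must unpack and cast fields from Pig's internal data model.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // Pig convention: null in, null out
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

In a Pig script this would then be registered and invoked with something like `REGISTER myudfs.jar;` followed by `GENERATE ToUpper(name);`, which is exactly the IDE-to-vim round trip described above.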
I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.
I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the number of hands-on exercises from the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.
While developing the course, I collaborated with some of the most knowledgeable Hadoop administrators I could find, including Eric Sammer, Amandeep Khurana, Kathleen Ting, Romain Rigaux, and many other smart folks at Cloudera. The upgrades to the curriculum and exercises are based on best practices used to resolve our customers’ biggest problems. These insights inform the entire course, including our determination that administrators should learn installation, configuration, maintenance, monitoring, and troubleshooting using the standard Hadoop tools. Although we certainly hope that Hadoop users take advantage of Cloudera Manager to simplify and streamline many of these tasks, we believe that every good administrator needs to first take a look under the hood and tinker with Hadoop’s guts. There’s no replacement for get-your-hands-dirty experience to achieve expertise.
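As a taste of that under-the-hood work, the exercises lean on the standard Hadoop command-line tools. The commands below are illustrative of the kind of tooling involved; exact output depends on your cluster:

```shell
# Check HDFS health and report corrupt or under-replicated blocks.
hadoop fsck /

# Summarize cluster capacity and the status of each DataNode.
hadoop dfsadmin -report

# List currently running MapReduce jobs.
hadoop job -list
```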
Cluster in the Cloud
In December 2012, we described Cloudera Support Interface (CSI), an internal application built on CDH that drastically improves Cloudera’s ability to support our customers, as a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post introduces the basic concepts of the bulk loading feature, presents two use cases, and walks through two examples.
Overview of Bulk Loading
If you have any of these symptoms, bulk loading is probably the right choice for you:
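As a rough sketch of the workflow, a typical bulk load first runs a MapReduce job that writes HFiles (HBase’s on-disk format) rather than issuing individual Puts, then hands those files to HBase. The table name, column family, and paths below are placeholders:

```shell
# Step 1: run a MapReduce job that emits HFiles instead of Puts.
# ImportTsv ships with HBase; -Dimporttsv.bulk.output redirects its
# output from live writes to HFiles under the given HDFS directory.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,f:c1,f:c2 \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  mytable /input/data.tsv

# Step 2: move the generated HFiles into the table's regions, which is
# far cheaper than replaying the same data through the write path.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable
```

This two-step shape (prepare HFiles, then load them) is what makes bulk loading fast: it skips the write-ahead log and MemStore entirely.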
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Common use cases for email analysis include fraud detection, customer sentiment and churn analysis, and lawsuit prevention, and that’s just the tip of the iceberg. Every company can extract tremendous value based on its own business needs.
A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.
In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.
But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of feedback and validation about Impala, impressive in its quality as well as its quantity. To date, people from approximately 4,500 unique organizations around the world have downloaded the Impala binary. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription, including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.
Furthermore, based on the reaction from other vendors in the data management space, few observers would dispute the notion that Impala has made low-latency, interactive SQL queries for Hadoop as important a customer requirement as the high-latency, batch-oriented SQL queries enabled by Apache Hive. That’s a great development for Hadoop users everywhere!
What Was Delivered in Impala 1.0/1.1
This week’s Cloudera Sessions roadshow will make it to Denver, Colo., on Thursday, where the customer Fireside Chat will feature Wes Caldwell, Chief Architect of Global Enterprise Solutions at Intelligent Software Solutions (ISS). ISS helps many government organizations, including several within the U.S. Department of Defense, deploy next-generation data management and analytic solutions using a combination of systems integration expertise and custom-built software.
During the Fireside Chat, Cloudera COO Kirk Dunn will join Wes in a conversation about the business use cases for Hadoop that ISS sees most often in the field, which fall primarily into two buckets: batch analytics and real-time applications. Wes will also share his thoughts on some of the more recent innovations within the Apache Hadoop ecosystem, such as Cloudera Impala and Solr integrations.
If you are local to the Denver area, it’s not too late to register for Thursday’s event. If you’re planning to attend and would like to suggest topics or questions for discussion, particularly during the Fireside Chat with Wes, comment here or tweet using hashtag #ClouderaSessions.