Cloudera Blog · General Posts
Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and
Distcp jobs; it also has a Shell action and a Java action. These last two actions allow us to execute any arbitrary shell command or Java code, respectively.
In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.
Example Use Case
Suppose we’d like to design a workflow that determines which earthquakes from the last 30 days have a magnitude greater than or equal to that of the largest earthquake in the last hour; also, we’d like to run this workflow every hour. One last requirement for our workflow is that in order to save bandwidth and time, we’d like to be able to skip downloading and processing the 30 days of earthquake data if there were no “large” earthquakes within the last hour; because “large” is subjective, we’ll just go with 3.2 for this example but we should make this easy to configure.
The following guest post is provided by Aaron Kimball, CTO of WibiData.
The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that offer developers a handle on building Big Data Applications. In addition to the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR. KijiMR allows KijiSchema users to use MapReduce techniques including machine-learning algorithms and complex analytics to develop many kinds of applications using data in KijiSchema. Read on to learn more about the major features included in KijiMR and how you can use them.
KijiMR offers developers a set of new processing primitives explicitly designed for interacting with complex table-oriented data. The low-level batch interfaces available in MapReduce include basic InputFormat and OutputFormat implementations. The raw APIs are designed for processing key-value pairs stored in flat files in HDFS. Integrating MapReduce with HBase via InputFormat and OutputFormat APIs is hard to do from scratch in every algorithm. In KijiMR, we have extended the available MapReduce APIs to include:
Hue lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features a file browser for HDFS, an Apache Oozie Application for creating workflows of data processing jobs, a job designer/browser for MapReduce, Apache Hive and Cloudera Impala query editors, a Shell, and a collection of Hadoop APIs.
The goal of this release was to add a set of new features and improve the user experience. Read on for a list of the major changes (from 304 commits).
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
We’re now supporting several hundred production Hadoop clusters. In doing so we’ve had to make a lot of advances in the functionality, reliability and manageability of the Hadoop platform. Even with these improvements, customers have been traditionally reluctant to run certain data and applications on the Apache Hadoop platform. The new products we are announcing today were designed to remove these obstacles to adoption.
Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements the more noteworthy changes include:
Last Thursday, I had the pleasure of attending the Crunchies with Alan Saldich (our VP of Marketing) and Sarah Mustarde (our Senior Director of Corporate Marketing). The Crunchies is an awards event a la “the Oscars” but for startups.
Cloudera was nominated for the “Sexiest Enterprise Startup” award. This was the 6th Annual Crunchies but the first time that TechCrunch had a slot for enterprise software, as Aneel Bhusri emphasized during the intro for the enterprise award. Historically, the Crunchies focused on social and mobile startups, aka the non-sexy startups :)
The following is a series of stories from people who in the recent past worked as Engineering Interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, such as in the form of these internships, helps promote big data research at universities. (These experiences were previously published in the ACM student journal, XRDS.)
Yanpei Chen (Intern 2011)
I Interned with Cloudera during my last summer of grad school. My dissertation was on “Workload Driven Design and Evaluation of Large-Scale Data-Centric Systems”, and I already had collaborations with Facebook and NetApp, two other big data companies. The goal of my work was to develop and demonstrate a set of empirical, workload-driven design and evaluation methods that complemented the traditional, subjective approach of designing by intuition and experience. It was very important that these methods generalized across many types of customer workloads. Hence, when Cloudera offered me an internship, I leapt at the unique opportunity to collect insights from customers in traditional industries who were still dealing with big data.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
Like most other companies, at Stripe it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
Up to now, the answer has often been Apache Hive, which at least made it easy to express many of these queries. Unfortunately, Hive queries are typically very slow. Cloudera Impala provides a similar front-end while being orders of magnitude faster, and we’ve found it immensely useful in many different situations at Stripe. With the near real-time results, the notion of performing programmatic (and not just ad-hoc) queries has now become more attractive.
Programmatic Access with Ruby
Join us February 26 – 28 at Strata Santa Clara and learn how Cloudera has developed the de facto Platform for Big Data. Visit Cloudera in booth 701 to hear from our team in a series of presentations and partner-integrated demonstrations – agenda coming soon. We will also be hosting several celebrated authors of the Big Data community who will be available to sign copies of their published works and converse with you about the Big Data environment and specific projects within the Big Data space. Not only will our booth be packed with a full speaker and demonstration line-up and several authors, you will also have the opportunity to meet Doug Cutting. Doug helped found the Hadoop project and coined the project “Hadoop” after his son’s stuffed elephant.
Within the Strata Santa Clara event agenda look for Cloudera speakers presenting a number of topics:
Tuesday, February 26th
The Hadoop Community is an invariably fascinating world. After all, as Clouderan ATM put it in a past blog post, the user group meetups are adorably called “HUGs.” Just as the Cloudera blog has introduced you to some of the engineers, projects, and applications that serve as the head, heart, and hands of the Hadoop Community, we’re proud to add the circulatory system (to extend the metaphor), made up of Cloudera’s expert trainers and curriculum developers who bring Hadoop to new practitioners around the world every week.
Welcome to the first installment of our “Meet the Instructor” series, in which we briefly introduce you to some of the individuals endeavoring to teach Hadoop far and wide. Today, we speak to Jesse Anderson (@jessetanderson)!
What is your role at Cloudera?
I joined Cloudera about a year ago as a curriculum developer and instructor. I get the best of both worlds in educational services: I create and improve existing curriculum, such as the Cloudera Manager series, and I travel to teach the courses.