Cloudera Developer Blog · Pig Posts
This is a guest post contributed by Dmitriy Ryaboy (@squarecog) and was originally published in his blog on December 19th. We thought the information was valuable enough that it was worth reposting to spread the word even further.
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects we have the training you are looking for.
With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.
If you’re not familiar with CDH3b2, here’s what you need to know.
All versions of CDH provide:
Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig
Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact firstname.lastname@example.org for more information regarding on-site training, or visit www.cloudera.com/hadoop-training to view our public course schedule.
Cloudera’s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.
Our Hive and Pig course is designed for developers who are skilled with SQL or scripting languages, but who are not Java experts. Hive and Pig are two approaches which allow non-Java programmers to access and manipulate massive amounts of data while abstracting away the complexities of MapReduce. Hive offers an SQL-like interface, while Pig’s scripting language, named PigLatin, is very easy for developers learn. This course covers both technologies, and includes multiple hands-on exercises to reinforce key concepts.
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available. While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long running jobs and interfacing with the distributed file system require extensive work to port general workflow systems to the Hadoop environment. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly light-weight, requires minimal configuration, and scales linearly – thus offering a sustainable approach to building workflows in the Hadoop environment.
CDH3 beta 2 includes Apache Pig 0.7.0, the latest and greatest version of the popular dataflow programming environment for Hadoop. In this post I’ll review some of the bigger changes that went into Pig 0.7.0, describe the motivations behind these changes, and explain how they affect users. Readers in search of a canonical list of changes in this new version of Pig should consult the Pig 0.7.0 Release Notes as well as the list of backward incompatible changes.
The biggest change to appear in Pig 0.7.0 is the complete redesign of the LoadFunc and StoreFunc interfaces. The Load-Store interfaces were first introduced in version 0.1.0 and have remained largely unchanged up to this point. Pig uses a concrete instance of the LoadFunc interface to read Pig records from the underlying storage layer, and similarly uses an instance of the StoreFunc interface when it needs to write a record. Pig provides different LoadFunc and StoreFunc implementations in order to support different storage formats, and since this is a public interface users may provide their own implementations as well.
The primary motivation for redesigning these interfaces is to bring them into closer alignment with Hadoop’s InputFormat and OutputFormat interfaces, with the goal of making it much easier to write new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes. At the same time the new interfaces were also made a lot more powerful by providing direct access to configurations as well as the ability to selectively read individual columns.
We’re proud to announce that Cloudera’s Distribution for Hadoop Version 2 (CDH2) is officially released.
We’ve come a long way to get to a production quality release. At the beginning of September we announced the first beta of CDH2. After 6 months of additional testing we announced a release candidate. The release candidate spent over a month hardening in Cloudera’s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use – we are pleased to recommend it to all our production users.
CDH2 is based on Apache Hadoop 0.20 – a release that has been available for almost a year. During this time, the Apache Hadoop community has produced hundreds of bug fixes, improvements and features. Cloudera is proud to have contributed many of these and incorporated them into CDH2. For more information, please review the following resources:
In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights: