CDH3 beta 2 includes Apache Pig 0.7.0, the latest and greatest version of the popular dataflow programming environment for Hadoop. In this post I’ll review some of the bigger changes that went into Pig 0.7.0, describe the motivations behind these changes, and explain how they affect users. Readers in search of a canonical list of changes in this new version of Pig should consult the Pig 0.7.0 Release Notes as well as the list of backward incompatible changes.
The biggest change to appear in Pig 0.7.0 is the complete redesign of the LoadFunc and StoreFunc interfaces. The Load-Store interfaces were first introduced in version 0.1.0 and have remained largely unchanged up to this point. Pig uses a concrete instance of the LoadFunc interface to read Pig records from the underlying storage layer, and similarly uses an instance of the StoreFunc interface when it needs to write a record. Pig provides different LoadFunc and StoreFunc implementations in order to support different storage formats, and since this is a public interface users may provide their own implementations as well.
The primary motivation for redesigning these interfaces is to bring them into closer alignment with Hadoop’s InputFormat and OutputFormat interfaces, with the goal of making it much easier to write new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes. At the same time the new interfaces were also made a lot more powerful by providing direct access to configurations as well as the ability to selectively read individual columns.
In the short span of time since these new interfaces appeared the Pig community has responded by writing a variety of custom Loaders including ones for Cassandra, Voldemort, and Hive’s RCFile columnar storage format. It is important to note that these new plugins were written without any direct involvement from the Pig core team, which is a significant validation of the work that went into the redesign effort. A list of third-party Pig Loaders is maintained on the Pig Intoperability page. Users who are interested in writing their own LoadFuncs or StoreFuncs should first read the updated Load-Store HowTo.
If you are upgrading from an earlier version of Pig you need to be aware that the new Load/Store interfaces are not backward compatible with the old interfaces. Users who have written custom LoadFuncs or StoreFuncs that work with an earlier version will need to upgrade these functions to use the new interfaces. For more details about this process please consult the Load-Store Migration Guide on the Pig wiki.
Use the Distributed Cache to Improve Performance
Pig 0.7.0 includes a set of important performance enhancements that aim to make queries run faster by leveraging Hadoop’s Distributed Cache. The key observation that motivated these changes is that Pig query plans often involve directing a large number of tasks to read the same sample of data. One can observe this access pattern in the Fragment-Replicate Join, SkewedJoin, and GroupBy operators. Earlier versions of Pig read this data directly from the underlying distributed file system, an approach that is inefficient, but also has the potential to cause a cluster-wide failure if a large number of concurrent Map tasks swamp the NameNode with read requests. PIG-872 and PIG-1218 remedy this problem by loading the common data into the Distributed Cache. This allows tasks to perform a local disk read instead of having to wait while the data is retrieved from HDFS, and also allows tasks that run on the same node to share the same data.
Use Hadoop’s Local Mode for Pig Local Mode
One of things that has made Pig especially easy for new users to pick up is its support for a local mode that does not require an Hadoop installation. Unfortunately, maintaining this feature has turned into a major headache for the Pig developers as it requires a large body of custom code and execution paths that are not shared with the rest of the system. A direct consequence of this is that many of the new features that have been added to Pig do not work in local mode, and this has caused a lot of confusion within the Pig user community. Based on these factors the Pig developers decided that it made sense to replace Pig’s custom local mode implementation with one that depends on Hadoop’s local mode. This change benefits Pig users since they can now test a script in local mode and be confident that it will run correctly in distributed mode, or vice-versa. However, users should be aware that there is one unfortunate side-effect of this change: Pig now runs roughly an order of magnitude slower in local mode.
Making Pig 0.7.0 Even Better
Pig 0.7.0 was released in mid-May, and since that time several important patches have appeared for bugs that were found in the original release. These patches include a fix that allows UDFs to access counters, as well as another fix that adds a counter to track the number of output rows in each output file. I think you’ll be glad to hear that we have included these patches as well as others in the version of Pig 0.7.0 that is included in CDH3 beta 2.