What’s New in CDH4.1 Pig

Categories: CDH Pig

Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

Boolean Data Type

Pig Latin is continuously evolving. As with other actively developed programming languages, more data types are being added to Pig, and CDH4.1 adds the boolean type. The boolean type is internally mapped to the Java Boolean class, and the boolean constants ‘TRUE’ and ‘FALSE’ are case-insensitive. Here are some example uses of the boolean type:
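A minimal sketch (the relation name, path, and field names are illustrative):

```pig
-- declare a boolean field directly in the load schema
a = LOAD 'input' AS (x:int, flag:boolean);

-- boolean constants are case-insensitive: true, TRUE, and True all work
b = FILTER a BY flag == true;

-- derive a boolean field from a comparison via the bincond operator
c = FOREACH a GENERATE x, (x > 10 ? true : false) AS is_large:boolean;
```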

Note that if you have UDFs that implement the LoadCaster and StoreCaster interfaces in releases prior to CDH4.1, they will have to be modified to implement new methods called bytesToBoolean() and toBytes(), which were added to LoadCaster and StoreCaster respectively.

Nested CROSS and FOREACH Operators

Pig now supports CROSS and FOREACH within a FOREACH statement in addition to already supported operators such as DISTINCT, FILTER, LIMIT, and ORDER BY. A nested FOREACH is particularly useful when you want to iterate through nested relations grouped by the same key. For example, if you want to compute the Cartesian product of two nested relations co-grouped by the same key and do some processing on them, you can achieve this in a concise manner as follows:
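A sketch of a nested CROSS inside FOREACH, assuming two illustrative relations co-grouped on a shared key:

```pig
-- two datasets sharing a user id key (schemas are illustrative)
users  = LOAD 'users'  AS (uid:int, name:chararray);
visits = LOAD 'visits' AS (uid:int, url:chararray);

grouped = COGROUP users BY uid, visits BY uid;

result = FOREACH grouped {
    -- Cartesian product of the two nested relations for this key
    crossed = CROSS users, visits;
    GENERATE group, crossed;
}
```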

Note that only two levels of nesting are supported.

Ruby UDFs

Scripting UDFs are not new to Pig. In fact, Python and JavaScript have been supported since CDH3. But in CDH4.1, another popular scripting language is added: Ruby. Just as Python UDFs interact with Pig via Jython, Ruby UDFs do the same via JRuby internally.

To register Ruby UDFs in Pig scripts, the REGISTER command is used as follows:
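A sketch of registering and calling a Ruby UDF (the file name, namespace, and function name are illustrative):

```pig
-- make the functions in myudfs.rb available under the namespace 'myudfs'
REGISTER 'myudfs.rb' USING jruby AS myudfs;

a = LOAD 'input' AS (str:chararray);
b = FOREACH a GENERATE myudfs.concat(str, '!');
```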

Similar to Python UDFs, there are two ways of defining the return type of a UDF: declaring it with an ‘outputSchema’ decorator if the return type is static, or defining a schema function if the return type is dynamic. For more details on Ruby UDF decorators, please refer to the Pig 0.10 documentation.
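As a sketch, a Ruby UDF class extends PigUdf and declares return types with decorators (the class, function names, and schemas are illustrative):

```ruby
require 'pigudf'

class MyUdfs < PigUdf
  # static return type: declared once with the outputSchema decorator
  outputSchema "result:chararray"
  def concat(a, b)
    a + b
  end

  # dynamic return type: a schema function derives the output schema
  # from the input schema
  outputSchemaFunction :identitySchema
  def identity(x)
    x
  end

  def identitySchema(input)
    input
  end
end
```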

LIMIT / SPLIT by Expression

Prior to CDH4.1, only constants were allowed in the expressions of LIMIT and SPLIT. In CDH4.1, the expression may also include scalar variables. Here is an example of getting the top 10% of records from a relation using LIMIT:
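A sketch consistent with the explanation below (the input path and field name are illustrative):

```pig
a = LOAD 'input' AS (x:int);
b = GROUP a ALL;
-- c holds a single record with one long field: the total record count
c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY x DESC;
-- c.sum is implicitly cast to a scalar, so it may appear in the expression
e = LIMIT d c.sum / 10;
```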

It is worth noting that only scalar variables can be used in the expression; column references cannot be used. In the above example, ‘c.sum’ is implicitly cast to a scalar since ‘c’ is a relation that contains a single long type record. A statement like e = LIMIT d val is not valid.

Default SPLIT Destination

The SPLIT operator is used to partition records of a relation into multiple relations. Prior to CDH4.1, records that did not meet any condition in the expressions were simply discarded. But in CDH4.1 it is possible to define the default destination via OTHERWISE as follows:
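A sketch of SPLIT with a default destination (relation names and conditions are illustrative):

```pig
a = LOAD 'input' AS (x:int);
-- records matching no condition land in 'rest' instead of being discarded
SPLIT a INTO small IF x < 3, big IF x > 5, rest OTHERWISE;
```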

Note that this feature introduces a new keyword, OTHERWISE, which may break existing Pig scripts that use it as an alias.

Syntactic Sugar for TOTUPLE, TOBAG, and TOMAP

Syntactic sugar for TOTUPLE, TOBAG, and TOMAP has been added. Now Pig automatically converts ‘( )’, ‘{ }’, and ‘[ ]’ to tuple, bag, and map respectively.
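A sketch of the shorthand forms (the schema is illustrative; TOMAP takes alternating keys and values):

```pig
a = LOAD 'input' AS (k:chararray, v:int);
b = FOREACH a GENERATE (k, v);   -- shorthand for TOTUPLE(k, v)
c = FOREACH a GENERATE {k, v};   -- shorthand for TOBAG(k, v)
d = FOREACH a GENERATE [k, v];   -- shorthand for TOMAP(k, v)
```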

AvroStorage Improvements

AvroStorage now supports path globbing. Since Hadoop’s path globbing is used internally, the syntax is the same as that of Hadoop:
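A sketch of loading with a Hadoop-style glob (the paths are illustrative):

```pig
-- {} matches alternatives and * is a wildcard, as in Hadoop path globbing
a = LOAD '/data/2012/{09,10}/*.avro'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```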

AvroStorage can also load and store Avro record types that have self-referencing fields. The problem with self-referencing fields is that they cannot be converted to a Pig schema, because recursive records are by definition infinitely nested. As a workaround, Pig now converts recursive records to bytearrays when it detects them, which makes it possible to load and store an Avro record type that has self-referencing fields. Here is an example:
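A sketch of a load/store round trip for such a record type (the paths are illustrative):

```pig
-- recursive record fields are mapped to bytearrays, so the schema check
-- must be disabled via the 'no_schema_check' option
recs = LOAD 'input.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');

-- reuse the schema of an existing Avro file via the 'same' option
STORE recs INTO 'output'
      USING org.apache.pig.piggybank.storage.avro.AvroStorage(
          'no_schema_check', 'same', 'input.avro');
```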

Note that a new option, ‘no_schema_check’, is passed to both the load function and the store function. This is necessary because mapping recursive records to bytearrays introduces discrepancies between the Avro and Pig schemas, so the schema check must be disabled or the load and store will fail with an exception during job compilation. For the store function, there are two ways to specify the output schema: either via the ‘schema’ option with a JSON string, or by reusing the schema of an existing Avro file via the ‘same’ option.

There are more new features coming in a future release including but not limited to a date/time data type, the rank function, and a cube operator. These will let you write even more powerful and concise Pig scripts.

Please try out CDH4.1 Pig today. It can be downloaded from the Downloads page.