One Engineer’s Experience with Parcel

We’re very pleased to bring you this guest post from Verisign engineer Benoit Perroud, which is based on his personal experiences with the new “Parcel” binary distribution format in Cloudera Manager 4.5.

Among all the new features released with Cloudera Manager 4.5, Parcel is probably one of the most overlooked, even though it has the potential to become the administrator’s best friend.

Parcel is a new package format to easily distribute CDH or other custom packages to all nodes in a cluster. A parcel is basically a monolithic gzip-compressed tarball file with some additional metadata, bundling one or more components.

In this post we will dig into the Parcel format, explore how it is used in Cloudera Manager, and walk through a concrete example that demonstrates how easy and powerful this new format is. (Note: the Parcel format is most likely subject to change and the description given here is a combination of reverse engineering and help from Philip Langdale of Cloudera. See this FAQ for more background.)

Parcel’s Internals

First, let’s dig into Parcel’s internals.

As previously mentioned, a parcel is a gzip-compressed tarball file with some additional metadata. The parcel is deployed in the $PARCELS_ROOT directory, which is /opt/cloudera/parcels by default.

When the parcel is activated, it relies heavily on alternatives to create symlinks from common Linux folders like /usr/bin, /etc, and /usr/share to the binaries and configuration files inside its own folder.

All the metadata is stored in a mandatory folder called “meta”, which contains two main configuration files: parcel.json and permissions.json. Any additional directory structure can be added without restriction. Common examples include a lib folder containing jar and native libraries, an etc folder containing configuration files, a share folder containing documentation, and so on.
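The layout described above can be sketched in a few lines of Python. This is only an illustration: the parcel name EXAMPLE and version 1.0 are made up, and the versioned top-level folder name is an assumption based on how parcels appear under $PARCELS_ROOT.

```python
import tempfile
from pathlib import Path

def create_parcel_skeleton(root: Path, name: str, version: str) -> Path:
    """Create an empty parcel tree: <name>-<version>/{meta,lib,etc,share}."""
    parcel_dir = root / f"{name}-{version}"
    # "meta" is the only mandatory folder; the others are common conventions.
    for sub in ("meta", "lib", "etc", "share"):
        (parcel_dir / sub).mkdir(parents=True, exist_ok=True)
    # Empty placeholders for the two main metadata files.
    for meta_file in ("parcel.json", "permissions.json"):
        (parcel_dir / "meta" / meta_file).write_text("{}\n")
    return parcel_dir

# Demo in a throwaway directory; EXAMPLE and 1.0 are hypothetical values.
with tempfile.TemporaryDirectory() as tmp:
    skeleton = create_parcel_skeleton(Path(tmp), "EXAMPLE", "1.0")
    skeleton_name = skeleton.name
    layout = sorted(str(p.relative_to(skeleton)) for p in skeleton.rglob("*"))

print(skeleton_name)
for entry in layout:
    print(entry)
```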

Proposed parcel directory structure

parcel.json

The parcel.json file holds a large portion of the parcel’s configuration. It is obviously JSON formatted. The following nonexhaustive properties will be included in the file:


| Attribute name | Attribute type | Description |
|---|---|---|
| name | String | Name of the parcel |
| version | String | Version of the parcel |
| setActiveSymlink | boolean | Tells if parcel activation also creates a symlink without the version in $PARCELS_ROOT. |
| scripts | Object | This object references the scripts executed at activation and at runtime. Scripts listed here are relative to the meta folder. |
| scripts.defines | String | Environment scripts, usually export a couple of environment variables. |
| scripts.alternatives | String | Script to be used to set symlinks (alternatives) when the parcel is activated |
| packages | List of Objects | List the packages installed with this parcel. Note here that the difference between packages and components seems to be for display only in Cloudera Manager. |
| packages.name | String | Name of the package |
| packages.version | String | Version of the package |
| components | List of Objects | List of components installed with this parcel, explicitly listed in Cloudera Manager. |
| components.name | String | Name of the component |
| components.version | String | Version of the component |
| users | Object of Objects | Define the users that will be created by this parcel. The name of the attribute is the username to create. |
| groups | List of Strings | List of groups to create |
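To make the attribute list more concrete, here is a minimal Python sketch that assembles a parcel.json document from those attributes. It is not the content of any real parcel: the name, version, and script file name are hypothetical, and the users and groups entries are left empty because their inner structure is not described here.

```python
import json

def build_parcel_json(name, version, components, defines_script=None,
                      alternatives_script=None, set_active_symlink=True):
    """Return a parcel.json document (as text) using the attributes listed above."""
    scripts = {}
    if defines_script:
        scripts["defines"] = defines_script          # path relative to meta/
    if alternatives_script:
        scripts["alternatives"] = alternatives_script
    entries = [{"name": n, "version": v} for n, v in components]
    doc = {
        "name": name,
        "version": version,
        "setActiveSymlink": set_active_symlink,
        "scripts": scripts,
        # Packages and components carry the same fields; here they are mirrored.
        "packages": entries,
        "components": entries,
        "users": {},    # no system users needed for this hypothetical parcel
        "groups": [],
    }
    return json.dumps(doc, indent=2)

# All values below are made up for illustration.
parcel_json_text = build_parcel_json(
    name="EXAMPLE",
    version="1.0",
    components=[("example-lib", "1.0")],
    defines_script="example_env.sh",
)
print(parcel_json_text)
```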

As a detailed example, the parcel.json file in the CDH-4.2.1 parcel looks like this:

 

permissions.json

The permissions.json file defines custom permissions to apply to given files. The files are relative to the parcel root folder:
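As a rough illustration, a permissions.json document could be generated as follows. The key names user, group, and permissions are an assumption modeled on what Cloudera's own parcels appear to use, and the file path is hypothetical, so check a real parcel before relying on this shape.

```python
import json

# Keys are file paths relative to the parcel root folder.
# The inner field names are assumptions, not a documented schema.
permissions = {
    "lib/example/bin/helper": {       # hypothetical file inside the parcel
        "user": "root",
        "group": "root",
        "permissions": "0755",        # octal mode, kept as a string
    },
}

permissions_json_text = json.dumps(permissions, indent=2)
print(permissions_json_text)
```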

 

Parcel Repository

A parcel is hosted on an HTTP web server, and the remote parcel repository URL is added to Cloudera Manager’s configuration.

Adding a new remote parcel repository in Cloudera Manager’s configuration

The repository must contain a file, manifest.json, which is read by Cloudera Manager. This file contains a timestamp recording the latest update and the list of parcels available in the repository. Finally, the parcel file name needs to include the distro for which it is built: el5 for RHEL 5.x, el6 for RHEL 6.x, precise for Ubuntu 12.04 LTS Precise Pangolin, lucid for Ubuntu 10.04 LTS Lucid Lynx, and so on.
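The distro naming rule can be captured in a small helper. The NAME-VERSION-DISTRO.parcel pattern used here is an assumption inferred from the file names in Cloudera's public repositories, and the helper itself is purely illustrative.

```python
# Distro suffixes listed in the text above.
DISTRO_SUFFIXES = {
    "el5": "RHEL 5.x",
    "el6": "RHEL 6.x",
    "precise": "Ubuntu 12.04 LTS Precise Pangolin",
    "lucid": "Ubuntu 10.04 LTS Lucid Lynx",
}

def parcel_filename(name: str, version: str, distro_suffix: str) -> str:
    """Build a parcel file name, rejecting suffixes not listed above."""
    if distro_suffix not in DISTRO_SUFFIXES:
        raise ValueError(f"unknown distro suffix: {distro_suffix!r}")
    return f"{name}-{version}-{distro_suffix}.parcel"

# EXAMPLE and 1.0 are hypothetical values.
example_filename = parcel_filename("EXAMPLE", "1.0", "el6")
print(example_filename)
```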

 

The Parcel Life-cycle

Once the remote parcel repository is added to Cloudera Manager, the administrator can download the new parcels to Cloudera Manager’s local parcel repository (/opt/cloudera/parcel-repo), from where they can be distributed to all nodes of the cluster.

When the admin decides to distribute the parcel, every node starts downloading the file from the Cloudera Manager server (no need for nodes to have internet access). For large clusters, this will generate significant network traffic, but fortunately the number of concurrent uploads can be defined. Once the file transfer is done, the parcel is untar’ed on the nodes in $PARCELS_ROOT.

No symlink will be created at that point. Symlinks are only created when the cluster is restarted with the parcel activated. And of course, parcels can be removed from all the nodes and deleted from the local repository.
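One way to picture this life-cycle is as a small state machine. The sketch below is a simplification: the state and action names are my own labels for the steps described above, not identifiers used by Cloudera Manager, and transitions not mentioned in the text (such as deactivation) are left out.

```python
from enum import Enum

class ParcelState(Enum):
    AVAILABLE_REMOTELY = "available in the remote repository"
    DOWNLOADED = "in Cloudera Manager's local repository"
    DISTRIBUTED = "unpacked on all nodes, no symlinks yet"
    ACTIVATED = "symlinks created after restart"

# (action, current state) -> next state
TRANSITIONS = {
    ("download", ParcelState.AVAILABLE_REMOTELY): ParcelState.DOWNLOADED,
    ("distribute", ParcelState.DOWNLOADED): ParcelState.DISTRIBUTED,
    ("activate", ParcelState.DISTRIBUTED): ParcelState.ACTIVATED,
    ("remove", ParcelState.DISTRIBUTED): ParcelState.DOWNLOADED,
    ("delete", ParcelState.DOWNLOADED): ParcelState.AVAILABLE_REMOTELY,
}

def next_state(state: ParcelState, action: str) -> ParcelState:
    """Apply an action, refusing steps that skip a stage of the life-cycle."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a parcel that is {state.name}") from None

state = ParcelState.AVAILABLE_REMOTELY
for action in ("download", "distribute", "activate"):
    state = next_state(state, action)
    print(f"{action} -> {state.name}")
```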

Creating Your Own Parcels

One of the biggest pains we experienced with Hadoop at Verisign is LZO compression. Going back to 2009, LZO was the reasonable solution for compressing data with decent decompression speed while keeping the files splittable. Assuming you have LZO-compressed files in your cluster, you want to continue to give your data scientists the ability to process the files. Below is a proposed approach for using a parcel to install LZO on your cluster.

Due to its license (GPL v2+), LZO can’t be embedded into Apache projects and thus needs to be shipped separately. The LZO native libraries need to be distributed to every node of the cluster and added to the java.library.path of the TaskTracker so that mapper and reducer tasks can load them. You could bundle the libraries with your job and hack some code to add the LZO native libraries to java.library.path at runtime (see code snippet 2 below), but starting with such tricks opens the door to non-recommended practices (and remember the broken windows theory). Moreover, if you’re using Hadoop Streaming extensively, you want com.hadoop.mapred.DeprecatedLzoTextInputFormat to be available as the inputformat parameter:
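As a sketch of what such a Hadoop Streaming invocation might look like, the following Python snippet assembles the command line. The jar location, HDFS paths, and the mapper and reducer commands are all placeholders; only the -inputformat class name comes from the text above.

```python
import shlex

LZO_INPUT_FORMAT = "com.hadoop.mapred.DeprecatedLzoTextInputFormat"

def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Return the argument list for a streaming job reading LZO text input."""
    return [
        "hadoop", "jar", streaming_jar,
        "-inputformat", LZO_INPUT_FORMAT,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]

# Every path and command below is a hypothetical placeholder.
cmd = streaming_command(
    streaming_jar="/path/to/hadoop-streaming.jar",
    input_path="/data/logs/*.lzo",
    output_path="/tmp/streaming-out",
    mapper="cat",
    reducer="wc -l",
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the list to subprocess.run(cmd) would launch the job on a machine where the hadoop client is installed; printing it first makes it easy to review.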

 

LZO native and Java libraries should be sent to every node of the cluster, with their directories added to the java.library.path and the classpath of the TaskTracker, respectively. Puppet is a good fit here, but rather than digressing, let’s build a parcel that enables LZO compression on the cluster.

Building Hadoop LZO

The first step consists of building Hadoop LZO. Hadoop LZO is the Java library bundling the LZO native libraries and the JNI interface. It also contains the utilities to index the LZO files, a mandatory step in making them splittable.

Hadoop LZO is not yet compatible with Hadoop 2.0, so a bit of hacking is required to make it work. Or you can simply clone a fork where the modification is already done. Below are the two main changes that have been done to make the code 2.0-compatible:

The org.apache.hadoop.io.compress.Decompressor interface has an additional method, getRemaining(), and org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData() returns an int instead of void.

Code change to be 2.0-compatible

Create the Parcel

The parcel will contain the hadoop-lzo jar library, the native library, and the appropriate metadata.

Parcel content

The parcel.json file will have the following content:

 

And the lzo_env.sh will be the following:

 

Note that no alternatives script is shipped because there is no binary file inside.
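Packaging the directory into a parcel file is then a matter of creating a gzip-compressed tarball. The sketch below uses Python's tarfile module; the directory name HADOOP_LZO-0.0.1, the el6 suffix, and the choice to keep the versioned folder as the top-level archive entry are assumptions for illustration.

```python
import tarfile
import tempfile
from pathlib import Path

def build_parcel(parcel_dir: Path, distro_suffix: str, out_dir: Path) -> Path:
    """Pack parcel_dir into <dirname>-<distro_suffix>.parcel (a gzip tarball)."""
    if not (parcel_dir / "meta" / "parcel.json").is_file():
        raise FileNotFoundError("meta/parcel.json is mandatory")
    target = out_dir / f"{parcel_dir.name}-{distro_suffix}.parcel"
    with tarfile.open(target, "w:gz") as tar:
        # Keep the versioned directory as the top-level entry of the archive.
        tar.add(parcel_dir, arcname=parcel_dir.name)
    return target

# Demo with a throwaway tree; the name and version are made up.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src" / "HADOOP_LZO-0.0.1"
    (src / "meta").mkdir(parents=True)
    (src / "lib").mkdir()
    (src / "meta" / "parcel.json").write_text("{}\n")
    out_dir = Path(tmp) / "out"
    out_dir.mkdir()
    parcel_path = build_parcel(src, "el6", out_dir)
    parcel_name = parcel_path.name
    with tarfile.open(parcel_path, "r:gz") as tar:
        members = sorted(tar.getnames())

print(parcel_name)
print(members)
```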

The complete contents of this parcel can be found here.

Deploying the Parcel

Once the parcel is uploaded to an HTTP web server directory, you need to create the manifest.json file. The timestamp is the epoch time in deci-milliseconds (tenths of a millisecond):

 

Cloudera’s official manifest file is always a great source of inspiration.

Configuring Hadoop via Cloudera Manager

In Cloudera Manager, the configuration steps should be minimal. The only required step is to add the appropriate classes to io.compression.codecs so that the .lzo extension is recognized. To do that, add the following value to the “MapReduce Service Configuration Safety Valve for mapred-site.xml” property:
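For reference, a safety valve entry of this kind is a standard Hadoop property element. The snippet below generates one with Python's ElementTree; the exact list of codecs is an assumption (the stock Hadoop codecs plus the two LZO codecs from hadoop-lzo) and should be adapted to what your cluster already declares.

```python
import xml.etree.ElementTree as ET

# Assumed codec list: standard Hadoop codecs followed by the hadoop-lzo ones.
CODECS = [
    "org.apache.hadoop.io.compress.DefaultCodec",
    "org.apache.hadoop.io.compress.GzipCodec",
    "org.apache.hadoop.io.compress.BZip2Codec",
    "org.apache.hadoop.io.compress.SnappyCodec",
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
]

def property_xml(name: str, value: str) -> str:
    """Render one Hadoop <property> element as text."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")

snippet = property_xml("io.compression.codecs", ",".join(CODECS))
print(snippet)
```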

 

That should do it. Unfortunately, due to a bug in Cloudera Manager versions earlier than 4.6, environment scripts are not executed at component startup, so the HADOOP_CLASSPATH defined in lzo_env.sh is not actually set. The workaround is to add the content of the script to the “MapReduce Service Environment Safety Valve” property:

 

Restart Your Service

We’re done: Simply restart the MapReduce service, and LZO compression will be activated!

Because someone else has often already done the work for you, you can also simply add a remote repository in Cloudera Manager pointing to Cloudera’s official GPL Extras repository at http://archive.cloudera.com/gplextras/parcels/latest/ to get the official HADOOP_LZO parcel (including Impala’s drivers). Or, you can point to http://killerwhile.github.io/parcels-repo/repo where, alongside the Hadoop_lzo package, Elephant-Bird (CDH4-compatible) is also available.

Enjoy!

Benoit Perroud is a Software Engineer at Verisign Inc., developing and scaling the companywide offline data processing platform – based on Hadoop infrastructure. He is also an Apache Committer, NoSQL enthusiast, and frequent speaker at Swiss and other European tech conferences.


Learn More About Parcels

Want to see the power of Parcels in action? Watch our e-learning module on Understanding Parcels to learn the fundamentals of optimizing your Hadoop operations with Parcels. The video includes a step-by-step demo of upgrading CDH and installing Impala, Search, and Hadoop LZO.

1 Response
  • Ujjwal W / March 06, 2014 / 8:48 AM

    Thanks for putting together this detailed article. I noticed that the structure of the meta folder has changed in CDH5 beta 2. The approach and explanation above were valid up to CDH5 beta 1.

    From a first look it seems that alternatives.sh has changed to alternatives.json for creating symlinks.

    Can you explain the new approach or maybe edit this article for a comparison?

    Thanks,
    @wadujj
