How-to: Develop CDH Applications with Maven and Eclipse

Learn how to configure a basic Maven project that can build applications against CDH

Apache Maven is a build automation tool for Java projects. Since nearly the entire Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we'll configure a basic Maven project that can build applications against CDH (Cloudera's Distribution Including Apache Hadoop) binaries.

Maven projects are defined using an XML file called pom.xml, which describes things like the project's dependencies on other modules, the build order, and any other plugins that the project uses. A complete example of the pom.xml described below, which can be used with CDH, is available on GitHub. (To use the example, you'll need at least Maven 2.0 installed.) If you've never set up a Maven project before, you can get a jumpstart by using Maven's quickstart archetype, which generates a small initial project layout. Choose a group ID (typically a top-level package name) and an artifact ID (the name of the project), and execute the following command with the groupId and artifactId arguments filled in:

    mvn archetype:generate \
      -DarchetypeGroupId=org.apache.maven.archetypes \
      -DarchetypeArtifactId=maven-archetype-quickstart \
      -DgroupId=<your-group-id> \
      -DartifactId=<your-project-name>
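
For example, with a hypothetical group ID of com.example and project name my-app, the archetype generates a layout like this:

    my-app/
      pom.xml
      src/main/java/com/example/App.java
      src/test/java/com/example/AppTest.java

The generated pom.xml will look roughly like the following sketch (the exact JUnit version and boilerplate may differ depending on your archetype version):

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                                 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>com.example</groupId>
      <artifactId>my-app</artifactId>
      <version>1.0-SNAPSHOT</version>
      <packaging>jar</packaging>
      <name>my-app</name>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <dependencies>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>3.8.1</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>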

There will be a couple of prompts for information, but you can safely just hit Enter until you see the build succeed. This will create a new directory with the name you chose as the artifact ID. In that directory will be a pom.xml file and a src directory. Since the most important part of a Maven project is the pom.xml, we'll focus on what goes in there. Right now, this pom.xml is a bit minimalistic: it has some high-level project metadata, a properties section, and a dependencies section with a single dependency on the JUnit test framework.

Since we want to use this project for Hadoop development, we need to add some dependencies on the Hadoop libraries. Maven resolves dependencies by downloading JAR files from remote repositories, like the Maven Central Repository, but none of the default repositories include CDH, so we need to add one. The repository is declared in the pom.xml within the top-level project section, like this:

    <repositories>
      <repository>
        <id>cloudera-releases</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </repository>
    </repositories>

This instructs Maven to pull any Hadoop binaries from the Cloudera repository, and now we can declare a dependency on Hadoop JARs. You can find all the Maven dependencies that are available from Cloudera, including Hadoop, Apache HBase, and the rest of the CDH components, in the CDH4 Maven repository documentation (https://ccp.cloudera.com/display/CDH4DOC/Using+the+CDH4+Maven+Repository). To specify a project dependency, add a dependency element to the dependencies section of the pom.xml:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.0.0-mr1-cdh4.0.1</version>
    </dependency>

A project with the above dependency would compile against the CDH4 MapReduce v1 library. In practice, it's good practice to declare the version string as a property, since you are likely to depend on more than one Maven artifact that shares the same version. The property can be declared in the properties section of the pom.xml:

    <properties>
      <hadoop.version>2.0.0-mr1-cdh4.0.1</hadoop.version>
    </properties>

The name chosen for the property can then be referenced in other sections of the pom.xml. So, having specified the hadoop.version property, we can change our hadoop-client dependency to look like this:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

Now, whenever we want to upgrade our code to a new CDH version, we only need to change the version string in one place, at the top of the pom.xml. Since Hadoop requires at least Java 1.6, we should also specify the compiler version for Maven to use by enabling the compiler plugin in the top-level project section:

    <build>
      <pluginManagement>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
              <source>1.6</source>
              <target>1.6</target>
            </configuration>
          </plugin>
        </plugins>
      </pluginManagement>
    </build>
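
At this point the project can already compile code against the CDH client libraries. As a quick smoke test, here is a minimal sketch of a class that exercises the Hadoop filesystem API (the com.example package, the ListDir class name, and its placement under src/main/java are hypothetical, not part of the generated project):

    package com.example;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Lists the contents of a filesystem path, to confirm the hadoop-client classes resolve. */
    public class ListDir {
      // Usage: ListDir <path>
      public static void main(String[] args) throws Exception {
        // Loads core-site.xml, etc. from the classpath; on a cluster this points at HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
          System.out.println(status.getPath());
        }
      }
    }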

This gets us to a point where we've got a fully functional project, and we can build a JAR by running mvn install. However, the JAR that gets built does not contain the project's dependencies within it. This is fine so long as we only require Hadoop dependencies, since the Hadoop daemons include all the Hadoop libraries in their own classpaths. If the Hadoop dependencies are not sufficient, it will be necessary to package the other dependencies into the JAR. We can configure Maven to package a JAR with dependencies by adding the following XML block to the build section:

    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>1.7.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>

When executing the mvn package command, the above declarations instruct Maven to package all the dependencies into the JAR file. However, the JAR now contains all the Hadoop libraries as well, which would conflict with the libraries already on the Hadoop daemons' classpaths. We can indicate to Maven that certain dependencies are needed at compile time, but will be provided to the application at runtime, by augmenting the Hadoop dependencies:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
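
With the Hadoop dependencies marked as provided, the packaged JAR stays slim, and the usual way to run it on a cluster is the hadoop jar command, which picks up the Hadoop libraries from the daemons' classpath at runtime. A sketch, reusing the hypothetical class and artifact names from earlier:

    hadoop jar target/my-app-1.0-SNAPSHOT.jar com.example.ListDir /user/<your-username>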

Remember to add the provided scope only to dependencies that you do not want included in the JAR.

Maven also has very tight integration with a number of IDEs, such as Eclipse, NetBeans IDE, and IntelliJ IDEA. With Eclipse, the integration comes in two forms: by generating Eclipse artifacts through Maven and importing the project into Eclipse, or by using the m2eclipse plugin, which allows you to modify the pom.xml and run Maven builds from within Eclipse. A project can be set up to integrate with Eclipse by adding the following declarations to the plugins section:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-eclipse-plugin</artifactId>
      <version>2.9</version>
      <configuration>
        <projectNameTemplate>
          ${project.artifactId}
        </projectNameTemplate>
        <buildOutputDirectory>
          eclipse-classes
        </buildOutputDirectory>
        <downloadSources>true</downloadSources>
        <downloadJavadocs>false</downloadJavadocs>
      </configuration>
    </plugin>

To generate the files necessary to import the project into Eclipse, run the following command:

    mvn -Declipse.workspace=<eclipse-workspace-path> eclipse:configure-workspace eclipse:eclipse

This command will generate an Eclipse .project file. You can import the project into Eclipse by selecting File -> Import, and then choosing Existing Projects Into Workspace under the General category. Browse to the root directory of the project, and click OK. You should see the available projects listed with checkboxes. Select the projects you want to import, and then click Finish to complete the import. Maven sets up the classpath for the Eclipse project, so all the JARs that you have referenced as dependencies in the pom.xml should show up under Referenced Libraries in the Eclipse project. If you add more dependencies later, the Eclipse files can be regenerated by running the mvn eclipse:eclipse command and refreshing the project in Eclipse.

For more information about Maven, see the documentation at http://maven.apache.org.

Jon Natkins (@nattybnatkins) is a Software Engineer at Cloudera, where he has worked on Cloudera Manager and Hue and has contributed patches to Hive and Hadoop. Prior to Cloudera, Jon was an engineer and database wrangler at Vertica. He holds an Sc.B. in Computer Science from Brown University.

2 Responses
  • Daniel / September 02, 2012 / 7:17 AM

    Great article! I kinda solved the problem of properly configuring maven dependencies just a week before. I didn’t know about the shade plugin though!
    I would love to see a similar article on how to configure both maven and eclipse for developing and building Pig UDFs, because I have a feeling my solutions in that area are kinda hackish…

  • Jon Natkins / September 04, 2012 / 10:50 AM

    Hi Daniel,

    You should be able to follow the same procedure for dealing with Pig UDFs. You just need to make sure you include the appropriate Pig Maven dependencies that you need (the list of available dependencies is here: https://ccp.cloudera.com/display/CDH4DOC/Using+the+CDH4+Maven+Repository)

    My guess is you’ll want:

    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>${pig.version}</version>
      <scope>provided</scope>
    </dependency>

    You can do the same thing with the shade plugin to create an uberjar (but use the provided scope for jars that already exist in the Pig classpath), and use the REGISTER command to pull in the jars for Pig scripts.
