How-to: Configure Eclipse for Hadoop Contributions

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs significantly simplify navigation and debugging of large Java projects like Hadoop. Eclipse is a popular choice thanks to its broad user base and the multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and the various release branches differ in their directory structure, feature set, and the build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

This post covers the following main flavors:

  • The traditional implementation of MapReduce based on the JobTracker/TaskTracker architecture (MR1) running on top of HDFS. Apache Hadoop 1.x and CDH3 releases, among others, capture this setup.
  • A highly-scalable MapReduce (MR2) running over YARN and an improved HDFS 2.0 (Federation, HA, Transaction IDs), captured by Apache Hadoop 2.x and CDH4 releases.
  • Traditional MapReduce running on HDFS-2 — that is, the stability of MR1 running over critical improvements in HDFS-2. CDH4 MR1 ships this configuration.

The table below lists each release along with its build tool and the preferred version:

  Release                    Build Tool (preferred version)
  -------------------------  ------------------------------
  CDH3 (Hadoop 1.x)          Ant (1.8.2)
  CDH4 (Hadoop 2.x) HDFS     Maven (3.0.2)
  CDH4 (Hadoop 2.x) MR2      Maven (3.0.2)
  CDH4 MR1                   Ant (1.8.2)

Other Requirements:

  • Oracle Java 1.6 or later
  • Eclipse (Indigo/Juno)

Setting Up Eclipse

  1. First, we need to set a couple of classpath variables so Eclipse can find the dependencies.
    1. Go to Window -> Preferences.
    2. Go to Java -> Build Path -> Classpath Variables.
    3. Add a new entry with name ANT_PATH and path set to the ant home on your machine, typically /usr/share/ant.
    4. Add another new entry with name M2_REPO and path set to your maven repository, typically $HOME/.m2/repository (e.g. /home/user/.m2/repository).

  2. Hadoop requires tools.jar, which lives under JDK_HOME/lib. Eclipse may not pick it up automatically, so:
    1. Go to Window->Preferences->Installed JREs.
    2. Select the right Java version from the list, and click “Edit”.
    3. In the pop-up, “Add External JARs”, navigate to “JDK_HOME/lib”, and add “tools.jar”.

  3. Hadoop uses a particular formatting style. When contributing to the project, you are required to follow the style guidelines: format Java code with spaces only (no tabs) and indentation set to 2 spaces. To do that:
    1. Go to Window -> Preferences.
    2. Go to Java->Code Style -> Formatter.
    3. Import this Formatter.

    4. It is good practice to format modified code automatically whenever you save a file. To do that, go to Window -> Preferences -> Java -> Editor -> Save Actions, select “Perform the selected actions on save”, “Format source code”, and “Format edited lines”, and de-select “Organize imports”.

  4. For Maven projects, the m2e plugin is very useful. To install it, go to Help -> Install New Software, enter “http://download.eclipse.org/technology/m2e/releases” into the “Work with” box, select the m2e plugins, and install them.


Configuration for Hadoop 1.x / CDH3

  1. Fetch the Hadoop source with Subversion or Git and check out branch-1 or the particular release branch. Alternatively, download a source tarball from the CDH3 releases or Hadoop releases.
  2. Generate Eclipse project information using Ant via command line:
    1. For Hadoop (1.x or branch-1), “ant eclipse”
    2. For CDH3 releases, “ant eclipse-files”
  3. Pull sources into Eclipse:
    1. Go to File -> Import.
    2. Select General -> Existing Projects into Workspace.
    3. For the root directory, navigate to the top directory of the above downloaded source.
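On the command line, the fetch-and-generate steps above look roughly like the following sketch (the GitHub mirror URL and directory name are assumptions; an svn checkout of branch-1 or an unpacked release tarball works equally well):

```shell
# Fetch the Hadoop source and switch to the 1.x branch
git clone https://github.com/apache/hadoop.git hadoop-1
cd hadoop-1
git checkout branch-1

# Generate the Eclipse project files (.project, .classpath)
ant eclipse          # Apache Hadoop 1.x / branch-1
# ant eclipse-files  # use this target for CDH3 releases instead
```

With the project files in place, the import via File -> Import -> Existing Projects into Workspace picks the checkout up directly.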

Configuration for Hadoop 2.x / CDH4 MR2

Apache Hadoop 2.x (branch-2/trunk based) and CDH4.x have the same directory structure and use Maven as the build tool.

  1. Again, fetch the sources with svn/git and check out the appropriate branch, or download a release source tarball (follow CDH Downloads).
  2. Using the m2e plugin we installed earlier:
    1. Navigate to the top level and run “mvn generate-sources generate-test-sources”.
    2. Import project into Eclipse:
      1. Go to File -> Import.
      2. Select Maven -> Existing Maven Projects.
      3. Navigate to the top directory of the downloaded source.

    3. The generated sources (e.g. *Proto.java files produced by protoc) might not be linked automatically and can show up as errors. To fix this, select the project and configure the build path to include the Java files under target/generated-sources and target/generated-test-sources, with the inclusion pattern “**/*.java”.

  3. Without using the m2e plugin:
    1. Generate Eclipse project information using Maven: mvn clean && mvn install -DskipTests && mvn eclipse:eclipse. Note: mvn eclipse:eclipse generates a static .classpath file for Eclipse; this file is not automatically updated as the project or its dependencies change.
    2. Pull sources into Eclipse:
      1. Go to File -> Import.
      2. Select General -> Existing Projects into Workspace.
      3. For the root directory, navigate to the top directory of the above downloaded source.
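As a sketch, the two paths above (with and without m2e) reduce to the following commands, run from the top-level source directory (the Git mirror URL and directory name are assumptions; use your preferred checkout):

```shell
# Fetch the sources and check out the 2.x branch
git clone https://github.com/apache/hadoop.git hadoop-2
cd hadoop-2
git checkout branch-2

# Path A: with the m2e plugin -- generate sources, then import
# via File -> Import -> Maven -> Existing Maven Projects
mvn generate-sources generate-test-sources

# Path B: without m2e -- build once and emit static Eclipse project files,
# then import via General -> Existing Projects into Workspace
mvn clean
mvn install -DskipTests
mvn eclipse:eclipse
```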

Configuration for CDH4 MR1

CDH4 MR1 runs the stable version of MapReduce (MR1) on top of HDFS from the Hadoop 2.x branches, so HDFS and MapReduce must be set up separately.

  1. Follow Steps 1 and 2 of the previous section (Hadoop 2.x).
  2. Download the MR1 source tarball from CDH4 Downloads and untar it into a folder different from the one used in Step 1.
  3. Within the MR1 folder, generate Eclipse project information using Ant via command line (ant eclipse-files).
  4. Configure .classpath using this Perl script so that all classpath entries point to the local Maven repository:
    1. Copy the script to the top-level Hadoop directory.
    2. Run $ perl configure-classpath.pl
  5. Pull sources into Eclipse:
    1. Go to File -> Import.
    2. Select General -> Existing Projects into Workspace.
    3. For the root directory, navigate to the top directory of the above downloaded sources.
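Put together, the MR1 setup might look like this sketch (the tarball name and directory paths are placeholders, and configure-classpath.pl refers to the script linked in Step 4):

```shell
# Untar the MR1 sources into their own directory,
# separate from the Hadoop 2.x checkout of Steps 1-2
mkdir -p ~/src/cdh4-mr1
tar -xzf hadoop-mr1-<version>.tar.gz -C ~/src/cdh4-mr1
cd ~/src/cdh4-mr1/hadoop-*

# Generate the Eclipse project files
ant eclipse-files

# Rewrite .classpath entries to point at the local Maven repository
cp /path/to/configure-classpath.pl .   # copy the linked script here
perl configure-classpath.pl
```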

Happy Hacking!

Karthik Kambatla is a Software Engineer at Cloudera in the scheduling and resource management team and works primarily on MapReduce and YARN.


4 Responses
  • Amith / May 30, 2013 / 9:40 PM

    Follow these steps to create your own eclipse plugin for any hadoop versions
    https://docs.google.com/document/d/1yuZ4IjlquPkmC1zXtCeL4GUNKT1uY1xnS_SCBJHps6A/edit?pli=1

  • RKAirani / December 17, 2013 / 8:30 AM

    Thanks Karthik. It is probably the simplest and best way of working on Hadoop contributions. However, for Hadoop 1.x versions the following two packages may also be required:
    1. sudo apt-get install automake autoconf
    2. sudo apt-get install libtool

    Thanks a lot.

  • nagappa / January 23, 2014 / 2:11 AM

    Is Eclipse configured in the latest Cloudera VM? If not, can it be configured in the VM itself?
