How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM

One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?

To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM. The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.

Eclipse is installed on the VM and there is a link on the desktop to start it.

MapReduce and Eclipse

You can run and debug MapReduce code in Eclipse just like any other Java program. However, there are a few differences between running MapReduce in a distributed cluster and in an IDE like Eclipse. When you run MapReduce code in Eclipse, Hadoop runs in a special mode called LocalJobRunner, under which the whole job, map and reduce tasks included, runs in a single JVM (Java Virtual Machine) instead of being spread across several different JVMs. Another difference is that all file paths default to local file paths, not HDFS ones.
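
To make the file path difference concrete, here is a minimal sketch in plain Java. It only imitates, using java.net.URI, how a path without a scheme falls back to whatever the default filesystem is; it does not call any Hadoop API. The class name PathSchemeDemo, the helper effectiveScheme, the input path, and the hdfs://localhost:8020 address are all made up for illustration, and the file:/// default assumes that no cluster configuration is on the Eclipse classpath.

```java
import java.net.URI;

// Illustrative only: mimics how a scheme-less path picks up the scheme of the
// default filesystem. Hypothetical helper, not part of Hadoop.
class PathSchemeDemo {

    // Returns the scheme a path ends up with, given a default filesystem URI.
    static String effectiveScheme(String path, String defaultFs) {
        String scheme = URI.create(path).getScheme();
        // A path such as "input/cards.txt" has no scheme, so the default decides.
        return scheme != null ? scheme : URI.create(defaultFs).getScheme();
    }

    public static void main(String[] args) {
        String localDefault = "file:///";                 // assumed default inside the IDE
        String clusterDefault = "hdfs://localhost:8020";  // assumed NameNode address

        // Same relative path, different default filesystem.
        System.out.println(effectiveScheme("input/cards.txt", localDefault));
        System.out.println(effectiveScheme("input/cards.txt", clusterDefault));

        // A fully qualified URI keeps its own scheme regardless of the default.
        System.out.println(effectiveScheme("hdfs://localhost:8020/user/cloudera/input", localDefault));
    }
}
```

Under those assumptions, a relative path such as input/cards.txt resolves to the local disk when the job runs in Eclipse, while a path that spells out hdfs:// keeps its scheme whatever the default is.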

With those caveats in mind, you can start putting in your breakpoints and debug your MapReduce code like any other Java program.

If you want to clone the same Git project as I do in the screencast, you can find it here. From the terminal, type in:

git clone https://github.com/eljefe6a/UnoExample.git

The project will be cloned to the current directory as a subdirectory.

Note that creating Eclipse projects manually is the easy way to get started.  If you are going to have Hadoop as part of an automated build process, you will want to do this in Maven. In Maven, you can create Eclipse projects — this blog post tells you how. If you want to compile Hadoop from source using Eclipse, this post shows you how.

Conclusion

Whether you want to start writing some MapReduce code or debug existing code, the QuickStart VM will help you do it quickly and easily. This screencast walks you through it and gets you coding in your favorite IDE.

Jesse Anderson is an instructor with Cloudera University.

(Jesse just released a series of screencasts about Hadoop MapReduce. It’s published again by the good people at Pragmatic Programmers. These screencasts are the best way for a beginner to learn about Hadoop — unless they’re sitting in his Cloudera University class!)

18 Responses
  • Jun / August 09, 2013 / 3:15 AM

    How about Mahout with Eclipse in the VM? So many errors, especially in MAHOUT_LOCAL mode.

  • JohnD / August 09, 2013 / 11:19 AM

    Nice demo.

    I used the VirtualBox Cloudera VM published here, but I had to add Eclipse from http://www.eclipse.org/downloads/ since it didn’t appear to be installed. Then I had to fix all the missing library dependencies once the project was imported. It was a great help that you showed how to do that.

    I noticed that there is no Makefile in your project. How would I run this from the CLI, outside of the IDE?

    • Jesse Anderson (@jessetanderson) / August 09, 2013 / 1:18 PM

      The names will vary, but here is how to compile and run from the command line:

      javac -classpath `hadoop classpath` *.java
      java -classpath `hadoop classpath` ClassNameWithMainMethod

    • Jesse Anderson (@jessetanderson) / August 09, 2013 / 5:41 PM

      All versions of the QuickStart VM have Eclipse installed on them. Right now, the shortcut for Eclipse is only on the Desktop and not under Applications -> Programming.

  • Srinivasan / September 19, 2013 / 7:02 AM

    This post (esp. the video) has been the most helpful in understanding the libraries required for compile and run.

  • Abhay / October 24, 2013 / 2:07 AM

    Thanks for this crisp article. My question is about this point: “Another difference is that all file paths default to local file paths, not HDFS ones.”

    I want to use the HDFS file system in CDH4’s Eclipse for debugging. Can I do it? If yes, then how?

  • Jayanthi / November 23, 2013 / 9:14 AM

    I do not see UnoExample.git in the location provided. Can you provide an alternate link?

  • Chris / December 10, 2013 / 8:18 AM

    I got the following error using git clone git@github.com:eljefe6a/UnoExample.git

    Initialized empty Git repository in /home/training/workspaceTest/UnoExample/.git/
    Permission denied (publickey).
    fatal: The remote end hung up unexpectedly

    What do you suggest?

  • Matt Sargent / December 18, 2013 / 6:54 AM

    You state, “To answer that question, I have created a screencast showing you…” I cannot find any link to that screencast. Can you provide it?

  • Justin Kestelyn (@kestelyn) / December 18, 2013 / 10:01 AM

    The screencast is back!

  • Chris / December 21, 2013 / 3:33 PM

    I keep getting the following error in eclipse when I try to run Card, CardDriver:

    Exception in thread "main" java.lang.NoClassDefFoundError: CardDriver
    Caused by: java.lang.ClassNotFoundException: CardDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    Could not find the main class: CardDriver. Program will exit.

    How can I correct this? Thanks, Chris

    • Jesse Anderson (@jessetanderson) / December 22, 2013 / 10:46 AM

      I’m not sure what’s happening there. I’d try doing a clean build. Are you running this directly from Eclipse or from a JAR?

  • Gaurav / January 03, 2014 / 5:48 AM

    I am trying to run the “teragen” program using the YARN framework in CDH 4.5, but with no luck.
    What I have figured out so far:

    1. By default, the Demo VM runs MapReduce as MRv1, which uses the JobTracker and TaskTracker.
    2. You can run either MRv1 or MRv2, but not both.
    3. I disabled MRv1 and configured YARN by increasing the alternatives priority for YARN, then deployed the client configuration (using Cloudera Manager).
    4. I submitted the teragen program and can see from the web UIs that the YARN framework is in action.
    5. But it hangs at Map 0% and Reduce 0%.

    I tried multiple options to no avail.

    Is there a step-by-step guide or doc for executing MapReduce in MRv2 (YARN) mode in CDH 4.5?

    Attached doc with screen shots.

    Reference:
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Managing-Clusters/cmmc_adding_YARN_MRv2.html

    Regards
    Gaurav
