How-to: Use Vagrant to Set Up a Virtual Hadoop Cluster (For CDH 4)

This guest post comes to us from David Greco, CTO of Eligotech. For a how-to on this subject for CDH 5, see this post.

Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.

Vagrant provides a very easy-to-use, Ruby-based internal DSL that allows the user to define one or more virtual machines together with their configuration parameters. Furthermore, it offers different mechanisms for automatic provisioning: You can use Puppet, Chef, or shell scripts for automating software installation and configuration on the machines defined in the Vagrant configuration file.

So, using Vagrant, it’s possible to define complex virtual infrastructures based on multiple VMs running on your system. Pretty cool, no?

A typical use case for Vagrant is to build working/development environments in a simple and consistent way. At my company, Eligotech, we are developing a product aimed at simplifying the use of Apache Hadoop, and CDH, Cloudera’s open source distribution, is our reference Hadoop distribution. We often need to set up a Hadoop environment on our machines for testing, and we found Vagrant to be a very handy tool for that purpose.

I put together an example of a Vagrant configuration file that you can test for yourself. You’ll need to download and install Vagrant (instructions) and VirtualBox. Once everything has been installed, just copy and paste the text below into a file named Vagrantfile and put it in a directory named, for example, VagrantHadoop. This configuration file assumes you have at least 32 GB of memory on your box; if that’s not the case, you can edit the file to suit your environment (to run fewer slaves, for example, by commenting out some of the slave configurations).

 

This file defines six machines to be assigned the following CDH 4 roles:

  • vm-cluster-node1: This is the master; besides running the Cloudera Manager (CM) master, it should run the namenode, secondary namenode, and jobtracker.
  • vm-cluster-node2: This is a slave; it should run a datanode and a tasktracker.
  • vm-cluster-node3: This is a slave; it should run a datanode and a tasktracker.
  • vm-cluster-node4: This is a slave; it should run a datanode and a tasktracker.
  • vm-cluster-node5: This is a slave; it should run a datanode and a tasktracker.
  • vm-cluster-client: This machine plays the role of gateway for the cluster.
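
As a rough illustration of how six machines like these can be declared, here is a minimal multi-machine sketch. It is not the original configuration file from this post. The hostnames come from the list above; only the addresses 10.211.55.100 and 10.211.55.101 appear on this page (in the comments), so the remaining addresses, the box name precise64, and the NODES table itself are assumptions.

```ruby
# Sketch only: a cut-down multi-machine Vagrantfile, not the original one.
# Addresses other than .100 and .101 are assumed for illustration.
NODES = [
  { name: "vm-cluster-node1",  ip: "10.211.55.100", role: :master  },
  { name: "vm-cluster-node2",  ip: "10.211.55.101", role: :slave   },
  { name: "vm-cluster-node3",  ip: "10.211.55.102", role: :slave   },
  { name: "vm-cluster-node4",  ip: "10.211.55.103", role: :slave   },
  { name: "vm-cluster-node5",  ip: "10.211.55.104", role: :slave   },
  { name: "vm-cluster-client", ip: "10.211.55.105", role: :gateway }
]

# The guard lets plain Ruby load this file (to inspect NODES) when Vagrant
# is not present; under Vagrant the block below runs as usual.
if defined?(Vagrant)
  Vagrant.configure("2") do |config|
    config.vm.box = "precise64" # assumed box name

    NODES.each do |node|
      config.vm.define node[:name] do |machine|
        machine.vm.hostname = node[:name]
        machine.vm.network :private_network, ip: node[:ip]
      end
    end
  end
end
```

Running fewer slaves then comes down to commenting out entries in the table.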

Click here to learn the meaning of the different items in the configuration file. In particular, you can see that the memory size is set differently depending on the provider, either VirtualBox or VMware Fusion. Observe how simple it is to switch between providers for customizing environment-specific things!
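
To make the provider difference concrete, here is a hedged sketch of setting memory for both providers. The helper name set_memory, the memory value, and the box name are assumptions made for illustration; the two provider blocks use the VirtualBox customize call and the VMware vmx hash as I understand them from the Vagrant documentation.

```ruby
# Sketch only: the memory size below is an assumed value, not the original.
MASTER_MEMORY_MB = 4096

# Hypothetical helper: declares the memory setting for both providers.
# Vagrant applies only the block that matches the provider in use.
def set_memory(machine, megabytes)
  machine.vm.provider :virtualbox do |vb|
    # VirtualBox: passed through to VBoxManage modifyvm.
    vb.customize ["modifyvm", :id, "--memory", megabytes.to_s]
  end
  machine.vm.provider :vmware_fusion do |vmw|
    # VMware Fusion: written as a key in the VM's .vmx file.
    vmw.vmx["memsize"] = megabytes.to_s
  end
end

# Guarded so that plain Ruby can load the file without Vagrant installed.
if defined?(Vagrant)
  Vagrant.configure("2") do |config|
    config.vm.box = "precise64" # assumed box name
    config.vm.define "vm-cluster-node1" do |master|
      set_memory(master, MASTER_MEMORY_MB)
    end
  end
end
```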

This Vagrant file does another very important thing: It installs Cloudera Manager automatically on the master node, vm-cluster-node1.
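
The mechanism behind this kind of automatic installation is Vagrant's shell provisioner. The sketch below shows only the wiring: the script body is a placeholder, because the actual commands that download and run the Cloudera Manager installer are not reproduced on this page. The constant name INSTALL_CM and the box name are assumptions.

```ruby
# Sketch only: the script body is a placeholder, not the real install steps.
INSTALL_CM = <<-SCRIPT
echo "Placeholder: download and run the Cloudera Manager installer here"
SCRIPT

# Guarded so that plain Ruby can load the file without Vagrant installed.
if defined?(Vagrant)
  Vagrant.configure("2") do |config|
    config.vm.box = "precise64" # assumed box name

    config.vm.define "vm-cluster-node1" do |master|
      # The inline script runs inside the guest when the machine is provisioned.
      master.vm.provision :shell, inline: INSTALL_CM
    end
  end
end
```

Because the provisioner is attached to the master's definition only, the other machines would not run the script.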

To create the virtual cluster, open a shell and go to the directory holding the Vagrantfile (VagrantHadoop in this example). In that directory, run:

    vagrant up

After a while, depending on how fast your machine is, Vagrant will return control, meaning that all the VMs are up and running.

At this point you are ready to configure your cluster through CM’s web UI via http://vm-cluster-node1:7180.
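
That URL resolves only if your host machine can map the name vm-cluster-node1 to the VM's address, typically through its hosts file. Here is a small sketch that prints candidate hosts-file lines. The master's address, 10.211.55.100, appears in the comments on this post; the second entry and the names CLUSTER_HOSTS and hosts_lines are assumptions for illustration.

```ruby
# Sketch only: prints lines that could be appended to the host's hosts file
# (/etc/hosts on Linux and Mac OS X). Adjust the table to your own Vagrantfile.
CLUSTER_HOSTS = {
  "vm-cluster-node1" => "10.211.55.100",
  "vm-cluster-node2" => "10.211.55.101"
}

# Formats each entry as "<ip> <hostname>", with the address padded to a column.
def hosts_lines(hosts)
  hosts.map { |name, ip| format("%-15s %s", ip, name) }
end

puts hosts_lines(CLUSTER_HOSTS)
```

Without such an entry, the UI should still be reachable by address, for example http://10.211.55.100:7180.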

Have fun!

9 Responses
  • bryantrobbins / April 28, 2013 / 11:11 AM

    This is fantastic. Thank you very much for the Vagrantfile config – I’m sure I’m not the only one out there that had been trying to hack this out myself for a while.

    One thing to note is that the box the configuration depends on (assuming it’s Vagrant’s official “precise64”) needed to be installed manually with this command before I could vagrant up successfully:
    vagrant box add precise64 http://files.vagrantup.com/precise64.box

    Thanks for this!

  • santhikumar / May 08, 2013 / 8:48 AM

    Hi

    I could not get the http://vm-cluster-node1:7180 working .

    What’s the default username / password for these Ubuntu instances to login to each box ?

    Also the VM’s in Virtualbox are greyed out, but status shows as running.

    I’m using Windows 8 OS

    VirtualMachine version : 4.2.12
    Vagrant_1.2.2

    Any sort of help much appreciated.

    Thanks
    Santhi

  • santhikumar / May 10, 2013 / 2:51 AM

    Got the cluster setup working with Vagrant.

    Ubuntu VM login credentials are : vagrant/vagrant

    CM’s web UI login : admin/admin.

    Thanks a lot for the Vagrantfile.

  • david v / June 03, 2013 / 10:28 AM

    I can’t get the link to work on mine :( either..

    http://vm-cluster-node1:7180

    seems like the cluster is running though.

  • Tiny Tim / November 01, 2013 / 12:22 PM

    Could you put the config file up somewhere?

    I think I have some copy/pasta inspired errors:

    There is a syntax error in the following Vagrantfile. The syntax error
    message is reproduced below for convenience:

    VagrantHadoop/Vagrantfile:99: syntax error, unexpected ':', expecting kEND
    master.vm.network :private_network, ip: "10.211.55.100"
    ^
    VagrantHadoop/Vagrantfile:113: syntax error, unexpected ':', expecting kEND
    slave1.vm.network :private_network, ip: "10.211.55.101"

  • Tiny Tim / November 04, 2013 / 12:20 PM

    Nevermind… the version of vagrant from the repo was wicked old!

  • amit jain / November 22, 2013 / 2:28 PM

    Hi,

    Can someone do a quick video on how to build a hadoop cluster using multiple Cloudera Quick Start VMs ? so that we can really learn hadoop well in a distributed manner

    thanks
    -amit

  • cevaris / November 27, 2013 / 11:00 PM

    If you do not set up the hosts file locally, you have to reach Cloudera Manager using the IP address of the master node.

    http://10.211.55.100:7180

  • Mike Luo / December 27, 2013 / 1:38 PM

    Thanks for the great Vagrant script.

    FWIW, I have VirtualBox 4.3.6 and had to turn Intel VT on in my BIOS to get VirtualBox to work.
