How-to: Create a Simple Hadoop Cluster with VirtualBox


(Editor’s note [Aug. 2, 2016]: A multi-cluster option for Docker-based deployment is now available for CDH 5.8 and later.)

Thanks to Christian Javet for his permission to republish his blog post below!

I wanted to get familiar with the big data world, and decided to test Hadoop. Initially, I used Cloudera’s pre-built virtual machine with its full Apache Hadoop suite pre-configured (called Cloudera QuickStart VM), and gave it a try. It was a really interesting and informative experience. The QuickStart VM is fully functional and you can test many Hadoop services, even though it is running as a single-node cluster.

I wondered what it would take to install a small four-node cluster…

I did some research and found an excellent video on YouTube that presents a step-by-step explanation of how to set up a cluster with VMware and Cloudera. I adapted that tutorial to use VirtualBox instead, and this article describes the steps I used.


High-level diagram of the VirtualBox VM cluster running Hadoop nodes

The overall approach is simple. We create a virtual machine and configure it with the parameters and settings it needs to act as a cluster node (especially the network settings). This reference virtual machine is then cloned as many times as there will be nodes in the Hadoop cluster. Only a few changes are then needed to make each node operational: its hostname and IP address must be defined.

In this article, I create a four-node cluster. The first node, which will run most of the cluster services, requires more memory (8GB) than the other three nodes (2GB each). Overall we will allocate 14GB of memory, so make sure the host machine has enough memory; otherwise the cluster will run poorly.
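
The memory budget is simple arithmetic: one 8GB node plus three 2GB nodes. A minimal shell sketch of that sum follows; the function name cluster_memory_gb is made up for illustration, and the host needs extra headroom on top of this total for its own operating system.

```shell
#!/bin/sh
# Sum the guest memory for one large node plus several small ones.
# Arguments: master_gb worker_gb worker_count
cluster_memory_gb() {
    echo $(( $1 + $2 * $3 ))
}

total=$(cluster_memory_gb 8 2 3)
echo "Guest memory to allocate: ${total}GB"   # prints: Guest memory to allocate: 14GB
```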


The prerequisites for this tutorial are the latest VirtualBox (you can download it for free) and the CentOS 6.5 Linux distribution (you can download the CentOS x86_64 DVD ISO image).

Base VM Image creation

VM creation

Create the reference virtual machine, with the following parameters:

  • Bridge network
  • Enough disk space (more than 40GB)
  • 2 GB of RAM
  • Set up the DVD drive to point to the CentOS ISO image
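
The same parameters can also be applied from the command line with VBoxManage instead of the GUI. The sketch below is one possible way to do it, not the method used in the original article: the VM name, disk path, ISO path, host adapter name and the 50GB disk size are all assumptions. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: the VM name, file paths and host adapter below are hypothetical.
VM_NAME="${VM_NAME:-hadoop-base}"
DISK_PATH="${DISK_PATH:-$HOME/hadoop-base.vdi}"
ISO_PATH="${ISO_PATH:-$HOME/CentOS-6.5-x86_64.iso}"
HOST_NIC="${HOST_NIC:-eth0}"      # host interface to bridge to
DRY_RUN="${DRY_RUN:-1}"           # 1 = print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

create_base_vm() {
    run VBoxManage createvm --name "$VM_NAME" --ostype RedHat_64 --register
    # 2 GB of RAM and a bridged network adapter.
    run VBoxManage modifyvm "$VM_NAME" --memory 2048 --nic1 bridged --bridgeadapter1 "$HOST_NIC"
    # 50 GB disk (size is given in MB); VirtualBox 4.x names this subcommand createhd.
    run VBoxManage createmedium disk --filename "$DISK_PATH" --size 51200
    run VBoxManage storagectl "$VM_NAME" --name SATA --add sata
    run VBoxManage storageattach "$VM_NAME" --storagectl SATA --port 0 --device 0 --type hdd --medium "$DISK_PATH"
    # Attach the CentOS ISO to a virtual DVD drive.
    run VBoxManage storagectl "$VM_NAME" --name IDE --add ide
    run VBoxManage storageattach "$VM_NAME" --storagectl IDE --port 0 --device 0 --type dvddrive --medium "$ISO_PATH"
}

create_base_vm
```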

When you install CentOS, you can choose the ‘expert text’ option for a faster OS installation with a minimal set of packages.

Network Configuration

Edit the following files to set up the network configuration that allows all cluster nodes to communicate.
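
The file listings for this step did not survive in this copy of the article. As a rough sketch of what a static CentOS 6 network setup usually involves, the helper below writes three files: ifcfg-eth0, /etc/sysconfig/network and /etc/resolv.conf. The choice of files, the function name write_network_config, the hostname and every address are assumptions; use addresses from the LAN your host is bridged to.

```shell
#!/bin/sh
# Write CentOS 6 style static-network files under a root directory.
# Usage: write_network_config ROOT IP GATEWAY HOSTNAME
# All addresses used below are hypothetical.
write_network_config() {
    root=$1; ip=$2; gateway=$3; host=$4
    mkdir -p "$root/etc/sysconfig/network-scripts"

    cat > "$root/etc/sysconfig/network-scripts/ifcfg-eth0" <<EOF
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=$ip
NETMASK=255.255.255.0
GATEWAY=$gateway
EOF

    cat > "$root/etc/sysconfig/network" <<EOF
NETWORKING=yes
HOSTNAME=$host
GATEWAY=$gateway
EOF

    # Assumes the router also answers DNS queries, as is common on home LANs.
    echo "nameserver $gateway" > "$root/etc/resolv.conf"
}

# Preview in a scratch directory; on the VM itself, pass / as ROOT (as root).
preview=$(mktemp -d)
write_network_config "$preview" 192.168.1.100 192.168.1.1 hadoop-base
cat "$preview/etc/sysconfig/network-scripts/ifcfg-eth0"
```

Writing into a scratch directory first makes it easy to inspect the result before touching the real files under /etc.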






Initialize the network by restarting the network services:
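
The commands for this step are missing here, but a reader comment below quotes chkconfig iptables off from it. The following sketch pairs that with the standard CentOS 6 network restart; treat the exact sequence as an assumption. It prints the commands by default and only runs them with DRY_RUN=0 (as root on the VM).

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

init_network() {
    # Keep the firewall from starting at boot (command quoted by a reader).
    # Reasonable only on an isolated test cluster.
    run chkconfig iptables off
    # Re-read the ifcfg-* files and bring the interfaces back up.
    run service network restart
}

init_network
```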

Installation of VM Additions

You should now update all the packages and reboot the virtual machine:

In the VirtualBox menu, select Devices, and then Insert Guest…. This inserts a DVD with the ISO image of the guest additions into the DVD drive of the VM. Mount the DVD with the following commands to access it:
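
The mount commands themselves are quoted in a reader comment below (mkdir /media/VBGuest, then mount -r /dev/cdrom /media/VBGuest). The sketch here builds on them with a few assumptions: Perl is installed first because another reader reports the installer fails without it, gcc, make and kernel-devel are added because the additions compile kernel modules, and -t iso9660 names the filesystem explicitly. Commands are printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 prints the commands; DRY_RUN=0 executes them as root.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

install_guest_additions() {
    # Build prerequisites (assumed); Perl is needed by the installer.
    run yum -y install perl gcc make kernel-devel
    run mkdir -p /media/VBGuest
    # Name the filesystem explicitly. If mount still asks for a filesystem
    # type, check that the guest additions image is really inserted.
    run mount -r -t iso9660 /dev/cdrom /media/VBGuest
    run sh /media/VBGuest/VBoxLinuxAdditions.run
}

install_guest_additions
```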

Follow instructions from this web page.

Setup Cluster Hosts

Define all the hosts in the /etc/hosts file to simplify access, in case you do not have a DNS setup where they can be defined. Add more hosts if you want more nodes in your cluster.
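
As an illustration, the helper below prints one /etc/hosts line per node, using the hadoop1 to hadoop4 names from the cloning step. The function name render_hosts, the subnet and the starting address are assumptions.

```shell
#!/bin/sh
# Print one /etc/hosts line per node: <subnet>.<first + n - 1> hadoop<n>
# Usage: render_hosts SUBNET FIRST_HOST_NUMBER NODE_COUNT
render_hosts() {
    subnet=$1; first=$2; count=$3
    n=1
    while [ "$n" -le "$count" ]; do
        printf '%s.%d\thadoop%d\n' "$subnet" $(( first + n - 1 )) "$n"
        n=$(( n + 1 ))
    done
}

# Hypothetical addresses; pick free ones on your own LAN.
render_hosts 192.168.1 101 4
# On the VM, as root:  render_hosts 192.168.1 101 4 >> /etc/hosts
```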


Setup SSH

To simplify access between hosts as well, install SSH, generate SSH keys, and define them as already authorized.
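
A minimal sketch of this step, assuming the key is generated for root on the base VM with an empty passphrase. The helper name authorize_key is made up; the openssh-clients package name comes from a reader comment below. Because the base VM is cloned afterwards, every node ends up with the same key pair and the same authorized_keys file.

```shell
#!/bin/sh
# Sketch only. Typical use on the base VM, as root (not run here):
#   yum -y install openssh-clients
#   ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
#   authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"

# Append a public key to authorized_keys unless it is already listed.
# Usage: authorize_key PUBLIC_KEY_FILE SSH_DIR
authorize_key() {
    pub=$1; dir=$2
    mkdir -p "$dir" && chmod 700 "$dir"
    touch "$dir/authorized_keys"
    # -x -F: match the key as a whole, literal line
    grep -qxF -f "$pub" "$dir/authorized_keys" || cat "$pub" >> "$dir/authorized_keys"
    # sshd ignores authorized_keys files with loose permissions.
    chmod 600 "$dir/authorized_keys"
}
```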

Modify the SSH configuration file. Uncomment the following line and change the value to no; this suppresses the confirmation question asked when connecting to a host over SSH.
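
The line in question is not shown in this copy. The sketch below assumes it is StrictHostKeyChecking in the client configuration file /etc/ssh/ssh_config, which is the OpenSSH option that controls the host-key confirmation question. The function name is hypothetical, and turning the check off is only sensible on a throwaway test cluster.

```shell
#!/bin/sh
# Set "StrictHostKeyChecking no" in an ssh client config file.
# Usage: disable_host_key_prompt FILE
disable_host_key_prompt() {
    file=$1
    if grep -Eq '^[#[:space:]]*StrictHostKeyChecking[[:space:]]' "$file"; then
        # Uncomment the existing line and force its value to "no".
        sed -i -E 's/^[#[:space:]]*StrictHostKeyChecking[[:space:]].*/StrictHostKeyChecking no/' "$file"
    else
        # No such line yet: append one.
        echo 'StrictHostKeyChecking no' >> "$file"
    fi
}

# Demonstration on a scratch copy; on the VM the file would be /etc/ssh/ssh_config.
sample=$(mktemp)
printf '# Host *\n#   StrictHostKeyChecking ask\n' > "$sample"
disable_host_key_prompt "$sample"
cat "$sample"
```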


Shutdown and Clone

At this stage, shut down the system with the following command:

We will now create the server nodes that will be members of the cluster.

In VirtualBox, clone the base server, using the ‘Linked Clone’ option, and name the nodes hadoop1, hadoop2, hadoop3 and hadoop4.

For the first node (hadoop1), change the memory settings to 8GB of memory. Most of the roles will be installed on this node, and therefore it is important that it have sufficient memory available.
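
For those who prefer the command line, the cloning and the memory change can be sketched with VBoxManage. This is an alternative to the GUI steps, not the article's own procedure: the base VM name and snapshot name are assumptions, and the snapshot is taken first because VBoxManage creates linked clones from a snapshot. Commands are printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: base VM and snapshot names are hypothetical.
BASE_VM="${BASE_VM:-hadoop-base}"
SNAPSHOT="${SNAPSHOT:-base-ready}"
DRY_RUN="${DRY_RUN:-1}"           # 1 = print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

clone_nodes() {
    # Linked clones are made from a snapshot of the base VM.
    run VBoxManage snapshot "$BASE_VM" take "$SNAPSHOT"
    for n in 1 2 3 4; do
        run VBoxManage clonevm "$BASE_VM" --snapshot "$SNAPSHOT" --options link --name "hadoop$n" --register
    done
    # The first node runs most roles, so raise it to 8 GB (value is in MB).
    run VBoxManage modifyvm hadoop1 --memory 8192
}

clone_nodes
```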

Clones Customization

For every node, proceed with the following operations:

Modify the hostname of the server by changing the following line in the file:


Where [n] = 1..4 (up to the number of nodes)

Modify the fixed IP address of the server by changing the following line in the file:


Where [n] = 1..4 (up to the number of nodes)
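
The file names and lines for these two changes are missing in this copy. Assuming the usual CentOS 6 locations (HOSTNAME in /etc/sysconfig/network and IPADDR in ifcfg-eth0), the per-node edits could be scripted as below. The function name and addresses are hypothetical, and the MAC-address cleanup is an added precaution, since a clone gets a new MAC address and CentOS 6 may otherwise bring the interface up as eth1.

```shell
#!/bin/sh
# Apply the per-node hostname and IP address to a cloned VM.
# Usage: customize_clone ROOT NODE_NUMBER IP
customize_clone() {
    root=$1; n=$2; ip=$3
    sed -i "s/^HOSTNAME=.*/HOSTNAME=hadoop$n/" "$root/etc/sysconfig/network"
    sed -i "s/^IPADDR=.*/IPADDR=$ip/" "$root/etc/sysconfig/network-scripts/ifcfg-eth0"
    # Drop any MAC address pinned to the base VM (assumed precaution).
    sed -i '/^HWADDR=/d' "$root/etc/sysconfig/network-scripts/ifcfg-eth0"
    rm -f "$root/etc/udev/rules.d/70-persistent-net.rules"
}

# On node 2, as root (hypothetical address):
#   customize_clone / 2 192.168.1.102
```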

Let’s restart the networking services and reboot the server, so that the above changes take effect:

At this stage we have four running virtual machines with CentOS correctly configured.

Four virtual machines running on VirtualBox, ready to be set up in the Cloudera cluster.

Install Cloudera Manager on hadoop1

Download and run the Cloudera Manager Installer, which will simplify greatly the rest of the installation and setup process.

Use a web browser and connect to (or if you have not added the hostnames into a DNS or hosts file).

To continue the installation, select the free Cloudera license. You will then define which nodes will be used in the cluster: enter all the nodes you defined in the previous steps, separated by spaces, and click the “Search” button. You can then use the root password (or the SSH keys you generated) to automate connectivity to the different nodes. Install all packages and services onto the first node.

Once this is done, you will select additional service components; just select everything by default. The installation will continue and will complete.

Using the Hadoop Cluster

Now that we have an operational Hadoop cluster, there are two main interfaces that you will use to operate the cluster: Cloudera Manager and Hue.

Cloudera Manager

Use a web browser and connect to (or if you have not added the hostnames into a DNS or hosts file).

Cloudera Manager homepage, presenting cluster health dashboards


Hue

As with Cloudera Manager, you can reach the Hue administration site by accessing:, where you will be able to use the different services that you have installed on the cluster.

Hue interface, and here more specifically, an Impala saved queries window.


I was able to create a small Hadoop cluster in probably less than an hour, largely thanks to the Cloudera Manager Installer, which reduces the installation to the simplest of operations. It is now possible to run the various examples installed on the cluster, and to study the interactions between the nodes. Comments and remarks are welcome!


26 responses on “How-to: Create a Simple Hadoop Cluster with VirtualBox”

  1. Pavel

    Hi, Cristian!

    I am trying to repeat your steps but I am failing at the beginning.
    Please tell a bit more about host gateway configuration.

    Your host gateway is configured to be
    How do I get my host gateway IP address?
    If this host gateway is my physical computer, how can I configure it to act as the gateway in the virtual cluster?
    If this host gateway is special virtual machine – can you give me the link, how I must configure it?

  2. Simon

    Pavel – please read up about setting up bridged mode on VirtualBox. The range of IP addresses you will be using should be in the range of actual platform you are running VirtualBox on: and you should be specifying the same gateway to your router that goes to the internet to get all those packages.

    Great tutorial. A very minor point would be that you need to pull off the ssh private key onto the main host so that the web browser can pick it up.

  3. Chris

    Thank you for an excellent tutorial. I was able to complete this and get the Cloud Manager and Hue running.

    I have a question. I followed and completed this when I was at home, on my home network. I shut all of the VMs off and went to bed. When I got to work, I fired them all up again and found that the CloudManager and Hue URLs are not working. Why might this be? Does it have something to do with the host machine being on a different network?

    Thank you for any insight!

  4. Anthony Bisong

    Chris, the reason the CloudManager and Hue URLs are no longer working is that when you move from one network to another, the IPs change. You should go back to /etc/hosts and update it with your new network IPs.

    Anthony Bisong

  5. John

    Quick question. I am able to complete all the steps up to where I begin installing; in particular, it freezes after I install ZooKeeper and start on HBase. I’m guessing I allocated the memory incorrectly? I’m using a 2 GHz Core i7/8GB MacBook Pro (v 10.9.3). Any ideas?

  6. StephenC

    Hey Christian,

    Followed all the steps but when I get to this section:
    Use a web browser and connect to (or if you have not added the hostnames into a DNS or hosts file).

    Nothing happens for me, I get this error on firefox:

    Firefox can’t establish a connection to the server at

    Any idea why? Should this be done on the base node or Hadoop1 ?

  7. Chris

    Thank you, I was able to get this up and running.

    I am just starting to learn about these technologies and have recently been running Mappers and Reducers written in Java on Hadoop 0202 in stand-alone mode on my Mac.

    I would like to graduate to this environment, but I’m not sure what to do… at all. Where do I go to learn how to use this awesome set up I now have : )

    With the stand alone, I worked on the command line to compile my java and then run hadoop with an input file and a jar.

    Thanks for any pointers, tips, resources!

  8. LeeDog

    Thanks, Christian / Cloudera. Great post.

    Just a couple points:

    Perl must be installed before the guest additions, or it fails. You have to dig into the mentioned log to determine this.

    VirtualBox 4.3 (on Mac OS X with 8GB) wouldn’t let me adjust the first node VM up to 8GB after constructing it initially from 2GB. Will construct base node with 8GB and scale the others back to 2GB.


  9. Andrew Zhang

    The install failed at the end with the error: “can not receive heartbeat from the agent”. I checked the Cloudera doc and it requires this command to work: “host -v -t A hostname”:

    However, my images on the VMWare shows this:

    (root@cchadoop1)\>host -v -t A hostname
    Trying “”
    Trying “cchadoop1”
    Host cchadoop1 not found: 3(NXDOMAIN)
    Received 102 bytes from in 14 ms

    Did you actually check if this works?

    host -v -t A hostname

  10. Cheriat

    Thank you so much for the post. It’s very interesting.
    This is my first test with multiple nodes. I have problems when installing perl and openssh-clients. What worries me is the message that appears when I run the command yum -y install perl. I got the following error message “Could not retrieve mirrorlist http://mirrorlist.centos.or … 14 : PyCURL ERROR 6 – “Couldn’t resolve host ‘'”. I checked all the files and they are all set as you indicated. Did I miss something? Thank you for your help.

  11. Francisco

    I followed all the steps and was able to install CDH5. Now I’m trying to run a MapReduce job from Pentaho, but I’m having trouble with the JobTracker port. I have tried to configure it by adding some lines to the file mapred-site.xml:


    Could you help me please.

  12. Francisco

    Hi Cheriat.
    Your problem might be the IP address you are using for every node. For example, what I did was:
    – my IP address is (the IP of the host)
    – then, my node 1 IP is and so on with the number of nodes that you want.
    – node 2
    – node 3
    – etc.

    After restart, you can check if your config is OK, by doing ping: ping

    Hope this help you.

  13. prassan

    Hi sir,

    I am trying to set up the multi-node cluster using the steps you have provided here, but I am confused by this step:
    mkdir /media/VBGuest
    mount -r /dev/cdrom /media/VBGuest
    While mounting the CD at ‘/media/VBGuest’, it shows the error message ‘You must specify the filesystem type’. Can you please help me with this?

    One more doubt: what is the link ‘FOLLOW THE BELOW PAGE’ after that, and why should I do that?

    Thank you,

  14. lobna

    Thank you very much for this tutorial. I’m a beginner with Hadoop. I followed all the steps, but I have a problem: after installing Cloudera and adding the hosts, it says that the Cloudera Manager agent must have version 4.8.5. I installed version 4.8.5 of CDH but get the same message. Can you please help me?

  15. Stephane

    Hello ,
    Thank you for the nice tutorial.
    I followed the steps, but when installing Cloudera Manager across machines, SSH is reported as not running while it is actually running on the hosts.
    Do you have any clue?
    Thanks in advance.

  16. Alin

    This tutorial is great with some exceptions:
    – VirtualBox for Windows supports only 32-bit OSes, so the final installation of Cloudera is not possible, as it works only on 64-bit hosts; only the Mac version has this feature. So now I am stuck at the last step of the Cloudera Manager installation. Any tips or tricks for this situation?

  17. Shivam Gaur


    I am stuck at the step:

    Initialize the network by restarting the network services:

    When I type:
    chkconfig iptables off

    I get the following error message:
    error reading information on service iptables: No such file or directory

    How do I fix this?

    I’m using CentOS version 7, 64-bit

    1. Ahshan MD

      Shivam, I had the same issue: each time I restarted the network service, it would reset the nameserver IP address set in the /etc/resolv.conf file. To fix this,
      update the resolv.conf file with the appropriate gateway IP address, then stop the NetworkManager service with the commands below and save the settings.
      **Do not use NetworkManager
      1. chkconfig NetworkManager off
      2. service NetworkManager stop
      3. chkconfig network on
      4. service network restart
      5. system-config-network
      Let me know if you have any further issues.

  18. Swapnil Sharma

    Hi! I want to create a cluster for carrying out simulations on COMSOL or Fluent. Can the nodes have different operating systems? I want to use 4gb RAM, Ubuntu 15.10 as main node and 8gb RAM, Virtual Machine’s Ubuntu on windows 10 or windows 8.1 in other nodes. Is it a good idea?
    PS : I am not a computer expert and this is my first time with clusters.

  19. Robert Sachar

    I had similar issues to the other users.
    Simply download the latest Cloudera Manager instead of the link specified in the blog and the issues go away.
    The other steps in the blog are OK.

  20. Yvon Cadieux


    Really good tutorial. I have one question about the network:

    So, I set up the card as Virtual, but for the network config, what should I do if I’m running behind a ROUTER with

    Should I specify this address as my gateway? Or can I give any address, such as:


    Because I have enough cpu and memory to create the cluster, my issue is more about the network !!

    I try few times and node1 can’t ping node2 etc etc …

    Merci, Thanks