How-to: Create a Simple Hadoop Cluster with VirtualBox
Set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
Thanks to Christian Javet for his permission to republish his blog post below!
I wanted to get familiar with the big data world, and decided to test Hadoop. Initially, I used Cloudera’s pre-built virtual machine with its full Apache Hadoop suite pre-configured (called Cloudera QuickStart VM), and gave it a try. It was a really interesting and informative experience. The QuickStart VM is fully functional and you can test many Hadoop services, even though it is running as a single-node cluster.
I wondered what it would take to install a small four-node cluster…
I did some research and I found this excellent video on YouTube presenting a step by step explanation on how to setup a cluster with VMware and Cloudera. I adapted this tutorial to use VirtualBox instead, and this article describes the steps used.
High-level diagram of the VirtualBox VM cluster running Hadoop nodes
The overall approach is simple. We create a virtual machine, we configure it with the required parameters and settings to act as a cluster node (specially the network settings). This referenced virtual machine is then cloned as many times as there will be nodes in the Hadoop cluster. Only a limited set of changes are then needed to finalize the node to be operational (only the hostname and IP address need to be defined).
In this article, I created a 4 nodes cluster. The first node, which will run most of the cluster services, requires more memory (8GB) than the other 3 nodes (2GB). Overall we will allocate 14GB of memory, so ensure that the host machine has sufficient memory, otherwise this will impact your experience negatively.
The prerequisites for this tutorial is that you should have the latest VirtualBox installed (you can download it for free); We will be using the CentOS 6.5 Linux distribution (you can download the CentOS x86_64bit DVD iso image).
Base VM Image creation
Create the reference virtual machine, with the following parameters:
- Bridge network
- Enough disk space (more than 40GB)
- 2 GB of RAM
- Setup the DVD to point to the CentOS iso image
when you install CentOS, you can specify the option ‘expert text’, for a faster OS installation with minimum set of packages.
Perform changes in the following files to setup the network configuration that will allow all cluster nodes to interact.
search example.com nameserver 10.0.1.1
NETWORKING=yes HOSTNAME=base.example.com GATEWAY=10.0.1.1
DEVICE=eth0 ONBOOT=yes PROTO=static IPADDR=10.0.1.200 NETMASK=255.255.255.0
Initialize the network by restarting the network services:
$> chkconfig iptables off $> /etc/init.d/network restart
Installation of VM Additions
You should now update all the packages and reboot the virtual machine:
$> yum update $> reboot
In the VirtualBox menu, select Devices, and then Insert Guest…. This insert a DVD with the iso image of the guest additions in the DVD Player of the VM, mount the DVD with the following commands to access this DVD:
$> mkdir /media/VBGuest $> mount -r /dev/cdrom /media/VBGuest
Follow instructions from this web page.
Setup Cluster Hosts
Define all the hosts in the /etc/hosts file in order to simplify the access, in case you do not have a DNS setup where this can be defined. Obviously add more hosts if you want to have more nodes in your cluster.
10.0.1.201 hadoop1.example.com hadoop1 10.0.1.202 hadoop2.example.com hadoop2 10.0.1.203 hadoop3.example.com hadoop3 10.0.1.204 hadoop4.example.com hadoop4
To also simplify the access between hosts, install and setup SSH keys and defined them as already authorized
$> yum -y install perl openssh-clients $> ssh-keygen (type enter, enter, enter) $> cd ~/.ssh $> cp id_rsa.pub authorized_keys
Modify the ssh configuration file. Uncomment the following line and change the value to no; this will prevent the question when connecting with SSH to the host.
Shutdown and Clone
At this stage, shutdown the system with the following command:
$> init 0
We will now create the server nodes that will be members of the cluster.
in VirtualBox, clone the base server, using the ‘Linked Clone’ option and name the nodes hadoop1, hadoop2, hadoop3 and hadoop4.
For the first node (hadoop1), change the memory settings to 8GB of memory. Most of the roles will be installed on this node, and therefore it is important that it have sufficient memory available.
For every node, proceed with the following operations:
Modify the hostname of the server, change the following line in the file:
Where [n] = 1..4 (up to the number of nodes)
Modify the fixed IP address of the server, change the following line in the file:
Where [n] = 1..4 (up to the number of nodes)
Let’s restart the networking services and reboot the server, so that the above changes takes effect:
$> /etc/init.d/network restart $> init 6
at this stage we have four running virtual machines with CentOS correctly configured.
Four Virtual Machines running on VirtualBox, ready to be setup in the Cloudera cluster.
Install Cloudera Manager on hadoop1
Download and run the Cloudera Manager Installer, which will simplify greatly the rest of the installation and setup process.
$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin $> chmod +x cloudera-manager-installer.bin $> ./cloudera-manager-installer.bin
To continue the installation, you will have to select the Cloudera free license version. You will then have to define which nodes will be used in the cluster. Just enter all the nodes you have defined in the previous steps(e.g. hadoop1.example.com) separated by a space. Click on the “Search” button. You can then used the root password (or the SSH keys you have generated) to automate the connectivty to the different nodes. Install all packages and services onto the 1st node.
Once this is done, you will select additional service components; just select everything by default. The installation will continue and will complete.
Using the Hadoop Cluster
Now that we have an operational Hadoop cluster, there are two main interfaces that you will use to operate the cluster: Cloudera Manager and Hue.
Cloudera Manager homepage, presenting cluster health dashboards
Similarly to Cloudera Manager, you can access the Hue administration site by accessing: http://hadoop1.example.com:8888, where you will be able to access the different services that you have installed on the cluster.
Hue interface, and here more specifically, an Impala saved queries window.
I have been able to create a small Hadoop cluster in probably less than a hour, largely thanks to the Cloudera Manager Installer, which simplifies the installation to the simplest of operation. It is now possible to execute and use the various examples installed on the cluster, as well as understand the interactions between the nodes. Comments and remarks are welcome!