Apache Hadoop HA Configuration

Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.

One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep their NameNode up at ContextWeb. – Christophe

Here at ContextWeb, our Apache Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single-point-of-failure issue that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval was of greater importance. We’ve leveraged existing, well-tested components that are commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution that addresses the limitations that currently exist.

While one could conceivably expand the use of these two projects to much deeper levels of protection, the goal of this post is to provide a basic working configuration as a starting point for further experimentation and tuning. There may be variations with regards to what works with your distribution or which requirements your organization has for SLAs and HA standards. These instructions are most relevant to CentOS 5.3 combined with Cloudera’s Distribution for Hadoop, since that’s what we run in our production environment.

Hadoop Environment

Each of our master nodes has the following hardware specifications:

Dell PowerEdge 1950, 2x Quad Core Intel Xeon E5405 CPUs @ 2.0GHz, 16GB RAM, 2x 300GB 15k RPM SAS disks, RAID 1. 100GB of the available drive space is reserved for the DRBD volume, which will contain Hadoop’s data.

We use RAID and redundant hardware capabilities of the servers wherever possible to provide additional security for the master node processes. Each master node is connected to a different switch on our network using multiple network connections. The following diagram represents our production cluster configuration:

[Diagram: production cluster configuration]

HA Setup and Configuration

Our planned hosts:

Node     Hostname            IP Address
1        master1.domain.com  192.168.4.116
2        master2.domain.com  192.168.4.117
Virtual  hadoop.domain.com   192.168.4.115

Properties that will be defined as part of our hadoop-site.xml:

Property            Value                          Notes
dfs.data.dir        /hadoop/hdfs/data              On DRBD replicated volume
dfs.name.dir        /hadoop/hdfs/namenode          On DRBD replicated volume
fs.checkpoint.dir   /mnt/hadoop/secondarynamenode  On NFS
fs.default.name     hdfs://hadoop.domain.com:8020  Shared virtual name
mapred.job.tracker  hadoop.domain.com:8021         Shared virtual name
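The table above corresponds to entries like the following in hadoop-site.xml. This is a sketch: hostnames use our planning values, and the property names are from the Hadoop 0.18/0.20-era configuration format.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- HDFS metadata on the DRBD-replicated volume -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/hdfs/namenode</value>
  </property>
  <!-- dfs.data.dir is read by datanodes; see the discussion in the comments
       below about whether it belongs on the DRBD partition -->
  <property>
    <name>dfs.data.dir</name>
    <value>/hadoop/hdfs/data</value>
  </property>
  <!-- Secondary NameNode checkpoints on NFS -->
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/mnt/hadoop/secondarynamenode</value>
  </property>
  <!-- Clients address the cluster by the shared virtual hostname -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop.domain.com:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop.domain.com:8021</value>
  </property>
</configuration>
```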

With this environment in mind, configuring the HA setup comes down to six parts:

  1. Install Sun JDK 6
  2. Configure networking
  3. Install DRBD and Heartbeat packages
  4. Configure DRBD
  5. Install Hadoop RPMs from Cloudera
  6. Configure Heartbeat

** Except where noted, all procedures should be followed on both nodes. **

1. Install Sun JDK

The only JDK that should be used with Hadoop is Sun’s own. This has been well documented. Grab a copy of the latest JDK from http://java.sun.com/javase/downloads/index.jsp and install it on both of the nodes. At the time of this writing, the latest version is jdk-6u14-linux-x64.

We download the rpm.bin file and complete the installation:

[root@master1 ~]# chmod +x jdk-6u14-linux-x64-rpm.bin
[root@master1 ~]# ./jdk-6u14-linux-x64-rpm.bin

2. Configure Networking

Each of our servers has two embedded gigabit ethernet ports, and we choose to bond them for HA and bandwidth purposes. We use LACP/802.3ad, which also requires changes to the switch configuration to support this mode. If you don’t have LACP enabled switches, or cannot modify their configurations, there are other bonding options available through the driver.

You can read more about Linux network bonding from /usr/share/doc/kernel-doc-2.6.18/Documentation/networking/bonding.txt on your system (requires installation of the package kernel-doc).

The following is an example from our systems.

Edit the file /etc/modprobe.conf:

#/etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 mode=4 miimon=100 lacp_rate=1

Edit the file /etc/sysconfig/network-scripts/ifcfg-eth0:

#/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=static
DHCPCLASS=
ONBOOT=yes

Edit the file /etc/sysconfig/network-scripts/ifcfg-eth1:

#/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=static
DHCPCLASS=
ONBOOT=yes

Edit the file /etc/sysconfig/network-scripts/ifcfg-bond0:

#/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.4.116
NETMASK=255.255.255.0
GATEWAY=192.168.4.1
BONDING_MODULE_OPTS="mode=4 miimon=100 lacp_rate=1"

Finally, reboot the system or restart networking:

[root@master1 ~]# service network restart

We also mount the NFS export at this time. It will be used as the base directory for the secondary namenode:

[root@master1 ~]# echo "filer:/hadoop /mnt/hadoop nfs rsize=65536,wsize=65536,intr,soft,bg 0 0" >> /etc/fstab
[root@master1 ~]# mkdir /mnt/hadoop
[root@master1 ~]# mount /mnt/hadoop
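If you script this step across both masters, it helps to make the fstab edit idempotent so re-running the setup does not duplicate the entry. A minimal sketch of that pattern; for safety, this demo operates on a scratch file, so point FSTAB at /etc/fstab to use it for real:

```shell
# Idempotent fstab append (sketch). FSTAB is a scratch file here;
# substitute /etc/fstab on a real system.
FSTAB=$(mktemp)
ENTRY="filer:/hadoop /mnt/hadoop nfs rsize=65536,wsize=65536,intr,soft,bg 0 0"

add_fstab_entry() {
    # grep -F matches the entry as a fixed string; append only if absent
    grep -qF "$ENTRY" "$FSTAB" || echo "$ENTRY" >> "$FSTAB"
}

add_fstab_entry
add_fstab_entry                     # second call is a no-op
grep -c "filer:/hadoop" "$FSTAB"    # prints 1, not 2
```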

3. Install DRBD and Heartbeat Packages

DRBD (including its kernel module) and Heartbeat are part of the “extras” repository:

[root@master1 ~]# yum -y install drbd82 kmod-drbd82 heartbeat
[root@master1 ~]# chkconfig --add heartbeat

Tip: Sometimes the installation of the Heartbeat package fails on the first try. Just try again; it may work for you the second time.

4. Configure DRBD

Important Note: Before continuing with the DRBD configuration, I highly recommend reading through the documentation and reviewing examples to get a clear understanding of the architecture and intended goals: http://www.drbd.org/docs/about/.

In our kickstart configuration, we reserved 100GB of space in the RAID set; this will be used for our Hadoop data.

The following /etc/drbd.conf file is created on both nodes:


#
# drbd.conf example
#

global { usage-count no; }

resource r0 {
protocol C;
syncer { rate 100M; }
startup { wfc-timeout 0; degr-wfc-timeout 120; }


on master1.domain.com {
device /dev/drbd0;
disk /dev/sda4;
address 192.168.4.116:7788;
meta-disk internal;
}


on master2.domain.com {
device /dev/drbd0;
disk /dev/sda4;
address 192.168.4.117:7788;
meta-disk internal;
}
}

The following tasks are performed on BOTH nodes to set up the DRBD resource and start the DRBD service. (You should be prepared to perform these tasks at the same time, so that the peers locate each other when you start the DRBD service.)

[root@master1 ~]# echo "/dev/drbd0 /hadoop     ext3     defaults,noauto     0 0" >> /etc/fstab
[root@master1 ~]# mkdir /hadoop
[root@master1 ~]# yes | drbdadm create-md r0
[root@master1 ~]# service drbd start

Tip: If the destination partition or disk that you are using for your DRBD volume previously had a file system created on it, you may receive a warning like the following one:

Command 'drbdmeta /dev/drbd0 v08 /dev/sda4 internal create-md' terminated with exit code 40
drbdadm aborting

If this is the case, try zeroing the destination partition disk with dd:

[root@master1 ~]# dd if=/dev/zero of=/dev/sda4
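Zeroing the entire 100GB partition can take a long time. Since the old filesystem signature lives near the start of the device, wiping just the first chunk is usually sufficient. A sketch of that alternative; DISK here is a scratch file so the demo is safe to run, so substitute /dev/sda4 on the real system:

```shell
# Wipe only the start of the device where the old superblock lives
# (sketch; DISK is a scratch file standing in for /dev/sda4).
DISK=$(mktemp)
dd if=/dev/zero of="$DISK" bs=1M count=8 2>/dev/null
wc -c < "$DISK"    # 8388608 bytes written (8 MiB)
```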

After the process is started, the following three commands should be run on the PRIMARY server ONLY:

[root@master1 ~]# drbdadm -- --overwrite-data-of-peer primary r0
[root@master1 ~]# mkfs.ext3 /dev/drbd0
[root@master1 ~]# mount /hadoop

It will take some time to synchronize based on the size of the disk and the “syncer” rate defined in your drbd.conf. You can check the status of this process from /proc/drbd:

[root@master1 ~]# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:17
0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:18440304 nr:0 dw:27072452 dr:18511901 al:11746 bm:12767 lo:14 pe:12 ua:246 ap:1 oos:84438604
[==>.................] sync'ed: 18.0% (82459/100465)M
finish: 0:14:31 speed: 96,904 (77,472) K/sec


[root@master1 ~]# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:17
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:196492540 nr:0 dw:195994668 dr:3687117 al:1355324 bm:68 lo:0 pe:0 ua:0 ap:0 oos:0
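If you want a provisioning script to block until the initial sync completes, the connection state can be scraped from /proc/drbd. A sketch, run here against a captured sample rather than the live file; on a real node, point DRBD_STATUS at /proc/drbd:

```shell
# Extract the DRBD connection state (cs:) for resource 0 (sketch).
# DRBD_STATUS is a captured sample; use /proc/drbd on a live system.
DRBD_STATUS=$(mktemp)
cat > "$DRBD_STATUS" <<'EOF'
version: 8.2.6 (api:88/proto:86-88)
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    [==>.................] sync'ed: 18.0% (82459/100465)M
EOF

drbd_cstate() {
    # On the resource line, the second field is "cs:<state>"
    awk '/^ *0:/ { sub(/^cs:/, "", $2); print $2 }' "$DRBD_STATUS"
}

drbd_cstate    # prints SyncSource; "Connected" once the sync is done
```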

5. Install Hadoop RPMs from Cloudera

We used the web-based configurator provided by Cloudera (https://my.cloudera.com/), which builds an RPM containing repos for your custom configuration and the rest of their distribution. The resulting RPM is then installed on BOTH master nodes.

Important Note: When we defined the hostname in the web-based configurator, we gave the name of the shared virtual hostname that is bound to the VIP as described earlier.

[root@master1 ~]# rpm -ivh cloudera-repo-0.1.0-1.noarch.rpm

Since we’re only installing the master nodes, we skip the datanode and tasktracker RPMs on these hosts:

[root@master1 ~]# yum -y install hadoop hadoop-conf-pseudo hadoop-jobtracker \
> hadoop-libhdfs.x86_64 hadoop-namenode hadoop-native.x86_64 hadoop-secondarynamenode

After this, we install our custom-generated Hadoop config from an RPM named hadoop-conf- plus the cluster name and hostname that we defined in the config generator:

[root@master1 ~]# yum -y install hadoop-conf-prod-hadoop.domain.com
[root@master1 ~]# alternatives --display hadoop
hadoop – status is auto.
link currently points to /etc/hadoop/conf.prod-hadoop.domain.com
/etc/hadoop/conf.empty – priority 10
/etc/hadoop/conf.pseudo – priority 30
/etc/hadoop/conf.prod-hadoop.domain.com – priority 60
Current `best’ version is /etc/hadoop/conf.prod-hadoop.domain.com.

We also disable automatic startup of the Hadoop processes on the node since they will be managed by the Heartbeat package:

[root@master1 ~]# chkconfig hadoop-namenode off
[root@master1 ~]# chkconfig hadoop-secondarynamenode off
[root@master1 ~]# chkconfig hadoop-jobtracker off

On the primary server only, we must start the namenode for the first time and format HDFS:

[root@master1 ~]# chown -R hadoop:hadoop /hadoop
[root@master1 ~]# /sbin/runuser -s /bin/bash - hadoop -c 'hadoop namenode -format'

Finally, dismount the DRBD volume mounted at /hadoop since it will be brought online through Heartbeat:

[root@master1 ~]# umount /hadoop

6. Configure Heartbeat

There are many options available for the Heartbeat configuration. Here, we attempt to show only the basics. While this example should work in most cases, you may wish to extend the configurations to take advantage of other features that the package provides.

There are three key files that we edit to configure the Heartbeat package:

  • /etc/ha.d/ha.cf
  • /etc/ha.d/haresources
  • /etc/ha.d/authkeys

The first, ha.cf, defines the general settings of the cluster. Our example:

## start of ha.cf
debugfile /var/log/ha.debug
logfile /var/log/ha.log
logfacility local0
keepalive 1
initdead 60
deadtime 5
mcast eth0 239.0.0.22 694 1 0
node master1.domain.com master2.domain.com
auto_failback off
## end of ha.cf

Additional parameters and their meanings can be found at http://linux-ha.org/ha.cf/.

The second file we’ll edit, haresources, defines all cluster resources that will fail over from one node to the next. The resources include the shared IP address of the cluster, the DRBD resource "r0" (from /etc/drbd.conf), the file system mount, and the three Hadoop master node initiation scripts that are invoked with the "start" parameter upon failover.

This file must be the same on both nodes in the cluster. Note that the leading host name defines the preferred node. The IPaddr stated in this file will be the virtual IP that we chose in our planning phase. This IP will be brought up on the active node on the aliased interface bond0:0:

## start of haresources
master1.domain.com IPaddr::192.168.4.115
master1.domain.com drbddisk::r0
master1.domain.com Filesystem::/dev/drbd0::/hadoop::ext3::defaults
master1.domain.com hadoop-namenode
master1.domain.com hadoop-secondarynamenode
master1.domain.com hadoop-jobtracker
## end of haresources
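Heartbeat acquires the resources in a haresources group left to right on takeover and releases them right to left, which is exactly the ordering this stack needs: IP first, then DRBD, then the filesystem, then the Hadoop daemons. That semantics can be sketched as:

```shell
# Heartbeat resource ordering (sketch): start left-to-right,
# stop in reverse. RESOURCES mirrors our haresources entries.
RESOURCES="IPaddr drbddisk Filesystem hadoop-namenode hadoop-secondarynamenode hadoop-jobtracker"

takeover() {
    for r in $RESOURCES; do echo "start $r"; done
}

release() {
    # Build the list in reverse, then stop each resource
    reversed=""
    for r in $RESOURCES; do reversed="$r $reversed"; done
    for r in $reversed; do echo "stop $r"; done
}

takeover | head -1    # start IPaddr
release | head -1     # stop hadoop-jobtracker
```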

The last file to be edited, authkeys, should also be the same on both servers. Our example:

## start of authkeys
auth 1
1 sha1 hadoop-master
## end of authkeys

Change permissions on the authkeys file to keep the shared secret protected:

[root@master1 ~]# chmod 0600 /etc/ha.d/authkeys
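Since authkeys is a shared secret, a random key is stronger than a dictionary word like the one in our example. One common way to generate and install it, sketched against a scratch file standing in for /etc/ha.d/authkeys:

```shell
# Generate a random sha1 key for Heartbeat's authkeys (sketch).
# AUTHKEYS is a scratch file; use /etc/ha.d/authkeys on a real node.
AUTHKEYS=$(mktemp)
KEY=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | awk '{print $1}')

cat > "$AUTHKEYS" <<EOF
auth 1
1 sha1 $KEY
EOF
chmod 0600 "$AUTHKEYS"   # keep the shared secret protected

wc -l < "$AUTHKEYS"      # 2 lines: the auth selector and the key
```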

Enable Heartbeat service to start on boot, and start the service on each node:

[root@master1 ~]# chkconfig --add heartbeat
[root@master1 ~]# chkconfig heartbeat on
[root@master1 ~]# service heartbeat start

Setup Complete

At this point, your master node processes should be brought up automatically by Heartbeat. If there are any problems, you should start resolving them by checking /var/log/ha.log for hints on what went wrong. If everything worked, feel free to test your failover by rebooting the master, pulling its power cord, or using whatever your favorite method of simulating a system failure may be. Based on the configuration values from this example, we find that the system takes about 10 seconds to bring everything back online.
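When timing a failover, it helps to poll for the virtual service rather than eyeball the logs. A small sketch; the probe command is a placeholder, so against a live cluster substitute a real check such as a TCP probe of hadoop.domain.com port 8020 or `hadoop fs -ls /`:

```shell
# Poll a probe command until it succeeds or a timeout expires, then
# print the elapsed time (sketch). PROBE defaults to a no-op so the
# demo runs anywhere; substitute a real service check.
PROBE="${PROBE:-true}"
TIMEOUT=60

wait_for_service() {
    start=$(date +%s)
    while ! eval "$PROBE" 2>/dev/null; do
        if [ $(( $(date +%s) - start )) -ge "$TIMEOUT" ]; then
            echo "timed out"
            return 1
        fi
        sleep 1
    done
    echo "up after $(( $(date +%s) - start ))s"
}

wait_for_service
```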

17 Responses
  • Mark Kerzner / July 22, 2009 / 4:42 PM

    So cool! Thank you very much for sharing, will certainly use it when the time comes.

  • Dan Kressin / September 24, 2009 / 2:42 PM

    Question: In step 4 you write “After the process is started, the following two commands should be run on the PRIMARY server ONLY:

    [root@master1 ~]# drbdadm -- --overwrite-data-of-peer primary r0
    [root@master1 ~]# mkfs.ext3 /dev/drbd0
    [root@master1 ~]# mount /hadoop

    You say the next “two” commands and then list three together. Is “mount /hadoop” supposed to be run on BOTH servers or only the PRIMARY? I tried it on both, and the secondary gives the following output:
    secondary:~ # mount /hadoop
    mount: block device /dev/drbd0 is write-protected, mounting read-only
    mount: Wrong medium type
    secondary:~ #

    The result is /hadoop is _not_ mounted on my secondary. Is this expected?

    Thanks!
    -Dan

  • Paul George / September 29, 2009 / 7:13 PM

    @Dan Kressin

    Sorry for the confusion, all THREE commands should be run on the PRIMARY only. The error you received is correct since the drbd layer will prevent the replicated volume from being mounted on more than one system at a time. You should be able to do 'cat /proc/drbd' on both servers at that point and see the Primary/Secondary status after the "st:" field.

  • nishat akhtar / October 02, 2009 / 10:22 PM

    sir,we are trying to achieve clustering used in our project..
    Hope the above documentations help us out…

  • Jeff / March 30, 2010 / 10:46 AM

    The one thing missing from this post is the actual hadoop config info for the namenode and backup namenode. The post simply says you used the cloudera configuration tool to generate a custom config….As we all know this tool has been unavailable for quite some time. So a little more insight as to the actual hdfs/mapred configs would be like the icing on the cake that this post already is.

  • Sean / May 18, 2010 / 7:14 AM

    Thx for posting! Very helpful. I was able to get most everything working. Just a somewhat minor issue when failing over: DRBD can’t release its resource, resulting in what Heartbeat calls an ungraceful shutdown, forcing a reboot. Master2 comes up fine though.

  • Carlos / June 09, 2010 / 7:09 AM

    I am confuse about the picture on this post.
    We are trying to construct out namenode with HA ,
    so we wanna using DRBD + HeartBeat to provide namenode a failover system.
    But in the picture , it looks like the backup namemode have it’s own datanode cluster , why ?
    Do we need two cluster for the HA configuration ??

  • Charlie / August 20, 2010 / 6:29 AM

    Firstly thanks for the post, its very helpful but I have a couple of questions:

    1. You say in the few configuration options that you have specified that dfs.data.dir should be set to a directory on the drbd partition, but from what I understand about this option all it does is tells the data nodes where to store its blocks, is this right? If so, the masters/namenodes do not worry about this option and it should therefore not be set to a directory on the drbd partition.

    2. I have set this up on 4 machines for testing, 2 namenodes with drbd/heartbeat and 2 nodes, of which 1 of them is also acting as the secondary namenode. All is working fine apart from the very big problem of when the namenode is switched over from 1 master to the other, the data stored in hdfs is no longer there, I therefore seem to have 2 different hdfs’s, I can create directories in both and they only appear while that particular master is running. Any’s idea’s why this would happen?

  • Charlie / August 20, 2010 / 7:46 AM

    Scrap question number 2, it was a split-brain with drbd which stopped the sync'ing, :-/

  • PATRICK / August 21, 2010 / 10:11 AM

    Charlie, et al.,

    Charlie – you are right, that appears to be an error in the blog. You do not want to set dfs.data.dir to a drbd partition. I’m also glad that you resolved the split-brain issue.

    That said, we want to clarify our position that we do not officially support nor fully endorse this method of doing NameNode HA. DRBD is hard to get right, and you can easily corrupt your fsimage file with a subtle mistake. Furthermore, the backup NameNode will start up and enter safe-mode (read only) until it has gone through and processed the edit log to reconstitute the up-to-date fsimage in-memory. This process can take several minutes, sometimes over an hour if you let the edit log grow unchecked. You can alleviate this by having the SecondaryNameNode checkpoint more often, and backing up the checkpoints so you have a fairly recent image as a fallback in case of catastrophic failure.

    The community is hard at work at building high-availability natively into HDFS. Until then, our recommendation is to design your application around HDFS’s availability semantics. This might include using Flume for reliable delivery, or having mirrored HDFS systems.

  • nishat akhtar / September 05, 2010 / 12:13 PM

    I am trying to set up hive on my hadoop. I am also able to start hive after mentioning the hadoop path but i am unable to create any tables as every time i am getting the metadata error. Can you please help me out to find the solution.

  • masterpapers / June 29, 2011 / 10:09 AM

    Great job, you’ve helped me so much and I am so glad I have chosen your service.
