How-to: Deploy Apache Hadoop Clusters Like a Boss

Categories: Hadoop Hardware How-to

Learn how to set up a Hadoop cluster in a way that maximizes your chances of taking Hadoop to production successfully and minimizes ongoing, long-term adjustments.

Previously, we published some recommendations on selecting new hardware for Apache Hadoop deployments. That post covered some important ideas regarding cluster planning and deployment such as workload profiling and general recommendations for CPU, disk, and memory allocations. In this post, we’ll provide some best practices and guidelines for the next part of the implementation process: configuring the machines once they arrive. Between the two posts, you’ll have a great head start toward production-izing Hadoop.

Specifically, we’ll cover some important decisions you must make to ensure network, disks, and hosts are configured correctly. We’ll also explain how disks and services should be laid out to be utilized efficiently and minimize problems as your data sets scale.

Networking: May All Your SYNs Be Forgiven

Hostname Resolution, DNS and FQDNs

A Hadoop Java process such as the DataNode gets the hostname of the host on which it is running and then does a lookup to determine the IP address. It then uses this IP to determine the canonical name as stored in DNS or /etc/hosts. Each host must be able to perform a forward lookup on its own hostname and a reverse lookup using its own IP address. Furthermore, all hosts in the cluster need to resolve other hosts. You can verify that forward and reverse lookups are configured correctly using the Linux host command.

Cloudera Manager uses a quick Python command to test proper resolution.
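The same check can be sketched with Python's standard socket module (this is an illustration of the idea, not necessarily Cloudera Manager's exact probe):

```python
import socket

# Forward lookup: this host's canonical name and the IP it resolves to.
# getfqdn() should return the fully-qualified domain name, not a shortname.
fqdn = socket.getfqdn()
ip = socket.gethostbyname(fqdn)
print(fqdn, ip)
```

If the printed name is not the FQDN, or the lookup fails outright, fix DNS (or /etc/hosts) before installing Hadoop.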

While it is tempting to rely on /etc/hosts for this step, we recommend using DNS instead. DNS is much less error-prone than the hosts file and makes changes easier to implement down the line. Hostnames should be set to the fully-qualified domain name (FQDN); note that FQDNs are required for security features such as TLS encryption and Kerberos. You can verify a host's FQDN with the hostname --fqdn command.

If you do use /etc/hosts, ensure that you are listing them in the appropriate order.
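The expected ordering is IP address first, then the FQDN, then any short aliases, one host per line (the addresses and names below are hypothetical). In particular, the machine's own FQDN should not be mapped to the 127.0.0.1 loopback address.

```
10.1.1.10   foo-1.example.com   foo-1
10.1.1.11   foo-2.example.com   foo-2
```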

Name Service Caching

Hadoop makes extensive use of network-based services such as DNS, NIS, and LDAP. To help weather network hiccups, alleviate stress on shared infrastructure, and improve the latency of name resolution, it can be helpful to enable the name server cache daemon (nscd). nscd caches the results of both local and remote calls in memory, often avoiding a latent round-trip to the network. In most cases you can enable nscd, let it work, and leave it alone. If you’re running Red Hat SSSD, you’ll need to modify the nscd configuration; with SSSD enabled, don’t use nscd to cache passwd, group, or netgroup information.
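With SSSD in the picture, a sketch of the relevant /etc/nscd.conf settings would be the following: leave hosts caching on, and turn the SSSD-managed maps off.

```
enable-cache  passwd    no
enable-cache  group     no
enable-cache  netgroup  no
enable-cache  hosts     yes
```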

Link Aggregation

Also known as NIC bonding or NIC teaming, this refers to combining network interfaces for increased throughput or redundancy. Exact settings will depend on your environment.

There are many different ways to bond interfaces. Typically, we recommend bonding for throughput as opposed to availability, but that tradeoff will depend greatly on the number of interfaces and internal network policies. NIC bonding misconfigurations are one of the top drivers of support cases at Cloudera. We typically recommend bringing up the cluster and verifying that everything works before enabling bonding, which will make it easier to troubleshoot any issues you encounter.

VLAN

VLANs are not required, but they can make things easier from the network perspective. It is recommended to move to a dedicated switching infrastructure for production deployments, as much for the benefit of other traffic on the network as anything else. Then make sure all of the Hadoop traffic is on one VLAN for ease of troubleshooting and isolation.

Operating System (OS)

Cloudera Manager does a good job of identifying known and common issues in the OS configuration, but double-check the following:

IPTables

Some customers disable iptables completely in their initial cluster setup. Doing so makes things easier from an administration perspective, of course, but also introduces some risk. Depending on the sensitivity of data in your cluster, you may wish to enable iptables. Hadoop's numerous ecosystem components communicate over many ports, but our documentation will help you navigate this.
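As an illustration, an iptables rule admitting NameNode RPC traffic (port 8020 by default) from a hypothetical cluster subnet might look like this fragment of /etc/sysconfig/iptables; expect to write many such rules, one set per service port:

```
# 10.1.1.0/24 stands in for your cluster subnet
-A INPUT -s 10.1.1.0/24 -p tcp -m tcp --dport 8020 -j ACCEPT
```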

SELinux

It is challenging to construct an SELinux policy that governs all the different components in the Hadoop ecosystem, and so most of our customers run with SELinux disabled. If you are interested in running SELinux, make sure to verify that it is supported on your OS version. We recommend running in permissive mode initially so that you can capture the audit output and use it to define a policy that meets your needs.
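A sketch of the corresponding /etc/selinux/config (takes effect at boot; running setenforce 0 as root switches a live system to permissive immediately):

```
SELINUX=permissive
SELINUXTYPE=targeted
```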

Swappiness

The traditional recommendation for worker nodes was to set swappiness (vm.swappiness) to 0. However, this behavior changed in newer kernels and we now recommend setting this to 1. (This post has more details.)
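For reference, the corresponding line appended to /etc/sysctl.conf (apply it to the running system with sysctl -p as root):

```
vm.swappiness = 1
```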

Limits

The default file handle limit (aka ulimit) of 1024 on most distributions is likely not set high enough. Cloudera Manager will fix this issue, but if you aren't running Cloudera Manager, be aware of this fact. Cloudera Manager will not alter the limits of users outside of Hadoop's defaults. Nevertheless, it is still beneficial to raise the global limit to 64k.
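A sketch of that global change in /etc/security/limits.conf (the value is illustrative; users must log in again for it to take effect):

```
*    soft    nofile    65536
*    hard    nofile    65536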

Transparent Huge Pages (THP)

Most Linux platforms supported by CDH 5 include a feature called Transparent Huge Page compaction, which interacts poorly with Hadoop workloads and can seriously degrade performance. Red Hat claims versions past 6.4 patched this bug, but there are still remnants that can cause performance issues. We recommend disabling defrag until further testing can be done.

Red Hat/CentOS: /sys/kernel/mm/redhat_transparent_hugepage/defrag
Ubuntu/Debian, OEL, SLES: /sys/kernel/mm/transparent_hugepage/defrag

Remember to add this to your /etc/rc.local file so it persists across reboots.
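For example, the relevant /etc/rc.local line on a Red Hat/CentOS system would look like this (substitute the other path on Ubuntu/Debian, OEL, or SLES):

```
# Disable THP defrag at every boot (Red Hat/CentOS path shown)
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
```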

Time

Make sure you enable NTP on all of your hosts; Hadoop services, and Kerberos in particular, are sensitive to clock skew.

Storage

Properly configuring the storage for your cluster is one of the most important initial steps. Failure to do so correctly will lead to pain down the road as changing the configuration can be invasive and typically requires a complete redo of the current storage layer.

OS, Log Drives and Data Drives

Typical 2U machines come equipped with between 16 and 24 drive bays for dedicated data drives, and some number of drives (usually two) dedicated for the operating system and logs. Hadoop was designed with a simple principle: “hardware fails.”  As such, it will sustain a disk, node, or even rack failure. (This principle really starts to take hold at massive scale but let’s face it: if you are reading this blog, you probably aren’t at Google or Facebook.)

Even at normal-person scale (fewer than 4,000 nodes), Hadoop survives hardware failure like a boss but it makes sense to build in a few extra redundancies to reduce these failures. As a general guideline, we recommend using RAID-1 (mirroring) for OS drives to help keep the data nodes ticking a little longer in the event of losing an OS drive. Although this step is not absolutely necessary, in smaller clusters the loss of one node could lead to a significant loss in computing power.

The other drives should be deployed in a JBOD (“Just a Bunch Of Disks”) configuration with individually mounted ext4 partitions on systems running RHEL6+, Debian 7.x, or SLES11+. In some hardware profiles, individual RAID-0 volumes must be used when a RAID controller is mandatory for that particular machine build. This approach will have the same effect as mounting the drives as individual spindles.

There are some mount options that can be useful. These are covered well in Hadoop Operations and by Alex Moundalexis, but echoed here.

Root Reserved Space

By default, both ext3 and ext4 reserve 5% of the blocks on a given filesystem for the root user. This reserve isn't needed for HDFS data directories, however; you can reduce it to zero when creating the filesystem (mkfs.ext4 -m 0) or afterward (tune2fs -m 0 <device>).

File Access Time

Linux filesystems maintain metadata that records when each file was last accessed—thus, even reads result in a write to disk. This timestamp is called atime and should be disabled on drives configured for Hadoop. Set it via mount option in /etc/fstab:
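A sketch of such an /etc/fstab entry (the device and mount point are hypothetical; note that noatime also implies nodiratime):

```
/dev/sdb1   /data/1   ext4   defaults,noatime   0   0
```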

and apply without a reboot by remounting the filesystem (for example, mount -o remount /data/1).

Directory Permissions

This is a minor point, but you should consider changing the permissions on your directories to 700 before you mount the data drives. That way, if the drives become unmounted, processes writing to these directories will not fill up the OS mount.
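The idea can be sketched as follows; the path used here is a stand-in for a real mount point such as /data/1, which would require root to create.

```shell
# Create the mount point and lock it down before mounting the data drive.
# If the drive later unmounts, non-root writers fail fast instead of
# silently filling the OS partition.
MOUNTPOINT="./data1"           # stand-in for a real path such as /data/1
mkdir -p "$MOUNTPOINT"
chmod 700 "$MOUNTPOINT"
stat -c '%a' "$MOUNTPOINT"     # prints 700
```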

LVM, RAID or JBOD

We are frequently asked whether a JBOD configuration, RAID configuration, or LVM configuration is required. The entire Hadoop ecosystem was created with a JBOD configuration in mind. HDFS is an immutable filesystem that was designed for large file sizes with long sequential reads. This goal plays well with stand-alone SATA drives, as they get the best performance with sequential reads. In summary, whereas RAID is typically used to add redundancy to an existing system, HDFS already has that built in. In fact, using a RAID system with Hadoop can negatively affect performance.

Both RAID-5 and RAID-6 add parity bits into the RAID stripes. These parity bits have to be written and read during standard operations and add significant overhead. Standalone SATA drives will write/read continuously without having to worry about parity bits, since they don't exist. In contrast, HDFS takes advantage of having numerous individual mount points and can allow individual drives/volumes to fail before the node goes down, which is HDFS's not-so-secret sauce for parallelizing I/O. Setting the drives up in RAID-5 or RAID-6 arrays will create a single array, or a couple of very large arrays, of mount points depending on the drive configuration. These RAID arrays will undermine HDFS's built-in data protection, slow down sequential reads, and hurt the data locality of Map tasks.

RAID arrays will also affect other systems that expect numerous mount points. Impala, for example, spins up a thread per spindle in the system, which will perform favorably in a JBOD environment vs. a large single RAID group. For the same reasons, configuring your Hadoop drives under LVM is neither necessary nor recommended.

Deploying Heterogeneously

Many customers purchase new hardware in regular cycles; adding new generations of computing resources makes sense as data volumes and workloads increase. For such environments containing heterogeneous disk, memory, or CPU configurations, Cloudera Manager offers Role Groups, which allow the administrator to specify memory, YARN container, and Cgroup settings per node or per group of nodes.

While Hadoop can certainly run with mixed hardware specs, we recommend keeping worker-node configurations homogeneous if possible. In distributed computing environments, workloads are distributed amongst nodes, and optimizing for local data access is preferred. Nodes configured with fewer computing resources can become a bottleneck, and running with a mixed hardware configuration could lead to wider variation in SLA windows. There are a few things to consider:

  • Mixed spindle configuration – HDFS block placement by default works in a round-robin fashion across all the directories specified by dfs.data.dir. If you have, for example, a node with six 1.2TB drives and six 600GB drives, you will fill up the smaller drives more quickly, leading to volume imbalance. Using the Available Space policy requires additional configuration, and in this scenario I/O bound workloads could be affected as you might only be writing to a subset of your disks. Understand the implications of deploying drives in this fashion in advance. Furthermore, if you deploy nodes with more overall storage, remember that HDFS balances by percentage.
  • Mixed memory configuration – Mixing available memory in worker nodes can be problematic as it does require additional configuration.
  • Mixed CPU configuration – Same concept; jobs can be limited by the slowest CPU, effectively negating the benefit of newer CPUs or additional cores.
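For the mixed-spindle case, switching HDFS to the Available Space policy mentioned above is done in hdfs-site.xml, roughly as follows (the preference fraction shown is the upstream default and is purely illustrative):

```
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```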

It is important to be cognizant of the points above, but remember that Cloudera Manager can help with allocating resources to different hosts, allowing you to easily manage and optimize your configuration.

Cloudera Manager Like A Boss

We highly recommend using Cloudera Manager to manage your Hadoop cluster. Cloudera Manager offers many valuable features to make life much easier. The Cloudera Manager documentation is clear on this, but to stamp out any ambiguity, below are the high-level steps for a production-ready Hadoop deployment with Cloudera Manager.

  1. Set up an external database and pre-create the schemas needed for your deployment.
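As an illustration, pre-creating the Cloudera Manager schema on MySQL looks roughly like this (the database name, user, and password are placeholders; your deployment may need additional databases for the Hive Metastore and other services):

```
CREATE DATABASE scm DEFAULT CHARACTER SET utf8;
GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY 'scm_password';
```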

    (Be sure to use strong, unique passwords for these database accounts!)
  2. Install the cloudera-manager-server and cloudera-manager-daemons packages per documentation.
  3. Run the scm_prepare_database.sh script specific to your database type (for example, /usr/share/cmf/schema/scm_prepare_database.sh mysql scm scm scm_password).
  4. Start the Cloudera Manager Service and follow the wizard from that point forward.

This is the simplest way to install Cloudera Manager and will get you started with a production-ready deployment in under 20 minutes.

You Play It Out: Services Layout Guide

Given a Cloudera Manager-based deployment, the diagrams below present a rational way to lay out service roles across the cluster in most configurations.

In larger clusters (50+ nodes), a move to five management nodes might be required, with dedicated nodes for the ResourceManager and NameNode pairs. Further, it is not uncommon to use an external database for Cloudera Manager, the Hive Metastore, and so on, and additional HiveServer2 or HMS services could be deployed as well.

We recommend 128GB per management node and 256-512GB for worker nodes. Memory is relatively inexpensive and as computation engines increasingly rely on in-memory execution the additional memory will be put to good use.

Diving a little deeper, the following charts depict the appropriate disk mappings to the various service storage components.

We specify the use of an LVM here for the Cloudera Manager databases but RAID 0 is an option, as well.

Conclusion

Setting up a Hadoop cluster is relatively straightforward once armed with the appropriate knowledge. Take the extra time to procure the right infrastructure and configure it correctly from the start. Following the guidelines described above will give you the best chance for success in your Hadoop deployment and you can avoid fussing with configuration, allowing you to focus your time on solving real business problems—like a boss.

Look for upcoming posts on security and resource management best practices.

Jeff Holoman and Kevin O’Dell are System Engineers at Cloudera.


18 responses on “How-to: Deploy Apache Hadoop Clusters Like a Boss”

  1. Alex Lesser

    We are definitely at a crossroads with enterprise hardware in this “software defined” world. As a hardware manufacturer for 20+ years, I know better than most how hardware fares. Clustering groups of commodity machines together is really the best chance organizations have at 99.999% uptime. However, enterprise IT budgets will be bled dry if each node in a cluster is built with redundant everything. I don’t believe we have fully reached a level where sysadmins are willing to let go completely and let software provide the extra redundancy protection. I still have many conversations with IT managers who require redundancy where it may not be absolutely critical. This extra cost eats away at overall project goals, in my view.

    One of my customers put it rather bluntly when he said “Most enterprises treat hardware like the family dog.” He went on to explain that you never want to see that dog get sick or potentially die. Hadoop and other clustering software packages eliminate the need to protect the family dog, and instead deploy hardware that may not be gold plated but is purpose-built for application performance.

  2. Jencir Lee

    Once, the Cloudera Manager Server failed and I had to re-set up everything. Is it possible to enable HA for the Cloudera Manager Server? Would you have any good solution?

    1. Justin Kestelyn (@kestelyn) Post author

      Please post your issue in the CM section at community.cloudera.com; it’s easier to interact there.

  3. Miles Yao

    Can you elaborate on the reason for co-locating Name Node and Resource Manager? Isn’t it introducing bottleneck and SPOF? And shouldn’t Name Node receive the most memory?

  4. Kevin O'Dell

    Miles,

    Great question. The reason they are co-located is that they are both HA. This assumes a multi-rack system (think NN/RM node on rack A, and NN/RM node on rack B) and allows the cluster to survive a rack failure. It is true the NameNode will consume more memory than the ResourceManager, but with modern technology neither consumes enough memory or CPU to warrant its own node (at reasonable scale; you would want to break some of these services out when looking at 1000+ nodes, but that is a different discussion).

  5. Dan Toma

    I’m completely new to Hadoop but, in my very finite wisdom, decided to ride the very steep learning curve of Hadoop and MongoDb for a project in my Distributed Computing class. So far, all I’ve been able to do is install one node of Hadoop on my Mac, a relatively painless – at least, in hindsight – exercise. But to do true distributed computing, I want to install another node of Hadoop, and later grab a data file and do some analytics. I figured MongoDb would help with that, but first I need to install another Hadoop node and link them. I thought of installing two VMs on my Mac, and having each VM store a Hadoop node. I don’t know how – or rather, if – this will work, but some ppl have advised me that Cloudera and AWS are worth investigating for this. I’m starting with Cloudera, and am happy to see a lot of documentation, in general.

    So, my question. If I want real distributed computing, with 2 nodes of Hadoop, could this be accomplished by installing two Cloudera VMs on 1 Mac?

    Sorry for all the rambling.

    Dan

    1. Justin Kestelyn (@kestelyn) Post author

      You’re working too hard. :) If you have an AWS account, your best bet is cloudera.com/live. You can spin up your own distributed cluster there, BYOD, etc.

  6. Keith

    Do you suggest the DFS data drives be mounted at root (/01, /02, /03, etc)? Or perhaps /dfs/01, /dfs/02, /dfs/03, etc …. or does it matter

    1. Robert Marshall

      /u01/dfs, /u02/dfs, /u03/dfs, … Likewise, yarn and impala also get separate directories on each volume: /u01/impala and /u01/yarn, /u02/impala and /u02/yarn, /u03/impala and /u03/yarn, …

  7. Keith

    does one even need to create a partition on the data drives? I can’t see a need to.

    On the Master nodes, is there a reason you wouldn’t use RAID 10 versus JBOD for CM/ZK/JN? (or multiple RAID 1 arrays)

  8. Manohar

    Just curious, how many days would it take to complete a cluster of 5 data nodes and 3 edge nodes? I am just looking for an approximation.
    Thanks
    -Manohar

    1. Kenny

      I have done it in as little as 30 mins on AWS. If you are building on a cloud provider – take a look at Cloudera Director. Otherwise, if you are building in your own data center there will be some manual configurations that will take more time – but in 3-4 days you could easily be running a development cluster.

  9. HooHa

    With the investment needed to create a Hadoop cluster, you’ll need some bucks to even create a 6 node cluster with enough RAM, CPU, and storage.

    You need six physical machines with quad-core Xeon E5s and at least 24TB of storage, plus a 10G switch. You’re looking at a rack that is going to cost you at least $40,000 to even start.

    So many of you want to get involved with Hadoop. You go single node. Then pseudo distributed node. Then the reality hits. You need a fully distributed cluster of high grade enterprise machines.

    Ooops. You spent months learning Hadoop. Now, bend over and open your wallet. This is what separates those who want to tinker with those who have specific business ideas. This is where most engineer types fail because they are so caught up in playing with technology that they don’t know what the hell to do with it.

    It boils down to figuring out if you have a problem to solve using distributed computing. If you don’t, you are wasting your time. You will waste your money too if you get caught up in the hype.

    1. Justin Kestelyn Post author

      Rebuttal:

      “This is where most engineer types fail because they are so caught up in playing with technology that they don’t know what the hell to do with it.”

      Had Doug Cutting and Mike Cafarella followed your advice, Hadoop would not even exist.

    2. Robert Marshall

      You can spin up a 6 node cluster on AWS using Cloudera Director or Vagrant (two different technologies) and run them for an hour or two or as many as you need, and pay for just your needs. If your business use case demands your own bare metal, then, of course, there should be some payback. That’s what planning is all about.

  10. Dimitris P

    Hi all,
    I’m a complete beginner to this, and I’d like some guidance, as I’m trying to test Apache Spot on a testbed, consisting of a small local network (3-5 clients + server). I would like to direct all the network flow to the edge node (ingest), followed by ML and OA procedures, as described in the documentation, in order to monitor the network activity.

    Then I’m planning to use one of the clients of the network as a threat generator by using a tool (probably metasploit) to transmit suspicious packets to the other clients and see if they are detected. If this is successful, I’m planning to create some triggers to activate specific countermeasures determined by the type of each threat.

    What I’m concerned about is the hardware requirements of each of the nodes, as I haven’t found any relevant info. More specifically, I’m planning to pseudo-distribute the resources of one machine (i7-6core, 16gb ram, >1tb hdd and ssd) to deploy the “Hybrid Hadoop” configuration and create nodes as shown here:
    https://github.com/Open-Network-Insight/open-network-insight/wiki/Hybrid%20Hadoop

    So TLDR:
    1. Is my perspective of pseudo-distributing one machine for the nodes a viable solution for this testbed?
    2. Are there any deployment examples of the Apache Spot solution in small-scale testbeds at all?
    3. Is it possible to create “hooks” to each type of suspicious activity that will trigger custom countermeasures or alerts, (apart from those shown in the UI of course)?

    Any help would be appreciated!
    Thank you for your time.

  11. Raja

    1) Which Hyper converged Infra solution provider(S) do you recommend?
    2) Which VM for enterprise do you recommend?