Hadoop/HBase Capacity Planning

Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and the tremendous work that has been done to simplify their installation and use.  This blog post provides guidance on sizing your first Hadoop/HBase cluster.  First, there are significant differences between Hadoop and HBase usage.  Hadoop MapReduce is primarily an analytic tool for running analytic and data-extraction queries over all of your data, or at least a significant portion of them (data is the plural of datum).  HBase is much better suited for real-time read/write/modify access to tabular data.  Both applications are designed for high concurrency and large data sizes.  For a general discussion of Hadoop/HBase architecture and their differences, please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase] or Lars George's blog [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html].  We expect a new edition of Tom White's Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.

The Hadoop core consists of a file system, called HDFS, and the MapReduce implementation that runs computations on top of it.  Since we are talking about data, the first crucial parameter is how much disk space you need across all of the Hadoop nodes to store your data and what compression algorithm you are going to use to store it.  For the MapReduce component, an important consideration is how much computational power you need to process the data and whether the jobs you are going to run on the cluster are CPU- or I/O-intensive.  An example of a CPU-intensive job is image processing, while simple data loading or aggregation is I/O-intensive.  Finally, HBase is mainly memory driven, so we need to consider the data access pattern in your application and how much memory you need so that the HBase nodes do not swap data to disk too often.  Most written data ends up in memstores before it is finally flushed to disk, so you should plan for more memory in write-intensive workloads like web crawling.  A good application for HBase is low-latency, key-based retrieval and storage of semi-structured data, such as web crawls or dimensional data for joining with a DW fact table, particularly if the data need update-time tracking and can easily be grouped into column families.
General Cloudera hardware recommendations are given here.  This blog will focus on more detailed capacity planning issues.

Network

While the subject of network latency, throughput, and bandwidth is very often overlooked when starting out with Hadoop, it is bound to become a limiting factor as your cluster grows.  The nodes in a Hadoop cluster need to be able to communicate with each other with low latency and high throughput, if only to fetch the relevant data.  Moreover, if the nodes are not able to communicate with the master node, the master will assume they are dead and delist them, which leads to an increased load on the rest of the nodes.  Hadoop works with an off-the-shelf TCP/IP network.
Network load depends on the nature of the analytical computations in the cluster.  One simple application that requires a lot of communication between nodes is sorting; in fact, TeraSort is a good test for detecting network issues in the cluster.
A typical configuration is to organize the nodes into racks with a 1GE Top Of Rack (TOR) switch. The racks are typically interconnected by one or more low-latency, high-throughput dedicated Layer-2 10GE core switches.  Many customers are happy with ~40-node clusters that fit into one rack with a typical 48-port switch.  Even if all of your nodes fit into one rack, if you plan to scale beyond it Cloudera recommends going with at least two racks from the start to enforce proper practices and network topology scripting (a sketch of a rack-awareness script is shown below).
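For reference, Hadoop's rack awareness is typically driven by a small script wired up through the topology.script.file.name property in core-site.xml; below is a minimal sketch, with made-up subnets and rack names that would need to match your own layout:
```bash
#!/bin/bash
# Hadoop invokes this script with one or more IP addresses or hostnames and expects
# one rack path per argument on stdout; configure it via topology.script.file.name.
for host in "$@"; do
  case "$host" in
    10.1.1.*) echo "/rack1" ;;          # hypothetical subnet for rack 1
    10.1.2.*) echo "/rack2" ;;          # hypothetical subnet for rack 2
    *)        echo "/default-rack" ;;   # fallback for anything unrecognized
  esac
done
```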
Network problems can manifest themselves indirectly.  A good practical test is to run a network-intensive application like TeraSort, which sorts 10 billion 100-byte records, i.e. roughly 1 TB (the specific parameters can be adjusted to your cluster size), on your cluster.  On a 100-node cluster with quad dual-core CPU hardware the running time should be roughly 10 minutes or less (one of our customers sorted 1TB in 6 minutes on a 76-node cluster; the numbers are likely to go down with new 12-core CPU machines).  If you see “Bad connect ack with firstBadLink”, “Bad connect ack”, “No route to host”, or “Could not obtain block” IO exceptions under heavy load, chances are they are due to a bad network.  Even one slow network card on one of the nodes can slow total job execution by as much as a factor of 3-4, since job completion is limited by the slowest task.  These problems can also manifest themselves as ‘intermittent’ under heavy load, but they usually go away with proper network configuration and tuning.
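A TeraSort run of this size can be kicked off with the stock examples jar; the jar path below is where CDH typically installs it and the HDFS directories are arbitrary, so adjust both for your environment:
```bash
# Generate 10 billion 100-byte rows (~1 TB), then sort them; scale the row count
# down for smaller clusters.
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen 10000000000 /benchmarks/terasort-input
hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output
```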
Network connectivity to outside systems is important for loading data into HDFS and for interoperability.  Some companies prefer to have a dedicated high-bandwidth network for loading the data (as opposed to just using a VLAN).

Memory

HBase is a very memory-hungry application.  Each node in an HBase installation, called a RegionServer (RS), keeps a number of regions, or chunks of your data, in memory (if caching is enabled).  Ideally, the whole table would be kept in memory, but this is not possible with a TB-sized dataset.  Typically, a single RS can handle a few hundred regions of 1-2 GB each (these are configurable parameters).  The number of HBase nodes and the memory requirements should be planned accordingly.  From our experience, the memory requirement is at least 4GB/RS for any decent load, but it depends significantly on your application load and access pattern.
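As a minimal sketch, the RS heap is usually raised in hbase-env.sh; the value below (in MB) is only an assumed starting point for a node serving a few hundred regions, not a universal recommendation:
```bash
# hbase-env.sh: give the HBase daemons a larger heap (value is illustrative).
export HBASE_HEAPSIZE=8000
```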
For Hadoop MapReduce, you want to allocate somewhere between 1GB and 2GB of memory per task, on top of the memory allocated for HBase, for large clusters.  As the cluster grows, you should plan for a slight overhead in both the task memory and the number of simultaneously open TaskTracker connections, controlled by tasktracker.http.threads and mapred.reduce.parallel.copies, to be able to serve more node-to-node connections.
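A hedged illustration of the parameters just mentioned; the values are arbitrary starting points, and the snippet below merely prints the mapred-site.xml properties for reference:
```bash
cat <<'EOF'
<property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
<property><name>tasktracker.http.threads</name><value>80</value></property>
<property><name>mapred.reduce.parallel.copies</name><value>10</value></property>
EOF
```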
Both Hadoop and HBase memory problems manifest themselves as slowness of the whole system, since neither system was designed to rely on swapping.  It is recommended to discourage swapping on HBase nodes (set vm.swappiness to 0 or 5 in /etc/sysctl.conf) and to enable GC logging (add “-Xloggc:/var/log/hbase/gc-hbase.log -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime” to the JVM opts) to look for large GC pauses in the log.  GC pauses longer than 60 seconds can cause an RS to go offline (even worse problems can occur if you run a ZK on the same node and it becomes unresponsive), and even pauses of about one second usually lead to noticeable responsiveness problems.  For the HBase daemons, RS and ZK, Cloudera also recommends switching to the CMS garbage collector (add “-XX:+UseConcMarkSweepGC -XX:-CMSIncrementalMode” to the JVM opts).  There is also work underway to develop pauseless JVMs.
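Here is a minimal sketch of those two adjustments, using the flags from above; the log path and the choice of 0 for swappiness are illustrative:
```bash
# Discourage swapping: persist the setting and apply it immediately.
echo 'vm.swappiness=0' >> /etc/sysctl.conf
sysctl -w vm.swappiness=0

# hbase-env.sh: turn on GC logging and the CMS collector for the HBase daemons.
export HBASE_OPTS="$HBASE_OPTS -Xloggc:/var/log/hbase/gc-hbase.log -verbose:gc \
  -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime \
  -XX:+UseConcMarkSweepGC -XX:-CMSIncrementalMode"
```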
If a Hadoop node is running an HBase RS daemon together with a Hadoop TT daemon, Cloudera recommends reducing the maximum number of map/reduce tasks by configuring the mapred.tasktracker.{map,reduce}.tasks.maximum parameters.  You can start with 1-2 map/reduce tasks per TaskTracker and slowly increase the number until you see a degradation in HBase performance.
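A hedged starting point for such a co-located node, printed here as the mapred-site.xml properties to merge in (the values mirror the 1-2 tasks suggested above):
```bash
cat <<'EOF'
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
EOF
```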
Often, network and memory problems manifest themselves first in ZooKeeper (ZK) [http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A15].  ZK is a distributed lock system and is often called the “canary” of HBase.
vmstat or Ganglia should be used to monitor memory status on the RS nodes.  Some JVM GC information can also be gathered via the metrics interface, accessible through the embedded Jetty server at <hadoop/hbase-web-ui>/metrics, for example http://node:50060/metrics, if this is properly configured in hadoop-metrics.properties.
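For example, the following spot checks cover both: vmstat sampling for swap activity and a pull of the metrics page (the hostname is a placeholder and 50060 is the default TaskTracker web port):
```bash
# Five memory/swap samples at 5-second intervals; non-zero si/so columns mean swapping.
vmstat 5 5
# Fetch the exposed metrics from the node's web UI.
curl http://node:50060/metrics
```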
One should also keep in mind that even if the system does not get OOM exceptions, OS and disk I/O performance may be compromised when the system is low on available memory, since the JVMs are under GC pressure and less memory is available for the OS to buffer I/O (“cached” memory) to speed up other operations.

Disk

First, Hadoop requires at least two locations for storing its files: mapred.local.dir, where MapReduce stores intermediate files, and dfs.data.dir, where HDFS stores the HDFS data (there are other locations as well, like hadoop.tmp.dir, where Hadoop and its components store temporary data).  Both of them can span multiple partitions.  While the two locations can be placed on physically different partitions, Cloudera recommends configuring them across the same set of partitions to maximize disk-level parallelism (this might not be an issue if the number of disks is much larger than the number of cores).
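As an illustration, with three data disks mounted under hypothetical /data/1-3 mount points, both locations would list a directory on each disk (dfs.data.dir lives in hdfs-site.xml, mapred.local.dir in mapred-site.xml); the snippet below just prints the properties for reference:
```bash
cat <<'EOF'
<property><name>dfs.data.dir</name><value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value></property>
<property><name>mapred.local.dir</name><value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value></property>
EOF
```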
The sizing guide for HDFS is very simple: each file has a default replication factor of 3, and you need to leave approximately 25% of the disk space for intermediate shuffle files.  So you need roughly 4x the raw size of the data you will store in HDFS.  However, files are rarely stored uncompressed and, depending on the file content and the compression algorithm, we have on average seen compression ratios of up to 10-20x for text files stored in HDFS.  So the actual raw disk space required is only about 30-50% of the original uncompressed size.  Compression also helps in moving data between different systems, e.g. Teradata and Hadoop.
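A back-of-the-envelope example under assumed figures (10 TB of uncompressed input, 10x compression, 3x replication, 25% shuffle headroom):
```bash
# 10 TB = 10240 GB; divide by compression, multiply by replication and shuffle headroom.
echo $(( 10240 / 10 * 3 * 125 / 100 ))   # => 3840 GB of raw disk, ~37% of the uncompressed size
```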
HBase stores the regions in HFiles.  However, during a major compaction the data for a given region may temporarily double.  In addition to HFile storage, there is a small overhead due to WALs (write-ahead logs), which ideally should be a small portion of the total data size.  Cloudera recommends a 30-50% overhead in terms of free space for HFiles.
While you can run Hadoop MapReduce with only 5-10% of the disk space left, performance will be compromised due to fragmentation.  Disk performance can be up to 77% slower than an “empty disk” due to fragmentation and other issues [http://www.eecs.harvard.edu/vino/fs-perf/papers/keith_a_smith_thesis.pdf].  With a disk more than 80% full you also run the risk of running out of disk space on an individual mount.

CPU

Cloudera recommends a total of 8 or 12 cores per node, and typically one would have the number of cores equal to or slightly larger than the number of spindles.  One would like the total number of mappers and reducers to be the total number of hyperthreads minus 2 (the 2 accounting for the daemons and OS processing), and the ratio of mappers to reducers slightly skewed toward mappers, as the reducers tend to spend more time waiting for the mappers.  The importance of CPU power increases with CPU-intensive jobs and when using more compute-intensive compression like BZip2.
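For example, the slot math for an assumed dual-socket node with 12 physical cores and hyperthreading enabled works out as follows:
```bash
HYPERTHREADS=24                 # assumed: 12 physical cores x 2 hardware threads
echo $(( HYPERTHREADS - 2 ))    # => 22 task slots, e.g. split 14 mappers / 8 reducers
```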
A typical configuration may be found here.

Summary


|  | Network | Memory | Disk | CPU | # of nodes |
|---|---|---|---|---|---|
| HDFS | 1GE TOR, 10GE core |  | 8-10 spindles/node |  | enough nodes to fit the data |
| Hadoop MapReduce | 1GE TOR, 10GE core | 1-2 GB/task | # of spindles = # of cores | 8-12 cores/node, # of tasks = # of hyperthreads – 2 |  |
| HBase | 1GE TOR, 10GE core | at least 4GB/node |  | 8-12 cores/node, reduce # of tasks if running with Hadoop DN/TT | enough nodes to fit all regions and serve requests |