Apache Hadoop and Apache HBase are gaining popularity thanks to their flexibility and the tremendous work that has been done to simplify their installation and use. This post provides guidance on sizing your first Hadoop/HBase cluster. First, note that Hadoop and HBase are used quite differently. Hadoop MapReduce is primarily an analytic tool, used to run analytic and data-extraction queries over all of your data, or at least a significant portion of them (data is the plural of datum). HBase is much better suited for real-time read/write/modify access to tabular data. Both systems are designed for high concurrency and large data sizes. For a general discussion of Hadoop/HBase architecture and their differences, please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase] or Lars George's blog [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We also expect a new edition of Tom White's Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future.
| | Network | Memory | Disk | CPU | # of nodes |
|---|---|---|---|---|---|
| HDFS | 1GE TOR, 10GE core | | 8-10 spindles/node | | enough nodes to fit the data |
| Hadoop MapReduce | 1GE TOR, 10GE core | 1-2 GB/task | # of spindles = # of cores | 8-12 cores/node; # of tasks = # of hyperthreads - 2 | |
| HBase | 1GE TOR, 10GE core | at least 4 GB/node | | 8-12 cores/node; reduce # of tasks if running with Hadoop DN/TT | enough nodes to fit all regions and serve requests |
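To make the "enough nodes to fit the data" guideline concrete, here is a back-of-the-envelope sketch. The 2 TB disk size, the default HDFS replication factor of 3, and the 25% headroom reserved for intermediate MapReduce output are illustrative assumptions, not figures from the table above:

```python
import math

# Rough HDFS capacity sizing: "enough nodes to fit the data".
# Assumptions (illustrative): 2 TB disks, 10 spindles/node per the table,
# the default HDFS replication factor of 3, and 25% headroom kept free
# for MapReduce intermediate output and OS overhead.

def nodes_to_fit(raw_data_tb, disks_per_node=10, disk_tb=2.0,
                 replication=3, headroom=0.25):
    """Estimate how many nodes are needed to hold raw_data_tb of raw data."""
    usable_per_node = disks_per_node * disk_tb * (1 - headroom)
    required_tb = raw_data_tb * replication
    return int(math.ceil(required_tb / usable_per_node))

# 100 TB raw -> 300 TB after 3x replication -> 20 nodes at 15 TB usable each
print(nodes_to_fit(100))  # prints 20
```

Treat the result as a lower bound: it covers storage only, and you may need more nodes to meet MapReduce throughput or HBase request-serving targets.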