Refer to this post (Aug. 28, 2013) for state-of-the-art recommendations about hardware selection for new Hadoop clusters.
great post! Thank you for this overview and sharing your insights.
This is great!
Wow it’s great!!!
I have one question.
Why is RAID 0 not recommended for datanodes?
We have benchmarks showing that RAID0 is slower than JBOD for a datanode.
Thanks. Glad you all like the post!
Yeah, if your controller is good quality, RAID 0 should make a really big difference, but I’ve seen references elsewhere to not using it in Hadoop. I’m wondering the same thing.
Wow great post! Really helpful
This is definitely a good start… Thanks for the great post! Really very helpful for those who want to build their own cluster…
Alex L. –
Did you test with an enterprise-class RAID card for the RAID 0 vs JBOD test? Often low-end commodity servers will have very poor RAID-cards unless specifically chosen, and sometimes even use software RAID, which can be very tricky and not very fast at all.
If you did use an enterprise-level RAID card, then perhaps there is something protocol wise that RAID introduces that aggravates HDFS – and HDFS is able to fundamentally access JBOD at a lower level?
Great post! thanks
Thanks for making it so simple…
When we look at an installation consisting of 100 data nodes, couldn’t it be more efficient, in terms of space, power consumption, and the number of data nodes needed to provide the performance, to use diskless servers (1U or blade) connected to a good midrange central storage array? That way the storage resources can be shared across all nodes, fewer disks are used, and there is no need to maintain 3 copies of each block (RAID protection within the storage would cover that).
Very good post, Thank you
This is great advice, much more up to date than the Machine Sizing page on the Hadoop wiki.
@SHAI, Hadoop loves cheap raw disk, as it is optimised for linear reads and writes. A SAN would be suboptimal here.
I am going to test very soon the hosting of hadoop processes on cloudera…
Excellent post. I agree with “That Tommy Hall’s” post. This is better than the Machine Sizing page on the Hadoop Wiki.
Have you tried configuring data nodes/ task trackers using a single 8-core, 12-core or 16-core processor? The single socket motherboard servers draw less power than a dual socket system. This solution would offer more cores and draw less power than a typical 2 x quad core node.
I’m a network engineer looking at building the underlying support infrastructure for customers that my firm will be deploying Hadoop for. I wanted to point out that although this blog mentions 1GB and 10GB connections, not all 1GB and 10GB connections are the same.
You need to look at what the oversubscription ratios are on the ports, what the actual switching times are, what the ASIC architecture is (blocking vs. non-blocking, and to what degree), etc.
My point is simply that you can have the best-designed (from a server perspective) cluster that money can buy, *but*, if you’re trying to use Netgear or Dell switching, you’re not going to be too happy with the results. :-)
Network design and proper equipment spec’ing is just as important as the hadoop design and hardware.
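As a rough illustration of the oversubscription point above, here is a minimal sketch in Python. It uses the common definition of the ratio (total server-facing bandwidth divided by total uplink bandwidth); the port counts and speeds are hypothetical examples, not figures from the post or from any particular switch.

```python
def oversubscription_ratio(access_ports: int, access_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Total server-facing bandwidth divided by total uplink bandwidth.

    A result of 1.0 means the uplinks can carry everything the access
    ports can send; larger values mean more contention between racks.
    """
    return (access_ports * access_gbps) / (uplink_ports * uplink_gbps)


# Hypothetical top-of-rack switch: 48 x 1 Gb/s server ports, 2 x 10 Gb/s uplinks.
ratio_1g = oversubscription_ratio(48, 1, 2, 10)

# Hypothetical top-of-rack switch: 48 x 10 Gb/s server ports, 4 x 40 Gb/s uplinks.
ratio_10g = oversubscription_ratio(48, 10, 4, 40)

print(f"1 Gb/s rack:  {ratio_1g:.1f}:1")
print(f"10 Gb/s rack: {ratio_10g:.1f}:1")
```

The ratio only describes port bandwidth; switching latency and whether the ASIC is blocking still have to be checked against the vendor's data sheet.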
What is your opinion about using consumer grade HDDs for Hadoop nodes? Thanks.
Your comment is timely. See this comprehensive post on Hadoop networking by Brad Hedlund.
Good sharing, thanks.
Your computation of the namenode memory requirement conflicts with the Y! cluster estimate. Your estimate says that 1GB of RAM is enough to reference 100 million blocks. Y! says that “To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.”
There have been improvements in scalability between different versions, and the rule of thumb is just meant to set the right order of magnitude. Actual memory requirements may vary (e.g. due to different lengths of file names, etc.).
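To see how far apart the two figures quoted above are, the arithmetic can be sketched as follows. This only divides the quoted RAM by the quoted block counts, taking 1 GB as 10^9 bytes (an assumption); it is not a model of how the namenode actually allocates memory, and the function name is made up for illustration.

```python
def bytes_per_block(ram_gb: float, blocks: int) -> float:
    """RAM implied per referenced block, assuming 1 GB = 10**9 bytes."""
    return ram_gb * 1e9 / blocks


# Figure attributed to the post in the comment above: 1 GB per 100 million blocks.
post_rule = bytes_per_block(1, 100_000_000)

# Figure quoted from Y!: 60 GB for 100 million files referencing 200 million blocks.
yahoo_rule = bytes_per_block(60, 200_000_000)

print(f"post: {post_rule:.0f} bytes per block")
print(f"Y!:   {yahoo_rule:.0f} bytes per block")
print(f"Y! is {yahoo_rule / post_rule:.0f}x the post's figure")
```

Note that the Y! number also covers the 100 million file objects, so attributing all 60 GB to blocks overstates the per-block cost somewhat.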