Cloudera’s Support Team Shares Some Basic Hardware Recommendations

Refer to this post (Aug. 28, 2013) for up-to-date recommendations on hardware selection for new Hadoop clusters.

29 Responses
  • Abraham / March 31, 2010 / 12:33 AM

    Alex,

    Great post! Thank you for this overview and for sharing your insights.

    Abraham

  • Joe Stein / April 07, 2010 / 9:26 AM

    This is great!

  • Ygee / June 10, 2010 / 6:29 PM

    Wow, it’s great!!!

    I have one question: why is RAID 0 not recommended for datanodes?

  • Alex Loddengaard / June 14, 2010 / 8:22 AM

    We have benchmarks showing that RAID0 is slower than JBOD for a datanode.

    Thanks. Glad you all like the post!

    Alex
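
For context on the benchmark result above: with RAID 0, reads and writes are striped across every disk, so the array is limited by the speed of the slowest drive, and one failed disk takes out the whole volume. With JBOD, HDFS spreads blocks across independently mounted disks, and each disk runs at its own pace. A minimal sketch of the JBOD side in Java (the mount points are hypothetical; dfs.data.dir is the property name used in hdfs-site.xml in Hadoop 1.x):

    import org.apache.hadoop.conf.Configuration;

    public class JbodConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // One directory per physical disk; no RAID striping underneath,
            // so each spindle serves blocks at its own speed.
            conf.set("dfs.data.dir",
                     "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn");
            System.out.println("datanode dirs: " + conf.get("dfs.data.dir"));
        }
    }

In practice the same property would simply be set in hdfs-site.xml; the point is that HDFS itself distributes blocks across the disks, so no striping layer is needed.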

  • anon / June 16, 2010 / 6:51 PM

    Yeah, if your controller is good quality, RAID 0 should make a really big difference, but I’ve seen references elsewhere advising against it in Hadoop. I’m wondering the same thing.

  • Lekhnath / July 07, 2010 / 1:01 AM

    Wow, great post! Really helpful.

  • Marcello de Sales / August 12, 2010 / 3:05 AM

    This is definitely a good start… Thanks for the great post! Very helpful for those who want to build their own cluster…

    Marcello

  • anon / September 19, 2010 / 11:54 PM

    Alex L. –

    Did you test with an enterprise-class RAID card for the RAID 0 vs. JBOD comparison? Low-end commodity servers often have very poor RAID cards unless specifically chosen, and sometimes even use software RAID, which can be very tricky and not very fast at all.

    If you did use an enterprise-level RAID card, then perhaps RAID introduces something protocol-wise that aggravates HDFS, and HDFS is able to access JBOD at a fundamentally lower level?

  • James / March 29, 2011 / 7:47 PM

    Great post! Thanks.

  • Shai / August 01, 2011 / 10:25 PM

    Hi Alex,

    Thanks for making it so simple…
    For an installation consisting of 100 data nodes, couldn’t it be more efficient, in terms of space, power consumption, and the number of data nodes needed to deliver the performance, to use diskless servers (1U or blade) connected to a good midrange central storage array? That way the storage resources can be shared across all nodes, fewer disks are used, and there is no need to maintain three copies of each block (RAID protection is provided within the array).

  • Kal / September 26, 2011 / 8:27 AM

    Very good post, Thank you

  • thattommyhall / October 11, 2011 / 6:36 PM

    This is great advice, much more up to date than the Machine Sizing page on the Hadoop wiki.

    @SHAI, Hadoop loves cheap raw disk because it is optimised for linear reads and writes; a SAN would be suboptimal here.

  • Web Hosting / November 19, 2011 / 1:52 AM

    I am going to test hosting Hadoop processes on Cloudera very soon…

  • Mike Chikuni / December 29, 2011 / 7:55 AM

    Excellent post. I agree with thattommyhall’s comment: this is better than the Machine Sizing page on the Hadoop wiki.

  • Jake Solis / February 01, 2012 / 11:46 AM

    Alex,

    Have you tried configuring data nodes/task trackers with a single 8-core, 12-core, or 16-core processor? Single-socket motherboards draw less power than dual-socket systems, so this would offer more cores while drawing less power than a typical 2 x quad-core node.

  • Russ Jensen / March 07, 2012 / 10:05 AM

    Hey all-

    I’m a network engineer looking at building the underlying support infrastructure for the customers for whom my firm will be deploying Hadoop. I wanted to point out that although this post mentions 1Gb and 10Gb connections, not all 1Gb and 10Gb connections are the same.

    You need to look at the oversubscription ratios on the ports, the actual switching times, the ASIC architecture (blocking vs. non-blocking, and to what degree), etc.

    My point is simply that you can have the best-designed (from a server perspective) cluster that money can buy, *but*, if you’re trying to use Netgear or Dell switching, you’re not going to be too happy with the results. :-)

    Network design and proper equipment spec’ing is just as important as the hadoop design and hardware.
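
To make the oversubscription point concrete, a back-of-the-envelope sketch (the switch numbers here are hypothetical, not from the post): a top-of-rack switch with 48 x 1Gb server-facing ports but only 4 x 1Gb uplinks is 12:1 oversubscribed, which caps each server’s share of inter-rack bandwidth well below line rate during a shuffle.

    public class Oversubscription {
        public static void main(String[] args) {
            int serverPorts = 48;
            double downlinkGbps = serverPorts * 1.0; // aggregate server-facing bandwidth
            double uplinkGbps = 4 * 1.0;             // aggregate uplink bandwidth
            double ratio = downlinkGbps / uplinkGbps;
            double perServerMbps = uplinkGbps / serverPorts * 1000;
            // Prints: oversubscription 12:1, ~83 Mb/s per server
            System.out.printf("oversubscription %.0f:1, ~%.0f Mb/s per server%n",
                              ratio, perServerMbps);
        }
    }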

  • Gus / March 26, 2012 / 1:37 PM

    Alex,

    What is your opinion on using consumer-grade HDDs for Hadoop nodes? Thanks.

  • Patrick Angeles / March 27, 2012 / 5:45 PM

    Russ,

    Your comment is timely. See this comprehensive post on Hadoop networking by Brad Hedlund.

    http://bradhedlund.com/2012/03/26/considering-10ge-hadoop-clusters-and-the-network/

  • canlı tv / June 21, 2012 / 8:50 PM

    Good sharing, thanks.

  • bubby / July 17, 2012 / 11:42 PM

    Your computation of the namenode memory requirement conflicts with the Y! cluster estimate. Your estimate says that 1GB of RAM is enough to reference 100 million blocks, while Y! says that “To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.”
    http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

    • Jon Zuanich / July 18, 2012 / 7:01 PM

      There have been scalability improvements between versions, and the rule of thumb is just meant to set the right order of magnitude. Actual memory requirements may vary (e.g., due to different file name lengths).
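
A rough reconciliation of the two figures (the per-object byte count is approximate and version-dependent): the namenode keeps every file, directory, and block as an in-heap object at very roughly 150-200 bytes apiece, so 100 million files referencing 200 million blocks is about 300 million objects.

    public class NamenodeHeap {
        public static void main(String[] args) {
            long files  = 100_000_000L;  // from the Y! example
            long blocks = 200_000_000L;
            long bytesPerObject = 200L;  // rough ballpark per inode/block record
            long heapBytes = (files + blocks) * bytesPerObject;
            // Prints: ~55 GB of heap
            System.out.printf("~%d GB of heap%n", heapBytes / (1024L * 1024 * 1024));
        }
    }

That lands in the same tens-of-gigabytes range as the Y! figure, consistent with reading any simpler rule of thumb as fixing only the order of magnitude.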
