This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate.
Thank you to Vinod Vavilapalli and Saumitra Buragohain for contributing to this blog.
This is the second blog in the Hadoop Blog series (part 1, part 3, part 4, part 5). In this blog, we show how Apache Hadoop 3 adds value over Apache Hadoop 2 in four areas: agility and time to market, lower total cost of ownership, scalability and availability, and new use cases.
Everyone is asking: what is the difference between Apache Hadoop 3 and Apache Hadoop 2? What does all this commotion and ruckus mean? What is Hadoop 3 paving the way towards?
Where to start! Hadoop 3 combines the efforts of hundreds of contributors over the five years since Hadoop 2 launched. Several of these committers work at Hortonworks.
Let’s start with the top value propositions of Hadoop 3 and how they can help your organization.
Agility & Time to Market
Although Hadoop 2 uses containers, Hadoop 3 containerization brings the agility and package isolation of Docker. A container-based service makes it possible to build apps quickly and roll them out in minutes, which means faster time to market for services.
Total Cost of Ownership
Hadoop 2 has much higher storage overhead than Hadoop 3. For example, in Hadoop 2, storing 6 blocks with 3x replication takes up 18 blocks of space.
With erasure coding in Hadoop 3, the same 6 blocks occupy 9 blocks of space: 6 data blocks plus 3 parity blocks. Instead of the 3x hit on storage, erasure coding incurs a 1.5x footprint while maintaining the same level of data recoverability. Storage overhead drops from 200% to 50%, which halves the storage cost of HDFS while retaining data durability.
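The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch of the block counting described above, not HDFS code: the function names are made up for illustration, and it assumes a simplified layout in which every group of up to 6 data blocks gets 3 parity blocks.

```python
def replication_footprint(data_blocks: int, replicas: int = 3) -> int:
    """Total blocks stored when every data block is kept `replicas` times."""
    return data_blocks * replicas


def erasure_coding_footprint(data_blocks: int, data_units: int = 6,
                             parity_units: int = 3) -> int:
    """Total blocks stored when each group of up to `data_units` data blocks
    gets `parity_units` parity blocks (simplified layout, an assumption)."""
    groups = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + groups * parity_units


def overhead_percent(total_blocks: int, data_blocks: int) -> float:
    """Extra storage beyond the raw data, as a percentage of the raw data."""
    return 100.0 * (total_blocks - data_blocks) / data_blocks


if __name__ == "__main__":
    data = 6
    replicated = replication_footprint(data)
    coded = erasure_coding_footprint(data)
    print(f"3x replication: {replicated} blocks, "
          f"{overhead_percent(replicated, data):.0f}% overhead")
    print(f"6 data + 3 parity: {coded} blocks, "
          f"{overhead_percent(coded, data):.0f}% overhead")
```

For 6 data blocks, the two footprints come out to 18 and 9 blocks, which is where the 200% and 50% overhead figures come from.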
Scalability & Availability
Hadoop 1 and Hadoop 2 use only a single NameNode to manage all namespaces. Hadoop 3 supports multiple NameNodes for multiple namespaces through NameNode Federation, which improves scalability.
In Hadoop 2, there is only one standby NameNode. Hadoop 3 supports multiple standby NameNodes. If one standby node goes down over the weekend, the other standby NameNodes are still available, so the cluster can continue to operate. This feature gives you a longer servicing window.
Hadoop 2 uses an older timeline service that has scalability issues. Hadoop 3 introduces Timeline Service v2, which improves the scalability and reliability of the timeline service.
New Use Cases
Hadoop 2 doesn’t support GPUs. Hadoop 3 enables scheduling of additional resource types, such as disks and GPUs, for better integration with containers, deep learning, and machine learning. This feature provides the basis for supporting GPUs in Hadoop clusters, which speeds up the computations required for data science and AI use cases.
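To make the idea of scheduling on additional resource types concrete, here is a minimal Python sketch. It is not the YARN scheduler: the resource names ("memory_mb", "vcores", "gpu"), the node names, and the first-fit placement rule are all assumptions chosen for illustration. The point is only that a request is treated as a vector of named resources, and a node qualifies when it has enough of every one of them.

```python
from typing import Dict, Optional

# A resource vector: resource name -> amount (names are illustrative).
Resources = Dict[str, int]


def fits(request: Resources, available: Resources) -> bool:
    """True if the node has at least the requested amount of every resource."""
    return all(available.get(name, 0) >= amount
               for name, amount in request.items())


def place(request: Resources, nodes: Dict[str, Resources]) -> Optional[str]:
    """First-fit placement: pick the first node that satisfies the request,
    deduct the resources from it, and return its name (None if no node fits)."""
    for node_name, free in nodes.items():
        if fits(request, free):
            for name, amount in request.items():
                free[name] -= amount
            return node_name
    return None


if __name__ == "__main__":
    cluster = {
        "cpu-node": {"memory_mb": 8192, "vcores": 8},
        "gpu-node": {"memory_mb": 8192, "vcores": 8, "gpu": 2},
    }
    training_job = {"memory_mb": 4096, "vcores": 2, "gpu": 1}
    print(place(training_job, cluster))
```

A request that asks for a GPU can only land on a node that advertises one, while a plain memory-and-CPU request can land anywhere with enough capacity.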
Hadoop 2 cannot balance data across the disks within a node. Hadoop 3 adds intra-node disk balancing. If you are repurposing a server or adding new storage to an existing server that has older drives, disk space ends up unevenly used within that server. With intra-node disk balancing, data is spread evenly across the disks in each node.
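The balancing idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm Hadoop uses: the `Disk` type, the `plan_moves` function, and the greedy pairing of the fullest disks with the emptiest ones are all hypothetical. The sketch computes the utilization every disk would have if data were spread evenly, then plans moves from disks above that target to disks below it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Disk:
    name: str
    capacity_gb: float
    used_gb: float


def plan_moves(disks: List[Disk]) -> List[Tuple[str, str, float]]:
    """Return (source, destination, gigabytes) moves that bring every disk
    to the same utilization ratio."""
    target = sum(d.used_gb for d in disks) / sum(d.capacity_gb for d in disks)
    # Positive surplus: the disk holds more than its fair share.
    surplus = {d.name: d.used_gb - target * d.capacity_gb for d in disks}
    donors = sorted((n for n in surplus if surplus[n] > 0),
                    key=lambda n: -surplus[n])
    receivers = sorted((n for n in surplus if surplus[n] < 0),
                       key=lambda n: surplus[n])
    eps = 1e-9
    moves: List[Tuple[str, str, float]] = []
    i = j = 0
    while i < len(donors) and j < len(receivers):
        src, dst = donors[i], receivers[j]
        amount = min(surplus[src], -surplus[dst])
        if amount > eps:
            moves.append((src, dst, amount))
        surplus[src] -= amount
        surplus[dst] += amount
        if surplus[src] <= eps:    # donor is now at target
            i += 1
        if surplus[dst] >= -eps:   # receiver is now at target
            j += 1
    return moves


if __name__ == "__main__":
    node = [Disk("old-1", 1000, 800), Disk("old-2", 1000, 800),
            Disk("new-1", 2000, 0)]
    for src, dst, gb in plan_moves(node):
        print(f"move {gb:.0f} GB from {src} to {dst}")
```

In the example, two nearly full older drives each hand data to a newly added empty drive until all three sit at the same utilization.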
Hadoop 2 supports preemption only between queues. Hadoop 3 introduces intra-queue preemption, which goes a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within the queue based on user limits and/or application priority.
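As a rough illustration of priority-based preemption inside one queue, consider the Python sketch below. It is a simplification and not the YARN implementation: the `App` type and `select_preemptions` function are hypothetical, it assumes a higher number means a more important application, and it ignores user limits entirely. Pending demand from the most important applications is satisfied first by reclaiming containers from the least important ones.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class App:
    name: str
    priority: int     # assumption: higher number = more important
    running: int      # containers currently held
    pending: int = 0  # containers requested but not yet allocated


def select_preemptions(apps: List[App], free_containers: int = 0) -> Dict[str, int]:
    """Return how many containers to reclaim from each app in a single queue."""
    to_preempt: Dict[str, int] = {}
    remaining_free = free_containers
    # Serve demand from the most important applications first.
    for asker in sorted(apps, key=lambda a: -a.priority):
        need = asker.pending
        from_free = min(need, remaining_free)
        remaining_free -= from_free
        need -= from_free
        # Reclaim from the least important applications first.
        for victim in sorted(apps, key=lambda a: a.priority):
            if need == 0 or victim.priority >= asker.priority:
                break  # never preempt an equal or higher priority app
            available = victim.running - to_preempt.get(victim.name, 0)
            take = min(need, available)
            if take > 0:
                to_preempt[victim.name] = to_preempt.get(victim.name, 0) + take
                need -= take
    return to_preempt


if __name__ == "__main__":
    queue = [App("etl", 1, running=6), App("report", 2, running=4),
             App("urgent", 5, running=0, pending=8)]
    print(select_preemptions(queue))
```

Here the urgent application's request for 8 containers is met by taking all 6 from the lowest-priority application and 2 from the next one up, all within the same queue.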
In conclusion, we are very excited about the upcoming releases of Hadoop 3. The accelerated release schedule planned for this year will bring even more capabilities into the hands of users as soon as possible. If you look at the blog published last year, Data Lake 3.0: The Ez Button To Deploy In Minutes And Cut TCO By Half, you will see many of the Data Lake 3.0 architecture ideas and innovations from the Apache Hadoop community come to life in our next release of the Hortonworks Data Platform.