Upgrading your clusters and workloads from Apache Hadoop 2 to Apache Hadoop 3

Posted in Technical | September 20, 2018 4 min read

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate.

If you’re interested in learning more, go to our recap blog here!

Introduction

The Apache Hadoop community announced Hadoop 3.0 GA in December, 2017 and Hadoop 3.1 in April, 2018 loaded with great features and improvements.

One of the biggest challenges while upgrading to a new major release of a software platform is its compatibility. Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.

There are some challenges to be addressed by admins while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should ideally be able to seamlessly run or migrate their workloads onto Hadoop 3. This blog talks about the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and workload migration guide for users of Hadoop.

Why upgrade to Hadoop 3?

Hadoop 3 has been an eagerly awaited major release with great features and improvements in both YARN and HDFS. Below, we have highlighted only a few of the newly added features. There are tons of exciting features and improvements that were part of Hadoop 3 which will delight customers and increase Hadoop adoption without costing significantly more.

HDFS

HDFS introduced Erasure coding in Hadoop 3 which enables significant storage cost savings and reduction in storage overhead from 200% to 50%.

HDFS Intra-Data Node disk Balancer is a new feature added in Hadoop 3 which adds the capability to balance data across disks within a single Datanode.

HDFS Federation with support for namespaces for better scalability is GA in Hadoop 3.

YARN

YARN scheduler added a bunch of improvements particularly Capacity Scheduler with changes to scheduler’s core for faster(global) scheduling and native support for new resource types like GPUs.

Support for Containerized applications on YARN ( via Docker) is GA in Hadoop 3 along with support for long running services and service discovery.

Express or Rolling Upgrades?

Admins are recommended to Express Upgrade clusters from Hadoop 2 to 3

Why not rolling upgrades ?

With this being a major version upgrade, there are few challenges and issues in supporting Rolling Upgrades. Some of the open issues are

HDFS Rolling Upgrade has issues with NN restarts due to change in edit log format HDFS-13596
HDFS-6440 Incompatible changes in image transfer protocol
Hadoop daemons configured with a custom MetricsPlugin sink fail to start after upgrade HADOOP-15502 (API In-compatibility)

Compatibility with Hadoop 2

Wire compatibility

Hadoop 3 preserves wire compatibility with Hadoop 2 clients
Distcp/WebHDFS compatibility is preserved

API compatibility

Hadoop 3 doesn’t preserve full API level compatibility due to the following changes

Classpath – Dependency version bumps like guava
Removal of deprecated APIs and tools
Shell script rewrites
Incompatible bug fixes

Major Configuration/Script changes in Hadoop 3

Some of the major configuration/script changes introduced in Hadoop 3 are listed below

Change in Default Daemon Ports for HDFS HDFS-9427
Hadoop Script rewrites HADOOP-9902
Configuring Hadoop Daemon Heap Size HADOOP-10950
- HADOOP_HEAPSIZE deprecated in favour of HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN
- Auto-tuning based on memory of the host
RM Max Completed Applications in State Store/Memory
- Defaults reduced from 10000 to 1000

For more details on changes in Hadoop 3 and how they would affect upgrades or workloads post upgrade, refer to the slides for the talk we gave recently at Dataworks Summit

Recommended Versions for Upgrade

Recommend users to upgrade to at least Hadoop 2.8.4 before migrating to Hadoop 3.

Upgrade Process

Environment

Java

Hadoop 3 requires Java 8 and above to be deployed on the cluster

Docker

Minimum supported Docker version on Hadoop 3 is 1.12.5

Shell

Minimum version required is Bash V3. POSIX shells are NOT supported

New Service Daemons post upgrade

Hadoop 3 requires additional YARN components in the cluster for running YARN Service applications

YARN Registry DNS
YARN Timeline Service V2.0 Reader

Apache Hadoop Upgrade

Apache Hadoop upgrade is fairly involved with elaborate steps for pre-upgrade, upgrade process and post-upgrade. Please refer to the slides for the talk we gave recently at Dataworks Summit for further details.

Migrating Workloads

MapReduce applications

MapReduce is fully binary compatible and workloads should run as is without any changes required.

Slider apps

Apache Slider is retiring from Apache Incubator and users are required to port their Slider applications to YARN Services. Please refer to our recent blog on Apache Hive LLAP as a YARN Service on how easy it is to port Slider apps to YARN Services.

Benefits of YARN Services

Easier to manage and deploy
Single yarnfile to configure a YARN Service
Supports container placement scheduling such as affinity and anti-affinity YARN-6592
Rolling upgrades for containers and service YARN-7512 and YARN-4726.
Services UI in YARN UI2 for viewing and submitting YARN Service Apps

Versions of Apache projects which support Hadoop 3.x

The following Apache versions already support Hadoop 3 or are in the process of supporting Hadoop 3 in the community.

Conclusion

Express upgrades are the recommended way to go for upgrading from Apache Hadoop 2 to Hadoop 3.

Apache Hadoop community recently released Hadoop 3.1.1 with some features and fixes for bugs identified since Apache Hadoop 3.1.0. Recommend users to upgrade to Apache Hadoop 3.1.1 from Apache Hadoop 2.8.4. For end users, applications on HDFS and YARN should run mostly as is !

References

Dataworks summit San Jose 2018 talk on Apache Hadoop 2 to 3 Upgrades
Community Effort – Apache Hadoop community effort for upgrade related aspects and issues.

Suma Shivaprasad

Staff Software Engineer

More by this author

Rohith Sharma