This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate.
If you’re interested in learning more, go to our recap blog here!
The Apache Hadoop community announced Hadoop 3.0 GA in December, 2017 and Hadoop 3.1 in April, 2018 loaded with great features and improvements.
One of the biggest challenges while upgrading to a new major release of a software platform is its compatibility. Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are some challenges to be addressed by admins while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should ideally be able to seamlessly run or migrate their workloads onto Hadoop 3. This blog talks about the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and workload migration guide for users of Hadoop.
Why upgrade to Hadoop 3?
Hadoop 3 has been an eagerly awaited major release with great features and improvements in both YARN and HDFS. Below, we have highlighted only a few of the newly added features. There are tons of exciting features and improvements that were part of Hadoop 3 which will delight customers and increase Hadoop adoption without costing significantly more.
HDFS introduced Erasure coding in Hadoop 3 which enables significant storage cost savings and reduction in storage overhead from 200% to 50%.
HDFS Intra-Data Node disk Balancer is a new feature added in Hadoop 3 which adds the capability to balance data across disks within a single Datanode.
HDFS Federation with support for namespaces for better scalability is GA in Hadoop 3.
YARN scheduler added a bunch of improvements particularly Capacity Scheduler with changes to scheduler’s core for faster(global) scheduling and native support for new resource types like GPUs.
Support for Containerized applications on YARN ( via Docker) is GA in Hadoop 3 along with support for long running services and service discovery.
Express or Rolling Upgrades?
Admins are recommended to Express Upgrade clusters from Hadoop 2 to 3
Why not rolling upgrades ?
With this being a major version upgrade, there are few challenges and issues in supporting Rolling Upgrades. Some of the open issues are
- HDFS Rolling Upgrade has issues with NN restarts due to change in edit log format HDFS-13596
- HDFS-6440 Incompatible changes in image transfer protocol
- Hadoop daemons configured with a custom MetricsPlugin sink fail to start after upgrade HADOOP-15502 (API In-compatibility)
Compatibility with Hadoop 2
- Hadoop 3 preserves wire compatibility with Hadoop 2 clients
- Distcp/WebHDFS compatibility is preserved
Hadoop 3 doesn’t preserve full API level compatibility due to the following changes
- Classpath – Dependency version bumps like guava
- Removal of deprecated APIs and tools
- Shell script rewrites
- Incompatible bug fixes
Major Configuration/Script changes in Hadoop 3
Some of the major configuration/script changes introduced in Hadoop 3 are listed below
- Change in Default Daemon Ports for HDFS HDFS-9427
- Hadoop Script rewrites HADOOP-9902
- Configuring Hadoop Daemon Heap Size HADOOP-10950
- HADOOP_HEAPSIZE deprecated in favour of HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN
- Auto-tuning based on memory of the host
- RM Max Completed Applications in State Store/Memory
- Defaults reduced from 10000 to 1000
For more details on changes in Hadoop 3 and how they would affect upgrades or workloads post upgrade, refer to the slides for the talk we gave recently at Dataworks Summit
Recommended Versions for Upgrade
Recommend users to upgrade to at least Hadoop 2.8.4 before migrating to Hadoop 3.
Hadoop 3 requires Java 8 and above to be deployed on the cluster
Minimum supported Docker version on Hadoop 3 is 1.12.5
Minimum version required is Bash V3. POSIX shells are NOT supported
New Service Daemons post upgrade
Hadoop 3 requires additional YARN components in the cluster for running YARN Service applications
- YARN Registry DNS
- YARN Timeline Service V2.0 Reader
Apache Hadoop Upgrade
Apache Hadoop upgrade is fairly involved with elaborate steps for pre-upgrade, upgrade process and post-upgrade. Please refer to the slides for the talk we gave recently at Dataworks Summit for further details.
MapReduce is fully binary compatible and workloads should run as is without any changes required.
Apache Slider is retiring from Apache Incubator and users are required to port their Slider applications to YARN Services. Please refer to our recent blog on Apache Hive LLAP as a YARN Service on how easy it is to port Slider apps to YARN Services.
Benefits of YARN Services
- Easier to manage and deploy
- Single yarnfile to configure a YARN Service
- Supports container placement scheduling such as affinity and anti-affinity YARN-6592
- Rolling upgrades for containers and service YARN-7512 and YARN-4726.
- Services UI in YARN UI2 for viewing and submitting YARN Service Apps
Versions of Apache projects which support Hadoop 3.x
The following Apache versions already support Hadoop 3 or are in the process of supporting Hadoop 3 in the community.
Express upgrades are the recommended way to go for upgrading from Apache Hadoop 2 to Hadoop 3.
Apache Hadoop community recently released Hadoop 3.1.1 with some features and fixes for bugs identified since Apache Hadoop 3.1.0. Recommend users to upgrade to Apache Hadoop 3.1.1 from Apache Hadoop 2.8.4. For end users, applications on HDFS and YARN should run mostly as is !
- Dataworks summit San Jose 2018 talk on Apache Hadoop 2 to 3 Upgrades
- Community Effort – Apache Hadoop community effort for upgrade related aspects and issues.