The Apache Hadoop project announced the release of 3.0.0-alpha2 on January 25th, 2017. This is the second alpha release in the 3.0.0 release series leading up to 3.0.0 GA, and incorporates 857 new fixes, improvements, and features since 3.0.0-alpha1 last September. It’s worth reading our previous blog post about 3.0.0-alpha1; in this post, we’ll discuss the new improvements that landed in alpha2.
Classpath Isolation for Hadoop Client Jars
Many Java developers have felt the pain of poor classpath isolation. It’s essentially a problem of conflicting dependency versions: the Hadoop client may require a specific version of a Java library on the application’s classpath, while the application already uses a different, incompatible version of that same library. This can result in a ClassNotFoundException or NoSuchMethodError at runtime, or in otherwise unknown, untested behavior.
This problem is partially addressed by the new shaded client artifacts introduced by HADOOP-11804. Shading creates a JAR that bundles all of its dependencies, relocated under a private package prefix so they cannot clash with the application’s own copies, similar to static linking. The shaded Hadoop client thus doesn’t require additional dependencies to be added to the application’s classpath, letting the application freely use whatever dependencies and versions it chooses.
Support for Microsoft Azure Data Lake and Aliyun Object Storage System
Apache Hadoop has added filesystem connectors for Microsoft Azure Data Lake and Aliyun Object Storage System. This allows users to interact with these storage systems via the normal Hadoop filesystem APIs.
Support for Opportunistic Containers and Distributed Scheduling
YARN introduces the notion of opportunistic containers in addition to the current guaranteed containers. An opportunistic container is queued at the NodeManager until resources become available, and runs only so long as those resources remain available. Opportunistic containers are preempted, if and when needed, to make room for guaranteed containers. Running opportunistic containers between the completion of a guaranteed container and the allocation of a new one should improve cluster utilization.
In the current implementation, applications need to explicitly request opportunistic containers. These containers are best suited for short-running tasks. By default, they are allocated by the central ResourceManager (RM). There is also support for an external (potentially distributed) scheduler to queue opportunistic containers.
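The queue-then-preempt behavior described above can be sketched as a toy model of a single NodeManager. This is only an illustration, not the YARN API: the class NodeManagerSketch, its methods, and the abstract "slots" resource unit are all hypothetical, and the sketch assumes that a preempted container is simply killed and that the most recently started opportunistic container is preempted first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of one NodeManager. Hypothetical names; not the YARN API. */
class NodeManagerSketch {
    enum Type { GUARANTEED, OPPORTUNISTIC }

    static final class Container {
        final String id;
        final Type type;
        final int slots; // abstract resource units (an assumption of this sketch)
        Container(String id, Type type, int slots) {
            this.id = id;
            this.type = type;
            this.slots = slots;
        }
    }

    private final int capacity;
    private int used = 0;
    private final List<Container> running = new ArrayList<>();
    private final Deque<Container> queued = new ArrayDeque<>();
    private final List<String> preempted = new ArrayList<>();

    NodeManagerSketch(int capacity) { this.capacity = capacity; }

    /** Returns false only if a guaranteed container cannot fit even after preemption. */
    boolean start(Container c) {
        if (c.type == Type.OPPORTUNISTIC) {
            queued.addLast(c); // opportunistic containers always go through the queue
            drainQueue();
            return true;
        }
        int reclaimable = 0;
        for (Container r : running) {
            if (r.type == Type.OPPORTUNISTIC) reclaimable += r.slots;
        }
        if (used - reclaimable + c.slots > capacity) return false;
        // Preempt opportunistic containers, newest first, until the guaranteed one fits.
        for (int i = running.size() - 1; i >= 0 && capacity - used < c.slots; i--) {
            if (running.get(i).type == Type.OPPORTUNISTIC) {
                Container victim = running.remove(i);
                used -= victim.slots;
                preempted.add(victim.id);
            }
        }
        running.add(c);
        used += c.slots;
        drainQueue(); // preemption may have freed more than was needed
        return true;
    }

    /** Marks a running container as finished and lets queued work use the freed slots. */
    void finish(String id) {
        for (int i = 0; i < running.size(); i++) {
            if (running.get(i).id.equals(id)) {
                used -= running.remove(i).slots;
                break;
            }
        }
        drainQueue();
    }

    /** Starts queued opportunistic containers in FIFO order while the head fits. */
    private void drainQueue() {
        while (!queued.isEmpty() && queued.peekFirst().slots <= capacity - used) {
            Container next = queued.pollFirst();
            running.add(next);
            used += next.slots;
        }
    }

    List<String> runningIds() { return ids(running); }
    List<String> queuedIds() { return ids(queued); }
    List<String> preemptedIds() { return new ArrayList<>(preempted); }

    private static List<String> ids(Iterable<Container> cs) {
        List<String> out = new ArrayList<>();
        for (Container c : cs) out.add(c.id);
        return out;
    }

    public static void main(String[] args) {
        NodeManagerSketch nm = new NodeManagerSketch(4);
        nm.start(new Container("g1", Type.GUARANTEED, 3));
        nm.start(new Container("o1", Type.OPPORTUNISTIC, 1)); // fits, runs at once
        nm.start(new Container("o2", Type.OPPORTUNISTIC, 2)); // no room, stays queued
        nm.finish("g1");                                       // o2 now starts
        nm.start(new Container("g2", Type.GUARANTEED, 3));     // preempts o2
        System.out.println("running=" + nm.runningIds()
            + " queued=" + nm.queuedIds() + " preempted=" + nm.preemptedIds());
    }
}
```

In this scenario, o2 fills the gap left when g1 completes, which is the utilization gain described above, and is then preempted when g2 arrives. A real NodeManager tracks memory and CPU rather than a single slot count, so the numbers here are purely illustrative.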
Please see the documentation for more details.
The Apache Hadoop 3.0.0 release series continues to grow and improve based on community feedback, reflected in the recent 3.0.0-alpha2 release. The current upstream release plan is for one more alpha release to finalize HDFS erasure coding and a few other features before moving on to beta1 (and then GA). This makes the next development phases a crucial time to integrate additional user feedback before we freeze compatibility for beta.
So, download the release and try out new features like the shaded client or erasure coding, and file a JIRA with any bugs or improvements. If you’re interested in getting more involved with Hadoop 3 release validation, please email the dev lists or feel free to reach out to one of us here at Cloudera directly.
Andrew Wang is a software engineer on Cloudera’s HDFS team, an Apache Hadoop PMC member and committer, and the release manager for Hadoop 3.
Ray Chiang is a software engineer on Cloudera’s RM team, and an Apache Hadoop committer.