The recently released Apache Hive 2.0 contains some exciting improvements, many of which are already available in CDH 5.x.
Recently, the Apache Hive community announced Hive 2.0.0. This release is larger than the previous one (covered here), with a lengthy list of new features (many experimental), enhancements, and bug fixes. Cloudera’s Hive team has been working with the community for months to drive toward this significant release.
Here are some of the highlights of Apache Hive 2.0 (see the release notes for a complete list of features, improvements, and bug fixes):
New Features
- HBase metastore (HIVE-9452) – alpha
- LLAP (HIVE-7926) – beta
- HPL/SQL for procedural SQL (HIVE-11055)
- Hive-on-Spark: container prewarm (HIVE-11363)
- CLI mode in Beeline for Hive CLI deprecation (HIVE-10516)
- Hive-on-Spark parallel ORDER BY (HIVE-10458)
Performance and Optimizations
- Hive-on-Spark: Dynamic partition pruning (HIVE-9152)
- Hive-on-Spark: make use of Spark persistence for self union/join (HIVE-10844, HIVE-10550)
- Enable optimized hash tables for Spark (HIVE-11182)
- Hive-on-Spark: vectorized map-join and other join improvements (HIVE-10855, HIVE-10302)
- CBO enhancements (HIVE-10627, HIVE-10686)
- Apache Parquet predicate pushdown (HIVE-11401)
Usability, Supportability, and Stability
- Codahale-based metrics (HIVE-10761)
- HiveServer2 web UI (HIVE-12338)
- More stable and usable Hive-on-Spark (HIVE-8858, HIVE-9139, HIVE-10434, HIVE-10476, HIVE-10594, HIVE-10989, and so on)
Many of the production-ready improvements above are already included, or are scheduled to be included, in the CDH 5.x line, including the HiveServer2 web UI, new metrics, improved Apache Parquet support, and Hive-on-Spark enhancements. Furthermore, the Hive 2.0 release enforces safer configurations and chooses better defaults for certain settings. (It’s worth noting, however, that the release also contains code that is either no longer supported or on the path to deprecation, such as Hadoop-1, MapReduce (MR), and Java 6.)
In conclusion, there is much to be excited about in the Hive 2.0 release, and Cloudera has already backported some of the more significant features and fixes into CDH 5.x. We look forward to working with the rest of the Hive community to further improve and stabilize new features and enhancements along the 2.x release line, and to bring those improvements to CDH users as they become production-ready.
Xuefu Zhang is a Software Engineer at Cloudera and a PMC member of Apache Hive.