Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.
The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):
- In-memory caching for HDFS, including centralized administration and management
- Groundwork for future support of heterogeneous storage in HDFS
- Simplified distribution of MapReduce binaries via the YARN Distributed Cache
You can read the release notes here. Congratulations to everyone who contributed!
As noted above, one of the major new features in Hadoop 2.3.0 is HDFS caching, which enables memory-speed reads in HDFS. This feature was developed by two engineers/Hadoop committers at Cloudera: Andrew Wang and Colin McCabe.
HDFS caching lets users explicitly cache certain files or directories in HDFS. DataNodes then cache the corresponding blocks in off-heap memory using mlock. Once blocks are cached, Hadoop applications can query their locations and schedule tasks for memory-locality. Finally, when running memory-local, applications can use the new zero-copy read API to read cached data with no additional overhead. Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second.
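As a rough sketch of how a user might drive this feature from the command line with the new `hdfs cacheadmin` tool (the pool and path names below are illustrative, not from the release):

```shell
# Create a cache pool, an administrative grouping for cache directives
# (the pool name "analytics-pool" is just an example)
hdfs cacheadmin -addPool analytics-pool

# Ask the NameNode to cache an HDFS directory in that pool;
# DataNodes holding its blocks will mlock them into off-heap memory
hdfs cacheadmin -addDirective -path /user/hive/warehouse/hot_table -pool analytics-pool

# Verify which directives are active and what has been cached
hdfs cacheadmin -listDirectives
```

Because caching is directive-based and centrally managed by the NameNode, administrators can pin hot datasets without changing application code; applications that want the full benefit then use the zero-copy read API on the cached blocks.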
Better yet, this feature will be landing in CDH 5.0 (which is based on Hadoop 2.3.0) when it ships alongside corresponding Impala improvements that take advantage of these new APIs for improved performance. So, you can look forward to an even faster Impala in the new release!
Justin Kestelyn is Cloudera’s developer outreach director.