Snappy and Hadoop
Snappy is a compression library developed at Google and, like many technologies that come from Google, it was designed to be fast. The trade-off is that its compression ratio is lower than that of other compression libraries. From the Snappy homepage:
… compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
Snappy is related to the Lempel–Ziv family of compression algorithms, which includes well-known algorithms such as LZO. However, Snappy offers two clear benefits over LZO in the context of Apache Hadoop.
First, Snappy is significantly faster than LZO for decompression and comparable for compression, so the total round-trip time is better. Second, Snappy is BSD-licensed, which means it can be shipped with Hadoop. LZO, by contrast, is GPL-licensed, so it may not be included in Apache products and has to be downloaded and installed separately.
Why is Snappy useful for Hadoop?
Many Hadoop clusters use LZO compression for intermediate MapReduce output. This output, which is never seen by the user, is always written to disk by the mappers and then accessed across the network by the reducers. It is a prime candidate for compression for two reasons: it tends to be compressible (there is some redundancy in the key space, since the map outputs are sorted), and, because writing to disk is slow, it pays to perform some light compression to reduce the number of bytes written (and later read). Snappy and LZO are not CPU-intensive, which is important because other map and reduce processes running at the same time will not be deprived of CPU time. In testing, we have seen that the performance of Snappy is generally comparable to LZO, with up to a 20% improvement in overall job time in some cases.
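To see why sorted map output tends to compress well, here is a minimal Python sketch. It is an illustration only: Snappy is not part of the Python standard library, so zlib at its fastest level stands in as the "light" compressor, and the record layout (a key, a tab, and a count of 1) is an assumed word-count-style map output rather than anything Hadoop-specific. The function names are hypothetical.

```python
import random
import zlib


def make_records(num_keys=2000, repeats=10):
    """Build hypothetical map output records: one 'key<TAB>1' line per occurrence."""
    records = [f"key{n:06d}\t1\n".encode("ascii")
               for n in range(num_keys) for _ in range(repeats)]
    # Shuffle deterministically to mimic the order in which a mapper emits records.
    random.Random(0).shuffle(records)
    return records


def compressed_size(records):
    """Size in bytes after light compression (zlib level 1, a stand-in for Snappy)."""
    return len(zlib.compress(b"".join(records), 1))


records = make_records()
raw_size = sum(len(r) for r in records)
unsorted_size = compressed_size(records)
sorted_size = compressed_size(sorted(records))

print(f"raw bytes:           {raw_size}")
print(f"compressed unsorted: {unsorted_size}")
print(f"compressed sorted:   {sorted_size}")
```

Sorting places identical and near-identical keys next to each other, which is exactly the kind of local redundancy a Lempel–Ziv-style compressor exploits, so the sorted stream should come out smaller than the unsorted one.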
This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. For example, Snappy can be used for block compression in all the commonly used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.
One thing to note is that Snappy is intended to be used with a container format, such as Sequence Files or Avro Data Files, rather than directly on plain text, because a Snappy-compressed plain-text file is not splittable and can’t be processed in parallel using MapReduce. This differs from LZO, where it is possible to index LZO-compressed files to determine split points so that they can be processed efficiently in subsequent processing.
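The idea behind using a container format can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the layout of Sequence Files or Avro Data Files: records are compressed in independent blocks, and an explicit (offset, length) index is kept purely for simplicity so that any block can be read on its own. zlib again stands in for Snappy, and all names here are hypothetical.

```python
import zlib

RECORDS_PER_BLOCK = 1000  # arbitrary block size chosen for this sketch


def write_blocks(records, records_per_block=RECORDS_PER_BLOCK):
    """Compress records in independent blocks and return (data, index).

    index holds one (offset, length) pair per block, playing the role that
    block boundaries play in a real container format.
    """
    data = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = b"".join(records[start:start + records_per_block])
        block = zlib.compress(chunk, 1)  # zlib stands in for Snappy here
        index.append((len(data), len(block)))
        data.extend(block)
    return bytes(data), index


def read_block(data, index, block_number):
    """Decompress a single block without touching any other block."""
    offset, length = index[block_number]
    payload = zlib.decompress(data[offset:offset + length])
    return payload.splitlines(keepends=True)


records = [f"line {n}\n".encode("ascii") for n in range(5000)]
data, index = write_blocks(records)

# Each block could be handed to a different map task.
third_block = read_block(data, index, 2)
print(len(index), third_block[0], third_block[-1])
```

Because every block is compressed on its own, a reader can start at any block boundary. A single compressed stream over the whole file offers no such entry points, which is why compressing plain text directly gives up parallel processing.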
How to use Snappy with Hadoop
Snappy support was added to Hadoop in HADOOP-7206, which will be available in the forthcoming 0.23.0 Apache release. Enabling map output compression is as simple as adding the following to mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
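As a quick sanity check, a small script can confirm that a configuration file carries both settings. The sketch below is a Python illustration, not a Hadoop tool: it assumes the two properties sit inside Hadoop's usual <configuration> root element, embeds the XML as a string instead of reading a real mapred-site.xml, and uses hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed contents of mapred-site.xml: the two properties from the article
# wrapped in a <configuration> root element.
MAPRED_SITE = """\
<configuration>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its value."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}


def snappy_map_output_enabled(props):
    """True if map output compression is on and the codec is SnappyCodec."""
    return (props.get("mapred.compress.map.output") == "true"
            and props.get("mapred.map.output.compression.codec")
            == "org.apache.hadoop.io.compress.SnappyCodec")


props = read_properties(MAPRED_SITE)
print(snappy_map_output_enabled(props))  # prints True
```

To check a real file, the embedded string could be replaced with the contents of the cluster's own mapred-site.xml.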
Avro support for Snappy is available in the latest release, 1.5.4. HBase added support in HBASE-3691, which will be available in the 0.92.0 release.
Like many open-source efforts, integrating Snappy with Hadoop was the work of many people. Google developed the algorithm and released the Snappy open source implementation. Issei Yoshida wrote the Snappy compression codec for Hadoop. Doug Cutting added support for Snappy to Avro, using Taro L. Saito’s snappy-java bindings. Nicholas Telford and Nichole Treadway wrote the HBase integration. Alejandro Abdelnur, Bruno Mahé, Roman Shaposhnik and Wing Yew Poon tested the component integration. Patrick Daly and Paul Battaglia wrote the CDH3 documentation.