Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files
This is a guest repost contributed by Matteo Bertozzi, a Developer at Develer S.r.l.
Apache Hadoop’s SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key editing, adding or removing it. This file is append-only.
SequenceFile has 3 available formats: An “Uncompressed” format, A “Record Compressed” format and a “Block-Compressed”. All of them share a header that contains some information which allows the reader to recognize is format. There’re Key and Value Class Name’s that allow the reader to instantiate those classes, via reflection, for reading. The version number and format (Is Compressed, Is Block Compressed), if compression is enabled the Compression Codec class name field is added to the header.
The sequence file also can contain a “secondary” key-value list that can be used as file Metadata. This key-value list can be just a Text/Text pair, and is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can’t edit your metadata.
As seen Sequence File has 3 available formats, the “Uncompressed” and the “Record Compressed” are really similar. Each call to the append() method adds a record to the sequence file which contains the length of the whole record (key length + value length), the length of the key and the raw data of key and value. The difference between the compressed and the uncompressed version is that the value raw data is compressed, with the specified codec, or not.
In contrast the “Block-Compressed” format is more compression-aggressive. Data is not written until it reaches a threshold, and when the threshold is reached all keys are compressed together–the same happens for the values and the auxiliary lists of key and value lengths.
As you can see in the figure on the left, a block record contains a VInt with the number of the buffered records and 4 compressed blocks that contains a list with the length of the keys, the list of keys, another list with the length of the values and finally the list of values. Before each block a sync marker is written.
Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.
The MapFile is a directory that contains two SequenceFile: the data file (“/data”) and the index file (“/index”). The data contains all the key, value records but key N + 1 must be greater then or equal to the key N. This condition is checked during the append() operation, if checkKey fail it throws an IOException “Key out of order”.
The Index file is populated with the key and a LongWritable that contains the starting byte position of the record. Index does’t contains all the keys but just a fraction of the keys, you can specify the indexInterval calling setIndexInterval() method. The Index is read enteirely into memory, so if you’ve large map you can set a index skip value that allows you to keep in memory just a fraction of the index keys.
SetFile and ArrayFile are based on MapFile, and their implementation are just few lines of code. The SetFile instead of append(key, value) as just the key field append(key) and the value is always the NullWritable instance. The ArrayFile as just the value field append(value) and the key is a LongWritable that contains the record number, count + 1. The BloomMapFile extends the MapFile adding another file, the bloom file “/bloom”, and this file contains a serialization of the DynamicBloomFilter filled with the added keys. The bloom file is written entirely during the close operation.
If you want to play with SequenceFile, MapFile, SetFile, ArrayFile without using Java, I’ve written a naive implementation in python. You can find it, in my github repository python-hadoop.