Tag Archives: hadoop i/o

Apache HBase I/O – HFile

Categories: HBase

Introduction

Apache HBase is Hadoop's open-source, distributed, versioned storage manager, well suited to random, realtime read/write access.

Wait, wait. Random, realtime read/write access?
How is that possible? Isn't Hadoop just a sequential read/write, batch processing system?

Yes, we’re talking about the same thing. In the next few paragraphs, I’m going to explain how HBase achieves random I/O, how it stores data, and how HBase’s HFile format has evolved.


Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files

Categories: Hadoop MapReduce

This is a guest repost contributed by Matteo Bertozzi, a Developer at Develer S.r.l.

Apache Hadoop’s SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key to edit, add, or remove it. The file is append-only.
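
To make the append-only idea concrete, here is a minimal sketch in Python of a toy key-value log. It is not Hadoop's SequenceFile API or its on-disk layout: the record framing (two 4-byte big-endian lengths followed by the key and value bytes) and the helper names `append_record`, `scan`, and `find` are assumptions made purely for illustration. What it shows is the access pattern described above: records can only be added at the end, and finding a key means reading the file from the start.

```python
import io
import struct

# Hypothetical framing for this toy log (NOT the SequenceFile format):
# [4-byte key length][4-byte value length][key bytes][value bytes]
HEADER = struct.Struct(">II")


def append_record(stream, key: bytes, value: bytes) -> None:
    """Write one record at the end of the stream; nothing is edited in place."""
    stream.seek(0, io.SEEK_END)
    stream.write(HEADER.pack(len(key), len(value)))
    stream.write(key)
    stream.write(value)


def scan(stream):
    """Yield (key, value) pairs in the order they were appended."""
    stream.seek(0)
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of the log
        key_len, value_len = HEADER.unpack(header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        yield key, value


def find(stream, wanted: bytes):
    """Return the first value stored under `wanted`, or None.

    There is no index, so this is a sequential scan from the beginning.
    """
    for key, value in scan(stream):
        if key == wanted:
            return value
    return None


# An in-memory stream stands in for a file on disk.
log = io.BytesIO()
append_record(log, b"row1", b"alpha")
append_record(log, b"row2", b"beta")
append_record(log, b"row3", b"gamma")

records = list(scan(log))
hit = find(log, b"row2")
miss = find(log, b"row9")
```

In this toy version a lookup costs a pass over the records that precede the key, which is the trade-off the comparison with B-Trees points at.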

SequenceFile has 3 available formats: an “Uncompressed” format, a “Record Compressed”
