Inside Apache HBase’s New Support for MOBs

Categories: HBase

Learn about the design decisions behind HBase’s new support for MOBs.

Apache HBase is a distributed, scalable, performant, consistent key-value database that can store a variety of binary data types. It excels at storing many relatively small values (<10KB) and providing low-latency reads and writes.

However, there is a growing demand for storing documents, images, and other medium objects (MOBs) in HBase while maintaining low latency for reads and writes. One such use case is a bank that stores signed and scanned customer documents. As another example, transport agencies may want to store snapshots of traffic and moving cars. These MOBs are generally write-once.

Unfortunately, performance can degrade when many moderately sized values (100KB to 10MB) are stored, due to the ever-increasing I/O pressure created by compactions. Consider the case where 1TB of photos from traffic cameras, each 1MB in size, is stored in HBase daily. Parts of the stored files are compacted multiple times via minor compactions, and eventually the data is rewritten by major compactions. As these MOBs accumulate, the I/O created by compactions slows compactions down further, blocks memstore flushing, and eventually blocks updates. A large MOB store also triggers frequent region splits, reducing the availability of the affected regions.

In order to address these drawbacks, Cloudera and Intel engineers have implemented MOB support in an HBase branch (hbase-11339: HBase MOB). This branch will be merged into the master branch for HBase 1.1 or 1.2, and it is already present and supported in CDH 5.4.x as well.

Operations on MOBs are usually write-intensive, with rare updates or deletes and relatively infrequent reads. MOBs are usually stored together with their metadata; metadata for a MOB might include, for instance, a car’s number, speed, and color. Metadata is very small relative to the MOB itself. Metadata is usually accessed for analysis, while MOBs are randomly accessed only when they are explicitly requested by row key.

Users want to read and write MOBs in HBase with low latency through the same APIs, and want strong consistency, security, snapshots, HBase replication between clusters, and so on. To meet these goals, MOBs were moved out of the main I/O path of HBase and into a new I/O path.

In this post, you will learn about this design approach, and why it was selected.

Possible Approaches

There were a few possible approaches to this problem. The first approach we considered was to store MOBs in HBase with tuned split and compaction policies: a bigger desiredMaxFileSize decreases the frequency of region splits, and fewer or no compactions avoid the write-amplification penalty. That approach would improve write latency and throughput considerably. However, as the number of stored files grew, there would be too many open readers in a single store, possibly more than the OS allows. As a result, a lot of memory would be consumed and read performance would degrade.

Another approach was to use an HBase + HDFS model that stores the metadata and MOBs separately. In this model, each file in HDFS is linked to by an entry in HBase. This is a client-side solution, and the transaction is controlled by the client; no HBase-side memory is consumed by MOBs. This approach would work for objects larger than 50MB, but for MOBs, many small files lead to inefficient HDFS usage since the default block size in HDFS is 128MB.

For example, let’s say a NameNode has 48GB of memory and each file is 100KB with three replicas. Each file takes more than 300 bytes of NameNode memory, so a NameNode with 48GB of memory can hold about 160 million files (48GB / 300 bytes ≈ 160 million). At 100KB per file, that would limit us to storing only about 16TB of MOB files in total.

As an improvement, we could have assembled the small MOB files into bigger ones, so that a single file holds multiple MOB entries, and stored each entry’s offset and length in the HBase table for fast reading. However, maintaining data consistency, and managing deleted MOBs and small MOB files during compactions, would be difficult.

Furthermore, if we were to use this approach, we’d have to devise new security policies, we’d lose the atomicity properties of writes, and we’d potentially lose the backup and disaster recovery provided by replication and snapshots.

HBase MOB Design

In the end, because most of the concerns around storing MOBs in HBase involve the I/O created by compactions, the key was to move MOBs out of management by normal regions so that they do not trigger region splits and compactions there.

The HBase MOB design is similar to the HBase + HDFS approach in that we store the metadata and MOBs separately. The difference lies in the server-side design: the memstore caches the MOBs before they are flushed to disk, the MOBs are written into an HFile called a “MOB file” on each flush, and each MOB file holds multiple entries rather than one HDFS file per MOB. MOB files are stored in a special region. All reads and writes use the existing HBase APIs.

Write and Read

Each MOB-enabled column family has a threshold: if the value length of a cell is larger than this threshold, the cell is regarded as a MOB cell.
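
For illustration, here is a minimal sketch of creating a table with a MOB-enabled column family, using the HColumnDescriptor methods from the hbase-11339 branch (as merged for upstream HBase 2.0); the “traffic” table and “photo” family names are made up for this example:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateMobTable {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          HTableDescriptor table = new HTableDescriptor(TableName.valueOf("traffic"));
          HColumnDescriptor photos = new HColumnDescriptor("photo");
          photos.setMobEnabled(true);       // route large cells into MOB files
          photos.setMobThreshold(102400L);  // cells larger than 100KB become MOB cells
          table.addFamily(photos);
          admin.createTable(table);
        }
      }
    }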

When MOB cells are updated in the regions, they are written to the WAL and memstore just like normal cells. On flush, the MOBs are flushed to MOB files, and the metadata and paths of the MOB files are flushed to store files. Data consistency and the HBase replication features are native to this design.
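
From the client’s point of view, nothing changes on the write path; MOB cells are written with the ordinary Put API. A sketch, assuming the illustrative “traffic” table above and an open Connection conn:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // byte[] image holds the photo bytes; the row key layout is illustrative
    try (Table table = conn.getTable(TableName.valueOf("traffic"))) {
      Put put = new Put(Bytes.toBytes("camera42#2015-06-01T08:00"));
      put.addColumn(Bytes.toBytes("photo"), Bytes.toBytes("jpg"), image);
      table.put(put);  // the cell goes through the WAL and memstore like any other
    }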

MOB edits are larger than usual, so the corresponding I/O during WAL sync is larger too, which can slow down WAL sync operations. If other regions share the same WAL, the write latency of those regions can be affected. However, if data consistency and durability are needed, the WAL is a must.

Cells are permitted to move between store files and MOB files during compactions by changing the threshold. The default threshold is 100KB.

The cells that contain the paths of MOB files are called reference cells. Tags are retained in these cells, so we can continue to rely on the HBase security mechanisms.

Reference cells carry a reference tag that differentiates them from normal cells. A reference tag implies a MOB cell in a MOB file, so further resolution is needed on read.

On read, the store scanner opens scanners against the memstore and store files. If a reference cell is encountered, the scanner reads the file path from the cell value and seeks the same row key in that file. The block cache can be enabled for MOB files during scans, which can accelerate seeking.

It is not necessary to open readers for all the MOB files; only the one required is opened. Consequently, this random read is not impacted by the number of MOB files, and we don’t need to compact the MOB files over and over again once they are large enough.
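
This resolution is transparent to clients: a plain Get returns the MOB value. Continuing the sketch above:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;

    Get get = new Get(Bytes.toBytes("camera42#2015-06-01T08:00"));
    Result result = table.get(get);
    // the server follows the reference cell to the MOB file and returns the value
    byte[] image = result.getValue(Bytes.toBytes("photo"), Bytes.toBytes("jpg"));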

The MOB filename is readable and comprises three parts: the MD5 of the start key, the latest date of the cells in the MOB file, and a UUID. The first part is derived from the start key of the region from which the MOB file was flushed. MOBs usually have a user-defined TTL, so expired MOB files can be found and deleted by comparing the second part with the TTL.
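
The exact layout is an implementation detail, but as a hedged sketch, assuming a 32-character MD5 hex digest followed by an 8-character yyyymmdd date and a trailing UUID:

    // Hedged sketch; the field widths are assumptions, not the authoritative format.
    static String[] parseMobFileName(String name) {
      String md5OfStartKey = name.substring(0, 32);   // MD5 of the region start key
      String latestDate    = name.substring(32, 40);  // yyyymmdd; compared against the TTL
      String uuid          = name.substring(40);      // makes the filename unique
      return new String[] { md5OfStartKey, latestDate, uuid };
    }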

Snapshot

To be snapshot-friendly, the MOB files are stored in a special dummy region, so that snapshots, table export/clone, and archiving work as expected.

When taking a snapshot of a table, the MOB region is created in the snapshot and the existing MOB files are added to the manifest. When restoring the snapshot, file links are created in the MOB region.

Cleaning and Compactions

There are two situations in which MOB files should be deleted: when a MOB file has expired, and when a MOB file is too small and should be merged into bigger ones to improve HDFS efficiency.

HBase MOB has a chore in the master: it scans the MOB files, finds the expired ones as determined by the date in the filename, and deletes them. Thus disk space is reclaimed periodically by aging off expired MOB files.
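
The TTL itself is the ordinary column-family TTL. Continuing the illustrative descriptor from the earlier sketch, a 90-day TTL would look like this:

    photos.setTimeToLive(90 * 24 * 60 * 60);  // TTL in seconds; the master chore deletes
                                              // MOB files whose date part has expired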

MOB files may be relatively small compared to an HDFS block if you write rows in which only a few entries qualify as MOBs; also, there might be deleted cells. You need to drop the deleted cells and merge the small files into bigger ones to improve HDFS utilization. MOB compactions compact only the small files; large files are not touched, which avoids repeatedly compacting the same large files.

Some other things to keep in mind:

  • Know which cells are deleted: in every HBase major compaction, the delete markers are written to a del file before they are dropped.
  • In the first step of MOB compactions, these del files are merged into bigger ones.
  • All the small MOB files are selected. If the number of small files equals the number of existing MOB files, the compaction is regarded as a major one and is called an ALL_FILES compaction (see the sketch after this list).
  • The selected files are partitioned by the start key and date in the filename. The small files in each partition are compacted with the del files so that deleted cells are dropped; meanwhile, a new HFile with new reference cells is generated, the compactor commits the new MOB file, and then it bulk loads this HFile into HBase.
  • After the compactions in all partitions have finished, if an ALL_FILES compaction was involved, the del files are archived.
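
As a hedged illustration of the selection rule only (not the actual compactor code), with the small-file cutoff as an assumed parameter:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // If every existing MOB file is small enough to be selected, the run is an
    // ALL_FILES (major) compaction and the del files can be archived afterwards.
    static boolean isAllFilesCompaction(FileSystem fs, Path mobDir, long smallFileCutoff)
        throws IOException {
      FileStatus[] files = fs.listStatus(mobDir);
      int selected = 0;
      for (FileStatus f : files) {
        if (f.getLen() < smallFileCutoff) {
          selected++;
        }
      }
      return selected == files.length;
    }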

The life cycle of MOB files, then, is as follows: they are created when the memstore is flushed, and they are deleted from the filesystem by the HFileCleaner when they are no longer referenced by a snapshot or have expired in the archive.

Conclusion

In summary, the new HBase MOB design moves MOBs out of the main I/O path of HBase while retaining most security, compaction, and snapshotting features. It caters to the characteristics of MOB operations, makes the write amplification of MOBs more predictable, and keeps latencies low for both reads and writes.

Jingcheng Du is a Software Engineer at Intel and an HBase contributor.

Jon Hsieh is a Software Engineer at Cloudera and an HBase committer/PMC member. He is also the founder of Apache Flume, and a committer on Apache Sqoop.


7 responses on “Inside Apache HBase’s New Support for MOBs”

  1. Gautam Borah

    Hi,

    I loved the idea of MOBs and I am sure it will be very useful when available in HBase 1.1 or later. I have a use case where I might use MOBs in future.

    Our system manages time-series performance metrics. We receive metrics every minute; these minute-level metrics are rolled up every 10 minutes and then again every hour. So, in practice, if a metric has 60 data points in one hour, the 10-minute rolled-up data will have 6 data points and the 1-hour rolled-up data will have a single data point.

    Time rollups are done to apply different retention to the metrics. We retain 1-minute data for 1 day, 10-minute data for 1 week, and 1-hour data for 1 year. We create 3 column families and store data accordingly, setting an appropriate TTL on each column family.

    We create a key for each metric every hour and write each minute-level data point as a column value (each qualifier name is the minute). We run batch jobs to roll up these values every 10 minutes and every hour, and write the rolled-up values to the 2nd and 3rd column families.

    The problem comes with compaction. We receive 1 TB of new metrics every day; when we roll up to the 1-hour level, we generate around 10-20 GB of data. So, for 1 year of retention we have 3-4 TB of data in the 3rd column family, which gets rewritten again and again with every major compaction.

    The metrics received by our system are never updated or modified; once rolled up to 1-hour data points, these values are read-only and have a 1-year expiry.

    Ideally, we want a compaction policy for our 3rd column family in which compaction does not touch store files older than 10 days or bigger than 20 GB (both should be configurable). Based on the age of these files, the compaction process would just delete/drop older files to reclaim HDFS space; expired files would never be read or merged during compaction.

    Please let me know if this can be achieved through the MOBs.

    Thanks,
    Gautam

  2. Jonathan Hsieh

    I think you can use the MOB feature for smaller data (e.g., some of the long-running fault-injection tests use 4-byte or 10-byte MOBs). Turning on MOB for that 3rd column family would be a way to avoid having to major compact the archival data (we add, essentially, a MOB compact and a MOB major compact level that happen even less frequently than normal major compactions).

    I haven’t heard of folks trying this yet, but it sounds OK in theory. You will be taking a perf hit on the last CF (roughly, MOB must do two reads instead of just one). If you are OK with that, it sounds worth trying.

    HTH,
    Jon

  3. Mike Wallace

    Very helpful post. Is hbase-11339 (HBase MOB) already released in HBase 1.1? Or will it be released in HBase 2.0.0?

    Secondly, we are implementing a medical support system. We have a mix of data files ranging from a few KB to greater than 300 MB; about half are greater than 10 MB and the other half are between a few KB and 10 MB. Given that, is it a good approach to store the files in HDFS and the metadata in HBase? Or should everything be stored as HBase MOBs? Which will be faster to retrieve and show the user?

  4. Jonathan Hsieh

    Mike,

    Our plans to backport MOB into upstream Apache HBase 1.1 and 1.2 have fallen through. There is a chance for it to land in 1.3, but that is speculative.

    It is definitely available in what will be upstream Apache HBase 2.0 and in CDH 5.4+’s version of HBase 1.0.0.

    For your second question, the HBase MOB feature is not designed to handle data greater than 10 MB. In some of our testing we’ve seen that it could handle it, but it would likely not be ideal. Two approaches given the current mechanisms are:

    1) Chunk the large elements into 10 MB chunks and have the client app split and reconstruct the data.
    2) Write a link to a location in HDFS where the large 300 MB object is stored as a file.

    Both solutions are likely roughly equivalent on retrieval speed. The first would likely be slower on writes.

    However, the latter solution has some tricky cases related to client or server failures (how do you keep HBase and HDFS in sync?) compared to solution 1 (you’ll know it all made it or didn’t). Solution 2 is also trickier if you want to do backups or snapshots (you have to capture both the HBase table and the HDFS data).

    Jon.

  5. Veaceslav Dubenco

    Hi
    This is a very useful feature.
    Thank you for explaining it in details in this article.
    I have two questions:
    1) In our system, most of the data to be stored as MOBs in HBase is smaller than 10 MB (most of it will be even smaller than 2-3 MB). However, there are a few files (about 5%) that are larger than 10 MB; some can even reach up to 50 MB.
    This data is never updated and is read quite rarely (each MOB might need to be read 1-2 times per month).
    We are also not planning to use this data in any Map-Reduce jobs or any other processing inside the cluster – we just need to be able to retrieve them once in a while using Java client API.
    Do you see any issue in using the HBase MOB support for this use-case?

    2) If the column family used to store the MOBs is configured with compression (e.g., SNAPPY), does the 10 MB limit refer to the original size or the compressed one?

    Thank you.

  6. Sarnath

    An Alternate Strategy
    One could use Hadoop Archives and use a HAR URL to access an individual file. This would ease the NameNode pressure and at the same time be a clean, already existing solution. One could run a batch process that adds new incoming media objects to the HAR, so the consistency requirements for MOBs would be “eventual.”
    Best,
    Sarnath, HCL Tech

  7. Rajesh

    With MOB, HBase can be used to store and archive emails. Earlier, due to write amplification, there was a performance penalty for storing both the binary file and the metadata in HBase. With MOB, I think this is quite suitable for email archival, considering mails are not that huge and most of them will be within 10 MB. I am looking for any disagreements here; if you don’t agree, then why? Thanks in advance, and thanks to the team for providing MOB support and for this brilliant blog.