This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.
As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse. Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems. Hive storage is based on Hadoop's underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs). However, a data warehouse also has to relate these event streams to application objects; in Facebook's case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.
Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables. Until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions. This operation is so costly that it can be run at most daily (leaving the warehouse with stale data), and it does not scale well as data volumes continue to shoot through the roof.
That’s where Apache HBase comes in. HBase is a scale-out table store which can support a very high rate of row-level updates over massive amounts of data. It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is also based on Hadoop, HBase interoperates with Hive in a straightforward way: HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.
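As a rough sketch of what this looks like in HiveQL (the table and column names here are illustrative, not from any actual Facebook schema), an HBase table is exposed to Hive through the storage handler and can then be joined against a native Hive table:

```sql
-- Declare a Hive table backed by HBase via the storage handler.
-- hbase.columns.mapping ties the Hive columns to the HBase row key
-- and column family:qualifier pairs.
CREATE TABLE hbase_user_profiles (user_id BIGINT, name STRING, status STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:status")
TBLPROPERTIES ("hbase.table.name" = "user_profiles");

-- A single Hive query can then join the HBase-backed table against
-- a native Hive table of event data (page_view_log is hypothetical).
SELECT p.name, COUNT(*) AS views
FROM page_view_log v
JOIN hbase_user_profiles p ON (v.user_id = p.user_id)
WHERE v.ds = '2010-06-01'
GROUP BY p.name;
```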
Putting it all together, we can solve the incremental refresh problem by keeping a near-real-time replica of MySQL data in HBase (moving only the data which has actually changed), and then combining it with the latest event data in Hive. For static data, native Hive tables are significantly more efficient than HBase for both storage and access, so periodically, we can continue to take snapshots from HBase into Hive tables for use by queries where data freshness is not paramount.
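The periodic snapshot step can then be expressed as an ordinary Hive INSERT rather than a bulk dump from MySQL. Assuming a Hive table hbase_user_profiles mapped onto the HBase replica via the storage handler, and a hypothetical native snapshot table, it might look like:

```sql
-- Materialize a point-in-time copy of the HBase-backed table into a
-- native Hive partition, which is cheaper to store and scan for
-- queries that do not need up-to-the-minute data.
INSERT OVERWRITE TABLE user_profiles_snapshot PARTITION (ds = '2010-06-01')
SELECT user_id, name, status
FROM hbase_user_profiles;
```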
Of course, the scenario described here is just one possible usage, and we look forward to hearing about other innovative applications of the technology as they are discovered by the open source community. Also, since the HBase integration work involved adding a generic storage handler interface to Hive, we are expecting to see development of more storage handler plugins for systems such as Cassandra and Hypertable soon.
The integration effort is still a work in progress, and we are only just now starting to prototype the approach at large data scale, filling in necessary features as we go. However, initial results are encouraging, and we’ll be presenting some of them at the Hadoop Summit at the end of this month. Meanwhile, HBase developers at Cloudera, Facebook, StumbleUpon, Trend Micro and elsewhere are busy adding awesome new features such as bulk load into existing HBase tables; these are likely to increase efficiency and scalability significantly.