Over the last two years, Cloudera has helped a great number of customers achieve their business objectives on the Apache Hadoop platform. In doing so, we’ve confirmed time and time again that HDFS and MapReduce provide an incredible platform for batch data analysis and other throughput-oriented workloads. However, these systems don’t inherently provide subsecond response times or allow random-access updates to existing datasets. Apache HBase, one of the new components in CDH3b2, addresses these limitations by providing a real time access layer to data on the Hadoop platform.
HBase is a distributed database modeled after BigTable, an architecture that has been in use for hundreds of production applications at Google since 2006. The primary goal of HBase is simple: HBase provides an immensely scalable data store with soft real-time random reads and writes, a flexible data model suitable for structured or complex data, and strong consistency semantics for each row. As HBase is built on top of the proven HDFS distributed storage system, it can easily scale to many TBs of data, transparently handling replicated storage, failure recovery, and data integrity.
In this post I’d like to highlight two common use cases for which we think HBase will be particularly effective:
Analysis of continuously updated data
Many applications need to analyze data that is always changing. For example, an online retail or web presence has a set of customers, visitors, and products that is constantly being updated as users browse and complete transactions. Social or graph-analysis applications may have an ever-evolving set of connections between users and content as new relationships are discovered. In web mining applications, crawlers ingest newly updated web pages, RSS feeds, and mailing lists.
With data access methods available for all major languages, it’s simple to interface data-generating applications like web crawlers, log collectors, or web applications to write into HBase. For example, the next generation of the Nutch web crawler stores its data in HBase. Once the data generators insert the data, HBase enables MapReduce analysis on either the latest data or a snapshot at any recent timestamp.
Most traditional data warehouses focus on enabling businesses to analyze their data in order to make better business decisions. Some algorithm or aggregation is run on raw data to generate rollups, reports, or statistics giving internal decision-makers better information on which to base product direction, design, or planning. Hadoop enables those traditional business intelligence use cases, but also enables data mining applications that feed directly back into the the core product or customer experience. For example, LinkedIn uses Hadoop to compute the “people you may know” feature, improving connectivity inside their social graph. Facebook, Yahoo, and other web properties use Hadoop to compute content- and ad-targeting models to improve stickiness and conversion rates.
These user-facing data applications rely on the ability not just to compute the models, but also to make the computed data available for latency-sensitive lookup operations. For these applications, it’s simple to integrate HBase as the destination for a MapReduce job. Extremely efficient incremental and bulk loading features allow the operational data to be updated while simultaneously serving traffic to latency-sensitive workloads. Compared with alternative data stores, the tight integration with other Hadoop projects as well as the consolidation of infrastructure are tangible benefits.
HBase in CDH3 b2
We at Cloudera are pretty excited about HBase, and are happy to announce it as a first class component in Cloudera’s Distribution for Hadoop v3. HBase is a younger project than other ecosystem projects, but we are committed to doing the work necessary to stabilize and improve it to the same level of robustness and operability as you are used to with the rest of the Cloudera platform.
Since the beginning of this year, we have put significant engineering into the HBase project, and will continue to do so for the future. Among the improvements contributed by Cloudera and newly available in CDH3 b2 are:
- HDFS improvements for HBase – along with the HDFS team at Facebook, we have contributed a number of important bug fixes and improvements for HDFS specifically to enable and improve HBase. These fixes include a working sync feature that allows full durability of every HBase edit, plus numerous performance and reliability fixes.
- Failure-testing frameworks – we have been testing HBase and HDFS under a variety of simulated failure conditions in our lab using a new framework called gremlins. Our mantra is that if HBase can remain available on a cluster infested with gremlins, it will be ultra-stable on customer clusters as well.
- Incremental bulk load – this feature allows efficient bulk load from MapReduce jobs into existing tables.
What’s up next for HBase in CDH3?
Our top priorities for HBase right now are stability, reliability, and operability. To that end, some of the projects we’re currently working on with the rest of the community are:
- Improved Master failover – the HBase master already has automatic failover capability – the HBase team is working on improving the reliability of this feature by tighter integration with Apache ZooKeeper.
- Continued fault testing and bug fixing – the fault testing effort done over the last several months has exposed some bugs that aren’t yet solved – we’ll fix these before marking CDH3 stable.
- Better operator tools and monitoring – work is proceeding on an hbck tool (similar to filesystem fsck) that allows operators to detect and repair errors on a cluster. We’re also improving monitoring and tracing to better understand HBase performance in production systems.
And, as Charles mentioned, we see CDH as a holistic data platform. To that end, several integration projects are already under way:
- Flume integration – capture logs from application servers directly into HBase
- Sqoop integration – efficiently import data from relational databases into HBase
- Hive integration – write queries in a familiar SQL dialect in order to analyze data stored in HBase.
If HBase sounds like it’s a good fit for one of your use cases, I encourage you to head on over to the HBase install guide and try it out today. I’d also welcome any feedback you have – HBase, like CDH3 b2, is still in a beta form, and your feedback is invaluable for improving it as we march towards stability.