Cloudera Search over Apache HBase: A Story of Collaboration
Thanks to Steven Noels, SVP of Products for NGDATA, for the guest post below.
NGDATA builds and sells Lily, the next-generation Customer Intelligence Platform that helps enterprise marketing teams collect and store customer interaction data in order to profile, segment, and present better offers. We designed Lily from the ground up to run on Apache HBase and Apache Solr. By combining these technologies with our deep marketing segmentation expertise and unique machine learning techniques, we’re able to deliver interactive data management, real-time statistical calculations, and faceted search views of customers, offers, interactions, and the permutations they each inspire.
The team at NGDATA has been working since mid-2010 on HBase triggers (or update notifications, if you prefer), which we use in Lily to keep Solr in sync with HBase, make HBase freely searchable, compute indexed views for data exploration, and feed our online machine learning engine with customer behavior information. The foundation of our platform, the Lily Data Repository (based on the combination of HBase and Solr), is being used by large banks, media companies, and pharmaceutical firms that value combining Apache Hadoop’s data storage and parallel data processing framework with ad-hoc search and discovery through Solr.
Enter Cloudera. In 2011, Cloudera added support for HBase to CDH, which aligned with and confirmed our vision of HBase as an ideal platform for capturing and processing customer interaction data. And now Cloudera has added Search to CDH, adapting and improving Solr to co-exist with the Hadoop data infrastructure. Cloudera users now have access to MapReduce for parallel data processing, Impala for ad-hoc SQL querying, and Solr Search for ad-hoc data discovery, all running on top of the same data stored in Hadoop and HBase. (Over time, we invested a great deal in making Solr work well with HBase, continuously improving and expanding our triggering and indexing mechanism, which we first released as part of Lily in mid-2010.)
Of course, we are an applications company, not an infrastructure company. Last year, we initiated a conversation with Cloudera about collaborating to make HBase triggering and indexing part of a larger, more complete offering in CDH, the platform we had already selected as our base. Cloudera and NGDATA both believe that HBase will often serve use cases of real-time data ingestion and data serving, with Search being an integral part of that. In line with the Apache spirit, we collaborated with some outstanding engineers at Cloudera on an improved triggering and indexing mechanism, based on the design DNA of our previous inventions.
In this most recent edition, we introduced an order-of-magnitude performance improvement: a cleaner, more efficient, and fault-tolerant code path with no write performance penalty on HBase. In the interest of modularity, we decoupled the triggering and indexing component from Lily, making it a stand-alone, collaborative open source project that now underpins both Cloudera Search HBase support and Lily.
This made sense for us, not just because we believe in HBase and its community, but because our customers in Banking, Media, Pharma and Telecom have uncompromising expectations for both the scalability and resilience of Lily. Delegating part of that responsibility to the infrastructure tier is efficient for us. We are very pleased with the collaboration, innovation, and quality that Cloudera has produced by working with us, and we look forward to a continued relationship that combines joint, community-oriented development with responsible stewardship of the infrastructure code base we build upon.
Our HBase Triggering and Indexing software can be found on GitHub at:
Do you have any indexing or update side-effect needs for HBase? Tell us your thoughts on this solution.