Hadoop for Everyone: Inside Cloudera Search

Categories: Hadoop Search

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

Apache Solr-based for Easy Integration and Production Maturity

Cloudera Search benefits from the same security model, data pool, and cost-efficient storage as CDH.

Cloudera chose to adopt Apache Solr into the Cloudera platform family and contribute innovative, valuable, use-case optimized integrations across the CDH platform. Solr is by far the most mature, active, and product-deployed open source search engine available. With Cloudera’s strong belief in the value of open source, it was a no-brainer to choose Solr after technical introspection of code, studying the active community, and realizing the Solr roadmap (especially with the then alpha-state SolrCloud project – later released in Solr 4).

Solr provides rich APIs and is widely used and integrated, so it was a key goal in Cloudera Search to support standard Solr APIs. Any application relying on Solr APIs should seamlessly migrate, and any third-party integration relying on standard Solr APIs should work as-is.

Robust and Scalable Index Storage in HDFS

HDFS is the storage layer for Cloudera Search. It hosts both data and indices and provides out-of-the-box replication, so if a node goes down, you will be able to continue serving searches elsewhere in the cluster (not to mention the value of cost-efficient index storage in a highly scalable infrastructure).

Scalable Batch Indexing Through MapReduce

MapReduce is the foundation of the batch-oriented indexing workload provided out of the box with Cloudera Search. In the mapper phase, the data is extracted and prepared for indexing. In each reducing phase, data is indexed, the reducers embed Apache Lucene indexers, and those index pieces are merged. The indexing and merging continue to the point that the indexed data is merged into the right number of index shards, written and stored into HDFS.

The MapReduce approach allows you to utilize all available resources assigned for MapReduce, and hence enables a very powerful scalable indexing workload. It also allows for flexibility in the sense that you can easily spin up an on-demand indexing job over a temporary data set, or one that you temporarily need to make searchable. So you don’t need to stick with expensive index storage; Hadoop for Everyone: Inside Cloudera Search.

Immediate Serving of New Indexes via GOLIVE

You can choose to apply the Cloudera-innovated GOLIVE feature and start serving new indices immediately into running Solr servers on the cluster – thanks to the platform integration. No downtime or warmup time for running search servers!

Near Real-Time Indexing at Ingest via Flume

The second option for indexing is via the scalable ingestion framework, Apache Flume. A Cloudera-contributed Flume sink allows for extraction and mapping of data on the Flume side, and immediate writing of indexable data to Solr servers running natively on HDFS. Flume also provides neat features of routing, filtering, and annotating data, as well as splitting channels so you can optionally choose to write data directly into HDFS as well as to a Solr server running on the cluster. The Flume integration allows for data to be indexed on its way into Hadoop.

The Flume ingestion framework scales very well at production sites today, and the indexing at ingest is as scalable as Flume itself — depending on how much data you plan to push through your indexers running on the cluster at a time AND how many indexers you have running, of course. So it is use-case dependent, and Cloudera is more than happy to help you and your use case throughout that deployment and design process.

Near Real-Time Indexing at Scale of HBase Records

Another option for near real-time indexing is via the SEP and index process contributed to Cloudera Search by our partner, NGDATA. It camouflages a replication server, and taps into the replication events generated by a master Apache HBase region server. Via this seemless integration, it can consume all updates and events without disturbing normal HBase activity, and hence with no performance hit. The events are then optionally configured and prepared for indexing and then sent over to the Solr servers running on HDFS, which writes the indexed data as indexes into HDFS. Thus no coprocessors involved, and yet you get a very neat solution for doing secondary indices over your data in HBase.

Simple and Re-usable Data Extraction

Cloudera Morphlines is a very simple framework to do “mini-ETL” of your inbound data to prepare it for multiple indexing workloads (be it via Flume, MapReduce, or HBase). Similar to Unix pipelines, it consists of a library that allows for simple command-based modifications as you parse a data event. Morphlines provides dynamic abilities to change the characteristics of your data on its way to becoming searchable.

Extended File Format Support

The Morphlines framework lets you dynamically change the characteristics of your data on its way to becoming searchable.

Cloudera Search the same standard file formats as Solr (or rather, that Apache Tika supports). We have also added support for Hadoop optimized file formats and compression formats such as Avro, Sequence files, and Snappy.

CDH for Everyone

Cloudera Search also comes with a simple UI provided as an application in Hue. Hue also provides easy applications to interact with other components in CDH (file browser, query applications, workflow scheduler, etc).

Production Management and Monitoring through Cloudera Manager

No solution would be production valid without proper monitoring and management capabilities. Cloudera Manager helps install, deploy, and monitor Search services on the cluster. Over time, there will be much opportunity to provide more advanced insights into search services and activities. Stay tuned!

Hardware Specifications

No specific hardware configurations are needed, but depending on your use case and real-time serving needs, you might want to consider SSD, larger RAM, and so on. That said, Solr works as-is on commodity hardware and has shown great performance on existing CDH clusters where it has been deployed in private beta.

Security and Access

Cloudera Search also provides options for user access, any of which you can consider a per-event (per-document) level access model. One option is to annotate events as they come in through Flume via Flume’s existing interceptor capability. Then, at query time, use fixed filters to control what documents – based on what annotations – get displayed to what user groups. Another option is to utilize index (or collection) level access.

We are in the process of integrating the public beta version of Cloudera Search with Kerberos, and plan to support both authentication and token delegation (or similar solution) to enable authorized users accessing only the collections they have rights to read from. In the MapReduce case, it could be added to do annotations as part of the data extraction and mapping phase, or even as a MapReduce step that sends the data over to the batch indexing process. Conveniently scheduled via an Apache Oozie workflow perhaps?

Standard Solr – Everything Solr That You Already Know

With all this goodness, it is important to remember that Cloudera Search is standard Solr: All your Solr domain knowledge will be applicable to Cloudera Search, with the advantage to perhaps learn even more about Hadoop over time to enable other workloads over the very same data. Basically, you could buy any off-shelf book on Solr 4 and get up to speed on Cloudera Search at the same time. What Cloudera Search adds is the integration points and rich documentation.

Next Steps

And of course, if you know Cloudera, there is obviously much more to come as we continue our journey forward! Stay tuned, however in the meantime, to learn more:

Eva Andreasson is a senior product manager at Cloudera, responsible for Hue, Search, and other projects in CDH.