Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Cloudera Search is available in public beta as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users.

While different frameworks appeal to different users and applications, we’ve done no small amount of engineering to enable all of them to work on the same data in the same platform. 

Cloudera Search leverages the same data as Impala and MapReduce. It can index any data stored in HDFS, and it stores its own index in the same filesystem. This is a big step forward in simplicity and usability. Hadoop users will benefit from the ease of automatically indexing and free text searching the data in their clusters. Search users will benefit from the simplicity and affordability of leveraging the widely used HDFS as a basis for storage, data protection, high availability, and disaster recovery.
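Because Cloudera Search is powered by Solr, a free-text search from a client would presumably look like an ordinary Solr request. The following is a minimal sketch, assuming Solr's standard `/select` request handler; the host name, the collection name `logs`, and the helper `build_solr_query_url` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, collection, text, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    params = {
        "q": text,     # free-text query string
        "rows": rows,  # maximum number of hits to return
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = build_solr_query_url("http://search-host:8983/solr", "logs", "disk failure")
print(url)
```

Fetching that URL with any HTTP client would return the matching documents; the sketch stops at building the request so that it runs without a cluster.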

Cloudera Search leverages the same security as the rest of the Hadoop stack. Data secured in HDFS will not be indexed or viewable by Search users who lack the proper credentials.

Just like HDFS and Apache HBase, Cloudera Search leverages Apache ZooKeeper to support index sharding and high availability.
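To make the idea of index sharding concrete, here is a simplified sketch of hash-based routing, in which each document ID maps deterministically to one shard. It is an illustration only and not Solr's actual routing code: the function `pick_shard` is hypothetical, and CRC32 stands in for whatever hash the real system uses.

```python
import zlib


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index in the range [0, num_shards)."""
    # CRC32 is a stand-in hash chosen for this illustration only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Hypothetical document IDs; the same ID always lands on the same shard.
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 4))
```

In a real deployment the list of shards and their live replicas is cluster state, which is the kind of information a coordination service such as ZooKeeper keeps consistent.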

We’ve also built an exciting new integration between MapReduce and Search we call “push to go live.” With it, outputs of MapReduce jobs can be automatically merged into live Solr indices.

Naturally, Cloudera Search can be deployed, configured, monitored, and automated via Cloudera Manager so users and customers get the benefit of a common management model.

We’ve developed many more ways in which Search integrates with the rest of CDH. Search can index streaming Apache Flume feeds. In the future, Search will also be able to index Apache Hive and HBase tables, and Search results will seamlessly feed Impala queries.

In short, we’ve tried to take what was once a relatively complicated and involved freestanding system with its own hardware and operational model and turn it into a feature of a larger, more ubiquitous platform: CDH. We think this integrated approach represents a big step forward for users of Solr as well as Hadoop.

Because we plan to incorporate Search into CDH, we intend to fully support our customers that run it in a production setting. Consequently, part of our development effort for Search has been to convince key committers and PMC members of the Lucene and Solr communities to join Cloudera so we can more easily support every component at a code level. I’m pleased that Cloudera continues to be a place where developers of important new open source projects want to come and work.

I also want to thank our open source collaborators. Obviously, we’re building on the years of good work of the Lucene and Solr communities. The work we’ve done for Cloudera Search has already resulted in dozens of new patches for the project. In addition, I’d like to thank Aaron McCurry, whose work on Apache Blur (incubating) inspired the HDFS/Solr index integration. Thanks also to the team at NGDATA, whose Solr/HBase integration work we will be incorporating into Cloudera Search in an upcoming release.

I take some small additional pride in the fact that this is arguably the most effective convergence of MapReduce, SQL, and Search that we’ve seen in the data management industry to date. For years, databases attempted to provide search as a feature in their platforms, but this approach was largely abandoned in favor of acquiring independent search products that require their own infrastructure, integration, and expertise. Hadoop’s flexibility has made it a much better supporting platform for search and consequently a much more general-purpose platform than relational databases. No wonder the center of gravity for data management has shifted toward Hadoop.

Doug Cutting is Cloudera’s chief architect, a founder of the Apache Lucene and Apache Hadoop projects, and the current chair of the Apache Software Foundation.

Third-party articles about Cloudera Search:
– Wired: “Open Sourcers Build ‘Google Search for Big Data’” (6/4/2013)
– The Register: “Cloudera brings Hadoop to the masses with Solr search” (6/4/2013)
– ZDNet: “Search for Big Data: Cloudera and Lucene get hitched” (6/4/2013) 
– GigaOm: “Cloudera adds search to Hadoop distro and says it’s just getting started” (6/4/2013)
– CMSWire: “Cloudera Unveils Big Data Search, No Special Training Required” (6/4/2013) 
– CRN: “Cloudera Adds Search Capabilities To Its Hadoop Big Data Platform” (6/4/2013) 
– ReadWriteWeb: “Searching Hadoop Data Just Got A Lot Easier” (6/5/2013)