Cloudera Developer Blog · Search Posts
Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!
Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.
About three and a half years ago, I had an experience on a project that showed me just how powerful the fault tolerance characteristics of Hadoop are. This is what made me start to think about the core design behind Blur.
CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest large volumes of widely varied data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools for easier access have emerged, so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.
However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle to find a time-efficient way to interact with, and explore, the data in Hadoop or HBase.
Helping these users find the data they need without Java, SQL, or scripting languages inspired the integration of full-text search, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. Running search on the same platform as other workloads is the key: you no longer have to move data around to satisfy your business needs, because data and indices are stored in the same scalable and cost-efficient platform. You can not only find what you are looking for, but also actually “do” things with your data within the same infrastructure. Cloudera Search brings simplicity and efficiency to large and growing data sets, enabling mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!
In version 2.4 of Hue, the open source Web UI that makes Apache Hadoop easier to use, a new app was added in addition to more than 150 fixes: Search!
Using this app, which is based on Apache Solr, you can now search across Hadoop data just like you would do keyword searches with Google or Yahoo! In addition, a wizard lets you tweak the result snippets and tailors the search experience to your needs.
The new Hue Search app uses the regular Solr API under the hood, yet adds a remarkable list of UI features that make searching data stored in Hadoop a breeze. It also integrates with other Hue apps such as File Browser, so you can look at the index file in a few clicks.
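To give a rough idea of what "the regular Solr API" means here, the sketch below builds the kind of HTTP query URL that Solr's standard select handler accepts. It is only an illustration, not Hue's actual code: the host, port, collection name ("collection1"), and field name ("text") are assumptions, and the helper function is hypothetical. The query parameters themselves (q, wt, rows, hl, hl.fl) are standard Solr parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_select_url(base_url, collection, keywords, rows=10, highlight_field=None):
    """Build a URL for Solr's /select handler (hypothetical helper, for illustration).

    base_url and collection are assumptions about a local Solr install.
    """
    params = {
        "q": keywords,   # the keyword query, as typed into a search box
        "wt": "json",    # ask Solr for a JSON response
        "rows": rows,    # maximum number of documents to return
    }
    if highlight_field:
        # Highlighting is what produces the result snippets a search UI shows.
        params["hl"] = "true"
        params["hl.fl"] = highlight_field
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Assumed local Solr endpoint and collection; adjust to your own cluster.
url = build_solr_select_url(
    "http://localhost:8983/solr", "collection1", "hadoop search", highlight_field="text"
)
print(url)
```

Fetching that URL from a running Solr instance would return matching documents as JSON, which is the kind of response a UI layer can then render as results and snippets.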
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and the latest release includes two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.
Cloudera Search integrates Apache Solr with the rest of the platform, to let you do full-text search of the data stored in your cluster, just like you would with an online search engine! Cloudera Impala, on the other hand, lets you execute SQL queries against that same data, on the same platform, and get results back fast enough to interactively explore and analyze. With both workloads available on the same cluster, you no longer face the pain of moving large amounts of data around.
Earlier this week, we hosted The Cloudera Forum to reveal Cloudera’s “Unaccept the Status Quo” vision and to announce the public beta launch of Cloudera Search. The event featured a panel discussion between representatives from four companies that are embracing the latest big data innovations, moderated by our own CEO Mike Olson. Those are the companies I’d like to highlight in this week’s spotlight, for obvious reasons. The panelists were… (drumroll, please):
What do you do at Cloudera (and in which Apache project(s) are you involved)?
I’m a software engineer on the Search team. I’ve been involved in the Apache Lucene community since 2006 and Apache Solr since around 2009. I spend a lot of time adding features to Solr and fixing bugs, as well as working on improving Solr integration with the rest of the Hadoop ecosystem. I kind of think of myself as a “distributed search guy” at the moment.
The news this morning focused on the launch of Cloudera Search, an exciting new capability for our platform that was much anticipated by our customers and engineers. Released at the same time is a new version of Cloudera Manager (4.6).
Cloudera Manager 4.6 includes a number of enhancements as well as improvements in quality and usability. (A follow-on blog post will do a deep dive on the new features and functions.) Most notable in Cloudera Manager 4.6 is that the free version (included in Cloudera Standard) is greatly enhanced. Cloudera Standard now includes monitoring, health checks, events & alerts, log search, Kerberos automation, and multi-cluster support.
There are a few motivations for this update:
One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.
Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.
Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.