Cloudera Engineering Blog · Search Posts
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.
Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
After three months of public beta, and months of private beta before that, Cloudera Search is now generally available. At this milestone, Cloudera has contributed its innovations and IP around the integration of Apache Solr and Apache Lucene with CDH back to the respective upstream projects. The GA of Cloudera Search also signifies the completion of a vast amount of hardening, integration, simplification, and packaging work.
Features of Cloudera Search 1.0 include:
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it from cluster installation, deploying configurations, restarting daemons to monitoring each cluster component. Starting with version 4.6, the manager supports the integration of Cloudera Search, which is currently in Beta state. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.
Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!
Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.
CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.
However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.