Cloudera Developer Blog · Search Posts
You can use Hue and Cloudera Search to build your own integrated Big Data search app.
In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing Data in Cloudera Search
Indexing data in Cloudera Search involves :
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
In our example, the cluster uses a CDH 4.4 parcel and is running Apache ZooKeeper, HDFS, and Apache HBase services. (Parcels are a really useful way to deploy new software and do painless upgrades via Cloudera Manager.)
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.
Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.
We wanted to make sure that one simple scaling strategy that has been commonly used in the past for large amounts of time-series data would be fairly simple to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on — data that is continuously being generated and is easily associated with a current timestamp.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.
A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
After three months of public beta, and months of private beta before that, Cloudera Search is now generally available. At this milestone, Cloudera has contributed its innovations and IP around the integration of Apache Solr and Apache Lucene with CDH back to the respective upstream projects. The GA of Cloudera Search also signifies the completion of a vast amount of hardening, integration, simplification, and packaging work.
Features of Cloudera Search 1.0 include:
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it from cluster installation, deploying configurations, restarting daemons to monitoring each cluster component. Starting with version 4.6, the manager supports the integration of Cloudera Search, which is currently in Beta state. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
Furthermore, I want to introduce a small level of automation using the tool Ansible, to avoid performing the same manual steps again and again – thereby being able to replay them whenever you want, or to just use it for any upcoming cluster installations.
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.
Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.