In December 2012, we described how Cloudera Support Interface (CSI), an internal application built on CDH that drastically improves Cloudera’s ability to support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new CSI capabilities that have made Cloudera Support even more responsive for customers:
- How Cloudera Impala has turbo-charged CSI with support for real-time log file analysis and visualization
- How Cloudera Search enables interactive data exploration of multiple sources simultaneously from within CSI
So, let’s explore these use cases in detail.
Log File Visualization with Impala
As you may recall from our previous blog post on this topic, Cloudera Manager has a feature that allows a user to send diagnostic bundles (which are fairly large files) to Cloudera Support to help us diagnose issues more quickly and comprehensively. Cloudera Support ingests these bundles into HDFS, processes them via a thorough and robust data pipeline, and then visualizes the data for our customer operation engineers (COEs) via the CSI GUI. (COEs are Cloudera engineers who work full-time on technical support issues.)
Before the availability of Impala, COEs would often ask the customer to manually provide individual logs to facilitate troubleshooting. This process often involved some back-and-forth with the customer, lengthening overall resolution time. Furthermore, without an easy way to view multiple logs together in a single view, or to window logs by timeframe or other variables, log analysis itself was slow, which delayed resolution even further.
With the addition of Impala to the CSI suite in early 2013, logs are now included in bundles and ingested into HDFS by default, and that data is accessible via interactive SQL queries. Adding those queries to the back-end of the CSI GUI, just like a BI tool, enables a range of new log visualization features:
- Each log has a timeline (with histogram) that the COE can narrow into a specific time span as needed.
- Logs load on scroll: 500 lines are exposed at a time, and more data loads automatically as the user scrolls the log window.
- COEs can view logs side-by-side to compare two logs across a set time span.
- A search function that creates Impala queries on the back-end lets COEs do free-text searches for specific words in the logs they are viewing, or across all logs in the bundle.
The side-by-side window function in CSI. COEs can adjust date/time ranges via slider.
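CSI's actual queries are not shown in this post, but the features above (time windowing, 500-line paging, and free-text search) can be sketched as a small query builder. This is a minimal illustration only: the table and column names (`log_lines`, `bundle_id`, `log_name`, `ts`, `line_no`, `message`) and the `%(name)s` placeholder style are assumptions, not the real CSI schema or client.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

PAGE_SIZE = 500  # lines exposed per scroll step, as described in the list above


def build_log_page_query(
    bundle_id: str,
    log_name: str,
    start: datetime,
    end: datetime,
    page: int = 0,
    keyword: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (sql, params) for one page of one log inside a time window.

    All table and column names here are hypothetical.
    """
    if page < 0:
        raise ValueError("page must be non-negative")
    if end < start:
        raise ValueError("end must not precede start")

    where = [
        "bundle_id = %(bundle_id)s",
        "log_name = %(log_name)s",
        "ts BETWEEN %(start)s AND %(end)s",
    ]
    params = {
        "bundle_id": bundle_id,
        "log_name": log_name,
        "start": start.isoformat(sep=" "),
        "end": end.isoformat(sep=" "),
    }

    if keyword:
        # Case-insensitive substring match. Any % or _ inside the keyword
        # would act as a LIKE wildcard; escaping is omitted for brevity.
        where.append("lower(message) LIKE %(pattern)s")
        params["pattern"] = f"%{keyword.lower()}%"

    # A stable ORDER BY is what makes LIMIT/OFFSET paging deterministic.
    sql = (
        "SELECT ts, line_no, message FROM log_lines "
        f"WHERE {' AND '.join(where)} "
        "ORDER BY ts, line_no "
        f"LIMIT {PAGE_SIZE} OFFSET {page * PAGE_SIZE}"
    )
    return sql, params
```

A GUI back-end along these lines would request `page=0` when a log is opened and increment `page` each time the user scrolls to the end of the loaded lines. A side-by-side view would simply issue the same query for two values of `log_name` over the same time window.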
A given diagnostic bundle, typically up to 80GB uncompressed, can be processed for visualization at interactive speeds. Thanks to an intelligent partitioning strategy and a columnar file format, our 30+ billion row Impala table can be sliced and diced within seconds. With that short processing time and the new ability to do interactive visualization, COEs can analyze and compare log information much more easily and quickly than before.
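To see why partitioning matters at this scale, consider a toy model of partition pruning. The post does not say which partition keys CSI uses; the sketch below assumes, purely for illustration, that rows are partitioned by bundle ID and day, so a query for one bundle and one day never touches any other partition.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class PartitionedLogTable:
    """Toy in-memory table partitioned by (bundle_id, day).

    The partition keys are an assumption made for this example.
    """

    def __init__(self) -> None:
        self.partitions: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def insert(self, bundle_id: str, day: str, message: str) -> None:
        self.partitions[(bundle_id, day)].append(message)

    def scan(self, bundle_id: str, days: Iterable[str]) -> Tuple[List[str], int]:
        """Return (messages, partitions_read) for one bundle and a set of days."""
        messages: List[str] = []
        partitions_read = 0
        for day in days:
            # Look up only the partitions named by the predicate;
            # every other partition is skipped without being read.
            part = self.partitions.get((bundle_id, day))
            if part is not None:
                partitions_read += 1
                messages.extend(part)
        return messages, partitions_read
```

In a real columnar table the same idea applies at the file level: the engine reads only the files under the matching partitions, and within those files only the columns the query references.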
Data Exploration with Search
The other significant enhancement to our internal support processes is a new application inside CSI called Monocle, which lets COEs do keyword searches of multiple information sources simultaneously. Furthermore, thanks to Cloudera Search, all the data involved is ingested, processed, and indexed on the same CDH cluster as everything else.
COEs have a wealth of information as reference material — including JIRAs, support cases, mailing lists, the support knowledge base, and most recently, discussion forums. But because this information lives in different places, prior to Monocle, COEs doing exploratory research had no choice but to do a series of cumbersome, independent searches using different tools.
Today, with the Cloudera Search-based Monocle in place, COEs can work smarter and more efficiently by searching across all those sources from a single UI. Monocle helps our support team find relevant content quickly and easily, ultimately leading to faster response for support issues.
Monocle lets COEs do keyword searches across multiple sources from a single GUI.
For those of you interested in implementation details, here’s how the Monocle data flow works:
- Data is fetched from multiple sources, with some stored in HBase.
- Replication is enabled in HBase and the Lily HBase Indexer registers as a peer cluster.
- When updates are replicated, Lily HBase Indexer creates Solr documents and indexes them in Cloudera Search.
- For data that is not available in HBase (such as PDFs), sources are periodically scanned, and then data is extracted using Apache Tika and indexed using Cloudera Morphlines.
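The replication-driven path in the steps above can be pictured with a small simulation. This is not the Lily HBase Indexer API; it is a hypothetical stand-in that shows the general idea of turning a replicated row update into a flat, Solr-style document keyed by a unique ID, so that a later update to the same row replaces the earlier document instead of duplicating it. The function name, the `source` field, and the ID format are all assumptions.

```python
from typing import Dict


def row_update_to_document(
    source: str, row_key: str, cells: Dict[str, str]
) -> Dict[str, str]:
    """Map one replicated row update to a flat index document.

    `cells` maps "family:qualifier" column names to values. The document
    layout is hypothetical, chosen only for this illustration.
    """
    doc = {"id": f"{source}:{row_key}", "source": source}
    for column, value in cells.items():
        family, _, qualifier = column.partition(":")
        # Use the qualifier as the field name; fall back to the bare
        # column name when no family prefix is present.
        doc[qualifier or family] = value
    return doc


class InMemoryIndex:
    """Stand-in for a search index: documents are upserted by unique id."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, str]] = {}

    def add(self, doc: Dict[str, str]) -> None:
        self.docs[doc["id"]] = doc


index = InMemoryIndex()
index.add(row_update_to_document("jira", "ABC-1", {"info:summary": "NN fails"}))
index.add(row_update_to_document("jira", "ABC-1", {"info:summary": "NN fails on start"}))
```

After both updates the index holds a single document for `jira:ABC-1` carrying the latest summary, which is the behavior one would want from near real-time indexing of a mutable source.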
Monocle features include:
- High availability, scalability, and extensibility backed by SolrCloud, using HDFS for storage and Cloudera Manager for monitoring
- Near real-time indexing of content sources
- Automatic support for new collections and dynamic reflection of changes in schema
- Automatic faceting of all indexed data, which provides a quick overview for search-result classification and filtering of results
- Presentation of relevant search terms based on indexed content, offering quick context with links back to original source
- Range filtering of search results
- Cross-collection queries and interspersed results along with matching scores
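Three of these features (faceting, range filtering, and interspersed cross-collection results) are easy to picture with plain data structures. The sketch below is not Monocle's code and does not call Solr; it assumes each collection has already returned a list of hits with a numeric `score`, and the field names (`id`, `score`, `year`) are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

Hit = Dict[str, Any]


def merge_hits(per_collection: Dict[str, List[Hit]]) -> List[Hit]:
    """Intersperse hits from several collections, best score first."""
    merged: List[Hit] = []
    for collection, hits in per_collection.items():
        for hit in hits:
            # Tag each hit with its origin so it can be faceted on later.
            merged.append({**hit, "collection": collection})
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged


def facet_counts(hits: List[Hit], field: str) -> Counter:
    """Count how many hits carry each distinct value of `field`."""
    return Counter(h[field] for h in hits if field in h)


def range_filter(hits: List[Hit], field: str, low: Any, high: Any) -> List[Hit]:
    """Keep hits whose `field` lies in the inclusive range [low, high]."""
    return [h for h in hits if field in h and low <= h[field] <= high]
```

One caveat with this simple merge: relevance scores computed independently by different collections are not guaranteed to be directly comparable, so a production system may need to normalize them before interleaving results.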
Similar to the new Impala-powered log file analysis capabilities in CSI, Monocle helps COEs get to the bottom of customer issues much more quickly than in the past by radically streamlining internal support processes.
As you can see, Cloudera Support’s ability to internally take advantage of differentiating features in our platform via CSI – in this case, Impala and Search – directly translates into a better customer experience. And as the number of differentiating features expands, that experience will only get better.
Krista Mizusaki is a program manager in Cloudera Support.