Successful cluster administration can be very difficult without a real-time view of the state of the cluster. Solr itself does not provide aggregated views of its state or any historical usage data, both of which are necessary to understand how the service is used and how it is performing. Knowing the throughput and capacities not only helps detect errors and troubleshoot issues, but is also useful for capacity planning.
Questions may arise, such as:
- What is the size of my cluster and each collection? How fast does it grow?
- What is the query rate on my cluster and collections?
- How many documents do I have in each collection?
- What is the performance of my indexers?
- Are my shards balanced?
Answering questions like these requires detailed and historical collection of metrics.
With Cloudera Manager, users have been able to deploy Solr services on CDH and monitor their health since Solr was first integrated. However, the initial monitoring capabilities did not fully answer the questions above for large Solr cluster deployments, which often serve multi-tenant applications under an SLA on CDH and Cloudera Search.
In this post we present the new and improved capabilities available in Cloudera Manager 5.12 to monitor and troubleshoot Cloudera Search clusters – beyond just server health. We will demonstrate how to access existing charts and set up dashboards and alerts. But first, let’s review the existing powerful capabilities in Cloudera Manager (CM) that collect rich metrics and allow you to create ad-hoc insight-providing dynamic dashboards.
Metrics in Cloudera Manager
Cloudera Manager continuously monitors and collects usage and performance metrics from Solr (and other services running on the shared-storage cluster). The collected metrics are accessible through the Chart Builder feature in Cloudera Manager, where you can build charts and create alerts based on them. Cloudera Manager provides predefined charts with a handful of essential metrics about the cluster’s health that will be demonstrated later in this blog post.
The metrics are collected (and documented) at the service, server, shard, and replica levels, depending on the nature of the metric. For example, the JVM heap size is a server-level metric, whereas the query request rate is measured at the core/replica level.
Cloudera Manager already supports creation of ad-hoc queries on collected metrics. The syntax of the query language is SQL-like, making it easy to learn to run custom queries.
We can run custom queries by selecting Charts > Chart Builder from the Cloudera Manager menu. In the Chart Builder interface we enter the query. For example, we can enter:
select select_requests_rate
This query shows the historical request rate of every replica of every collection on every Solr service that is being managed. We can filter these statistics to a specific service or collection:
select select_requests_rate where serviceName="SOLR-1"
select select_requests_rate where solrCollectionName="collection1"
The filters will select only those replicas that belong to the given service or collection. However, if we want to see an aggregated total of the request rates, we need to use a different approach.
Cloudera Manager creates artificial aggregated metrics for your convenience. The aggregated metrics are summaries of metrics over a certain grouping. For example, the metric select_requests_rate is aggregated into total_select_requests_rate_across_solr_replicas, a sum of select_requests_rate over a shard, collection, or service. We can select the desired aggregation by filtering metrics by category. The example below returns the aggregated select_requests_rate for each shard within the given collection:
select total_select_requests_rate_across_solr_replicas where solrCollectionName="collection1" and category="SOLR_SHARD"
The following query shows the total select_requests_rate for each collection:
select total_select_requests_rate_across_solr_replicas where category="SOLR_COLLECTION"
We can also get the sum of all select_requests_rate values for the whole service using this query:
select total_select_requests_rate_across_solr_replicas where category="SERVICE"
By using the category filter, we can specify the aggregation level for the metrics. You may want to experiment with other metrics listed in the documentation to find the ones that fit your specific needs.
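Because these queries differ only in the metric name and the filters, they are easy to generate programmatically. The following Python sketch assembles tsquery strings of the form used in this section. The helper name build_tsquery is hypothetical and introduced only for illustration; it is not part of Cloudera Manager, and it assumes that multiple predicates are joined with "and".

```python
def build_tsquery(metric, **filters):
    """Assemble a tsquery string: select <metric> [where <attr>="<value>" and ...].

    Hypothetical helper for illustration; it only builds text and does not
    contact Cloudera Manager.
    """
    query = f"select {metric}"
    if filters:
        predicates = " and ".join(
            f'{attribute}="{value}"' for attribute, value in filters.items()
        )
        query += f" where {predicates}"
    return query


AGGREGATED_METRIC = "total_select_requests_rate_across_solr_replicas"

# One query per aggregation level discussed in this section.
queries = {
    level: build_tsquery(AGGREGATED_METRIC, category=level)
    for level in ("SOLR_SHARD", "SOLR_COLLECTION", "SERVICE")
}

for level, query in queries.items():
    print(f"{level}: {query}")
```

The generated strings can be pasted into Chart Builder as they are. Keeping the metric name and the category in separate variables makes it straightforward to produce the same set of queries for other metrics.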
You can learn more about tsquery from the documentation.
Predefined charts for Solr
In Cloudera Manager 5.12, we introduce a set of new and improved charts for monitoring Solr services. The Solr service status page in this release contains 8 essential charts:
- Request Rate: These three charts are summaries and statistical distributions of request rates.
- Average Response Time: These three charts display the distribution of average response times.
- Index Size: The aggregated index size of the cluster, along with the distribution of index sizes among all cores.
- Total Documents: The aggregated number of documents, along with the distribution of document counts among all cores.
These 8 new charts help administrators quickly get an overview of the cluster performance.
Collection Statistics Page
The Collection Statistics page under the Solr service shows more detailed metrics about the cluster and collections. Similarly to the new service status page charts, request rates, average response times, and index sizes are shown at the collection level. These detailed charts are helpful for monitoring the performance and usage of each collection.
In Cloudera Manager 5.12, this page got a minor facelift. The histograms that showed only the current state have been replaced with historical charts that help visualize changes over time.
We can also see collection-level statistics. By selecting a specific collection from the side menu, we can access a more detailed view of that collection.
The collection view shows charts for the selected collection only. Index size and document count are displayed at the shard level alongside the totals. This page also has a Cache Hit Ratio chart showing the historical cache efficiency of the document cache, field value cache, filter cache, and query result cache.
Monitoring and Troubleshooting
Let’s take a look at a few scenarios where we can troubleshoot or detect unexpected usage using the charts and metrics introduced above.
Sudden request rate change
A significant drop in the total request rate could indicate a malfunctioning client application, network issue, or misconfiguration. A sudden drop in the update request rate can indicate an indexer job error. We can use the improved Collection Statistics page to see which collection has a recent decrease in request rate.
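To make "sudden drop" concrete, here is a minimal Python sketch that compares the most recent samples of a request-rate series with the earlier baseline. The function name, the window size, and the 50% threshold are assumptions chosen for illustration, and the sample values are made up; this is not how Cloudera Manager evaluates its charts.

```python
from statistics import mean


def has_sudden_drop(rates, recent_points=3, drop_ratio=0.5):
    """Return True if the mean of the last `recent_points` samples is below
    `drop_ratio` times the mean of all earlier samples."""
    if len(rates) <= recent_points:
        return False  # not enough history to form a baseline
    baseline = mean(rates[:-recent_points])
    recent = mean(rates[-recent_points:])
    return baseline > 0 and recent < drop_ratio * baseline


# Hypothetical requests-per-second samples, oldest first.
steady = [120, 118, 125, 121, 119, 122, 117, 120]
dropped = [120, 118, 125, 121, 119, 30, 25, 28]

print(has_sudden_drop(steady))
print(has_sudden_drop(dropped))
```

Averaging over a short window rather than looking at a single sample avoids flagging one quiet data point as an incident.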
Average response time increase
One of the key indicators of performance is the time it takes for a request to be served. If there is a significant increase in the average response times, it might indicate a performance bottleneck or other malfunction. On the service status page charts, the maximum value of the average response times indicates the core/replica that is performing the slowest on average. By expanding the chart (double-arrow at the top right corner) and selecting a point on the maximum line, we can see which replica is reporting that average response time. We can then investigate the given collection to determine whether the performance drop is localized to the replica or collection, or if it affects the entire cluster.
For example, Figure 4 shows that replica collection2_shard2_replica2 has a higher average response time on updates than the rest of the cluster.
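The same check can be expressed in a few lines of Python: given the average response time of each replica, list the replicas that are much slower than the typical one. This is only a sketch. The function name, the factor of 2, and all numeric values are assumptions, and every replica name other than collection2_shard2_replica2 is made up.

```python
from statistics import median


def outlier_replicas(avg_response_times, factor=2.0):
    """Return the replicas whose average response time exceeds `factor`
    times the median across all replicas, slowest first."""
    if not avg_response_times:
        return []
    typical = median(avg_response_times.values())
    slow = {
        name: seconds
        for name, seconds in avg_response_times.items()
        if seconds > factor * typical
    }
    return sorted(slow, key=slow.get, reverse=True)


# Hypothetical average update response times in seconds.
update_avg_seconds = {
    "collection2_shard1_replica1": 0.12,
    "collection2_shard1_replica2": 0.11,
    "collection2_shard2_replica1": 0.13,
    "collection2_shard2_replica2": 0.48,
}

print(outlier_replicas(update_avg_seconds))
```

A short list of outliers suggests a problem localized to those replicas. An empty list while all response times are high points to a slowdown that affects the entire cluster.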
On the Collection Statistics page, under each collection, we can see the index size and document count at the shard level. In an ideal situation, the document count (and therefore, the index size) is evenly distributed. If the shard-key contains an account ID (or any other unique prefix), the distribution of documents among shards can correlate with the document count of that account ID. Thus, if an account ID has significantly more documents than others, the shard it belongs to will also have more documents. That could lead to uneven resource utilization and performance issues. The collection-level chart of index sizes and document counts can help you easily identify unbalanced shards.
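One simple way to quantify shard balance is to divide the document count of the largest shard by the average count across shards. The Python sketch below does this; the function name, the shard names, and the counts are hypothetical, and the ratio is just one possible measure rather than something Cloudera Manager reports.

```python
def imbalance_ratio(doc_counts):
    """Return the largest shard's document count divided by the mean count.

    A value of 1.0 means perfectly balanced shards; larger values mean one
    shard holds disproportionately many documents.
    """
    counts = list(doc_counts.values())
    if not counts:
        return 1.0
    average = sum(counts) / len(counts)
    return max(counts) / average if average else 1.0


# Hypothetical per-shard document counts.
balanced = {"shard1": 1000, "shard2": 1000, "shard3": 1000, "shard4": 1000}
skewed = {
    "shard1": 1_000_000,
    "shard2": 1_050_000,
    "shard3": 980_000,
    "shard4": 2_970_000,  # holds the documents of one very large account
}

print(imbalance_ratio(balanced))
print(imbalance_ratio(skewed))
```

In the skewed example, shard4 holds almost twice the average number of documents, which is the kind of pattern a dominant account ID in the shard key would produce.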
Setting up alerts on Solr metrics
Proactive monitoring of a Solr service is much easier using metrics and predefined charts. In addition to periodically viewing charts, it is useful to set up alerts and notifications for certain scenarios that we want to monitor.
Cloudera Manager supports triggers that let you track performance and perform predefined actions when conditions are met. You can create any tsquery on any metrics and set the health state depending on the returned values. For example, you can create a trigger that sets the cluster state to Concerning whenever the average response time goes above 2 seconds.
You can create triggers using custom queries, or you can use queries from existing charts. In this example, we are going to create a trigger based on the Select Average Response Time chart found on the Solr service status page.
- Click on the ‘gear’ icon in the corner of the Select Average Response Time chart.
- Enter a Name for the trigger (for example, ‘Slow select requests’).
- Modify the pre-populated trigger formula such that the query only returns values over 0.5 seconds.
- Click Create Trigger.
The newly created trigger will set the Solr service state to Concerning whenever the highest average response time among replicas exceeds 0.5 seconds.
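The condition this trigger expresses can be modeled in a few lines of Python. This is a simplified sketch of the logic only: Cloudera Manager evaluates the trigger itself, the function name and sample values are hypothetical, and returning "Good" when the trigger does not fire is an assumption made for the example.

```python
def service_health(avg_response_seconds, threshold_seconds=0.5):
    """Return "Concerning" when the highest per-replica average response
    time exceeds the threshold, and "Good" otherwise."""
    worst = max(avg_response_seconds.values(), default=0.0)
    return "Concerning" if worst > threshold_seconds else "Good"


# Hypothetical average select response times in seconds, per replica.
print(service_health({"replica1": 0.20, "replica2": 0.31}))
print(service_health({"replica1": 0.20, "replica2": 0.74}))
```

Because the condition looks at the maximum across replicas, a single slow replica is enough to change the state of the whole service.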
Creating Custom Dashboards
The default diagrams show essential information about the Solr service. We can also create custom dashboards by selecting the metrics that best suit our requirements.
As a demonstration, we are going to create a Service Level dashboard that displays 99th percentile response times for different query types on each collection.
- Select Charts > Dashboards in Cloudera Manager.
- Enter a dashboard name (for example: ‘Solr Service Level’).
- Click the Create button.
- Click the View Dashboard link.
This creates an empty dashboard. We are going to add charts with the *_99th_pc_request_time_across_solr_replicas metric that measures the slowest request time of the fastest 99% of requests. In other words, only 1% of the requests are served slower than the reported value. You can also use 75th, 95th, or 99.9th percentiles.
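To illustrate what a percentile means, the following Python sketch computes it over a list of request times using the nearest-rank definition. The sample data is made up, and the method used by Solr and Cloudera Manager to estimate percentiles may differ from this exact calculation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# 200 hypothetical request times in milliseconds: 198 fast, 2 slow.
request_times_ms = [20] * 150 + [45] * 48 + [900, 1500]

print(percentile(request_times_ms, 99))   # 45
print(percentile(request_times_ms, 100))  # 1500
```

Here the 99th percentile is 45 ms even though the slowest request took 1500 ms: only 2 of the 200 requests (1%) were slower than the reported value. This is why percentiles describe the experience of most users better than the maximum or the average does.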
- On the Service Level dashboard page, click the Add Chart button.
- Enter the following tsquery:
select select_99th_pc_request_time_across_solr_replicas where category="SOLR_COLLECTION"
- Enter a title: Select Request Time 99th Percentile
- Optionally, you can select under facets All Separate to have one chart per collection.
- Click Save.
- Repeat steps 1-5, changing the metric prefix from select_ to update_ and changing the title accordingly.
We now have a custom dashboard displaying Solr service level metrics for select and update requests.
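Since the two charts differ only in the metric prefix and the title, the queries and titles can also be generated. The Python sketch below prints the title and tsquery for each chart; the function and variable names are hypothetical, and the charts themselves are still added through the dashboard page.

```python
REQUEST_TYPES = ("select", "update")


def percentile_query(request_type, category="SOLR_COLLECTION"):
    """Build the 99th percentile request time tsquery for one request type."""
    metric = f"{request_type}_99th_pc_request_time_across_solr_replicas"
    return f'select {metric} where category="{category}"'


# Map each chart title to the tsquery to enter for it.
charts = {
    f"{request_type.capitalize()} Request Time 99th Percentile":
        percentile_query(request_type)
    for request_type in REQUEST_TYPES
}

for title, query in charts.items():
    print(f"{title}\n  {query}")
```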
Cloudera Manager is a powerful tool for making efficient use of the metrics exposed by Solr and by other services running in a CDH deployment. Custom CM queries let you tailor monitoring to specific requirements.
In this blog post you have learned how to use custom CM queries to aggregate metrics, and you have seen the new predefined monitoring and statistics charts available for more granular Cloudera Search monitoring and troubleshooting. You have also seen how these charts help in specific troubleshooting scenarios and how to set up triggers. We hope these new capabilities will help you in your production environment, and we look forward to your feedback!
If you would like to learn more about the subject, you can read the corresponding documentation.
- Documentation of Solr metrics: service, server, shard, replica
- tsquery language documentation
- Cloudera Manager Trigger Use Cases
- Cloudera Manager – Dashboards