How-to: Backup and disaster recovery for Apache Solr (part I)


Cloudera Search (that is, Apache Solr integrated with the Apache Hadoop ecosystem) now supports (as of CDH 5.9) a backup and disaster recovery capability for Solr collections.

In this post we will cover the basics of the backup and disaster recovery capability in Solr and hence in Cloudera Search. In the next post we will cover the design of the Solr snapshots functionality and its integration with the Hadoop ecosystem as well as public cloud platforms (e.g. Amazon AWS).

Data availability is critical to most organizations and end-users, and lots of production data is served through Cloudera Search as a mission-critical service. Whether they run Cloudera Search or stand-alone Solr, organizations have long found it challenging to reduce risk when change is on the way: upgrades, application development, configuration changes, etc.

The backup and disaster recovery capability in Solr addresses a number of concerns related to storing business-critical data in Solr. Specifically:

  • How to recover Search indexes from catastrophic scenarios, such as loss of index data due to accidental or malicious administrative actions (e.g. deletion of a collection) or data manipulation errors (e.g. deletion of one or more documents)?
  • How to migrate existing Solr indexes to a different cluster (on-premises or in the cloud)?
  • How to mitigate the risk during Solr cluster upgrades?

The backup mechanism allows an administrator to create a physically separate copy of index files and configuration metadata for a Solr collection. Any subsequent change to a Solr collection state (e.g. removing documents, deleting index files or changing collection configuration) has no impact on the state of this backup. As part of disaster recovery, the restore operation creates a new Solr collection and initializes it to the state represented by a Solr collection backup.

The backup operation consists of the following steps:

  1. Capture a consistent, point-in-time view of the underlying Apache Lucene indexes corresponding to the Solr collection being backed up. In Lucene terminology, this consistent, point-in-time view of the index is represented as an index commit.
    The snapshot functionality in Solr implements this step, ensuring that the state of the backup is consistent even in the presence of concurrent indexing (or query) operations. This allows users to back up Solr collections without any disruption (or downtime) of the Solr cluster.
  2. Copy the Lucene index files associated with the captured index commit (in step 1), along with the collection metadata in Apache ZooKeeper, to a user-specified location on a shared file system (e.g. Apache HDFS or an NFS-based file system).

We’ll now walk through the commands provided by Cloudera Search to perform backup and disaster recovery for Solr collections quickly and easily.

First of all, create a collection named “books”. This is needed only for demo purposes; in your environment you will already have a collection with existing data.
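The exact commands depend on your deployment; with the solrctl tool in Cloudera Search, creating the demo collection might look like the following sketch (the config name, shard count, and replica count are illustrative):

```shell
# Generate a template instance directory, upload it to ZooKeeper,
# and create the "books" collection (1 shard, 1 replica for the demo).
solrctl instancedir --generate $HOME/books_config
solrctl instancedir --create books_config $HOME/books_config
solrctl collection --create books -s 1 -r 1 -c books_config
```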

Now initialize this collection with some sample data. The following command inserts a single document into the collection and issues a hard commit. Since the Solr backup functionality works only on hard-committed data, please remember to issue a hard commit before performing the backup operation.
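A minimal example using Solr’s JSON update endpoint (the host, port, and document fields are illustrative; commit=true triggers the hard commit):

```shell
# Insert one sample document into the "books" collection and hard-commit it.
curl 'http://localhost:8983/solr/books/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "book-1", "title": "A Sample Book", "author": "Jane Doe"}]'
```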

At this point we are ready to create a snapshot of the “books” collection. A snapshot is a piece of metadata referring to a specific Lucene index commit. Solr guarantees that this index commit is preserved for the lifetime of the snapshot, in spite of subsequent index optimizations. This enables a Solr collection snapshot to provide a point-in-time, consistent state of index data even in the presence of concurrent indexing operations. Note that snapshot creation is very fast, since it only persists the snapshot metadata and does not copy the associated index files.

The following command creates a snapshot named “my-snap”.
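With solrctl, snapshot creation takes the snapshot name and the collection name (via -c):

```shell
solrctl collection --create-snapshot my-snap -c books
```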

The following command provides the details of this snapshot:
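With solrctl, that is:

```shell
solrctl collection --describe-snapshot my-snap -c books
```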

You can also list the existing snapshots for a collection using the following command:
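With solrctl, listing snapshots takes only the collection name:

```shell
solrctl collection --list-snapshots books
```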

Once a snapshot is created, you can use it to recover from data manipulation errors, such as accidental insertion of new documents or deletion of (or updates to) existing documents. But from a disaster recovery perspective, just creating a snapshot is not enough. There are many scenarios that can lead to data loss even when a snapshot exists. For example, a software bug in Lucene/Solr can corrupt the index files associated with the snapshot, or an administrator can accidentally delete the collection or perform other admin operations, such as deleting a replica or splitting one or more shards. Hence it is important to back up the state of this snapshot to a different location (ideally outside the purview of Solr).

The backup functionality in Solr requires a shared file system to store the Solr collection index files and configuration metadata. Before a backup can be performed, please make sure that the solr.xml in your installation contains this configuration section. (Note: a restart of the Solr service is required after adding this section to solr.xml.)
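The section in question enables an HDFS-backed backup repository. A representative snippet (using Solr’s HdfsBackupRepository class; the property values will vary with your installation) looks like this:

```xml
<backup>
  <repository name="hdfs"
              class="org.apache.solr.core.backup.repository.HdfsBackupRepository"
              default="true">
    <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
    <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
  </repository>
</backup>
```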

Now create a directory in HDFS to store the Solr collection backups. The Solr service user (solr by default) must be able to read from and write to this directory. For the purposes of this blog post we will make the Solr service user the owner of the backup directory, but you can also use the HDFS ACLs capability to enable other users to back up and restore Solr collections.
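For example (run as the HDFS admin user; the backup path /solr-backups is illustrative):

```shell
# Create the backup directory and make the Solr service user its owner.
sudo -u hdfs hdfs dfs -mkdir -p /solr-backups
sudo -u hdfs hdfs dfs -chown solr:solr /solr-backups
```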

Note – ‘hdfs’ is the HDFS admin user.

At this point we are ready to back up the snapshot created earlier (“my-snap”) using the following command. You can review the contents of the backup directory once the backup operation is complete.
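With solrctl, the export takes the snapshot name, the collection, and a destination directory on the shared file system (the HDFS path is illustrative):

```shell
solrctl collection --export-snapshot my-snap -c books -d /solr-backups
# Review the backup contents once the operation completes.
hdfs dfs -ls -R /solr-backups
```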

Once the backup has been created successfully, you can safely delete the snapshot created earlier. This allows Solr to delete the associated Lucene index commit and free up storage space on the cluster. The following command deletes the snapshot:
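Using solrctl:

```shell
solrctl collection --delete-snapshot my-snap -c books
```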

At this point we are ready to restore from the backup created earlier. The restore operation in Solr creates a new collection with a configuration identical to that of the original collection at the time of the backup. Solr also supports overriding some configuration parameters, such as the replication factor, configset name, etc.

The completion time of the restore operation depends on the index size of the original (backed-up) collection as well as the configured replication factor. Hence, for restoring large collections, the recommendation is to use the asynchronous Collections API support in Solr. The solrctl tool in Cloudera Search mandates this by requiring the user to pass a unique request identifier as part of the restore command (using the -i parameter). The restore command just initiates the restore operation and returns immediately.

Run the following command to restore from the backup:
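A sketch using solrctl (the new collection name “books-restored”, the backup location passed via -l, and the request identifier passed via -i are all arbitrary, illustrative names):

```shell
solrctl collection --restore books-restored -l /solr-backups -b my-snap -i restore-books-1
```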

The status of this operation can be monitored by repeatedly invoking the following command until the status is completed (or failed):
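Assuming a (hypothetical) request identifier of restore-books-1 was passed to the restore command via -i:

```shell
solrctl collection --request-status restore-books-1
```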

This command prints the status of the specified request_id. The status can be one of the following:

  • running
  • completed (i.e. successful)
  • failed
  • notfound

Conclusion

In this post we covered the basics of the backup and disaster recovery capability in Cloudera Search. Please refer to the Cloudera Search documentation for details. You can also check out my talk on “Backup and Disaster Recovery for Solr” at the SFBay Apache Lucene/Solr meetup.

 
