The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table or, after disabling the table, to copy all the HFiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.
In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.
Here are a few of the use cases for HBase snapshots:
- Recovery from user/application errors
  - Restore/recover from a known safe state.
  - View previous snapshots and selectively merge the difference into production.
  - Save a snapshot right before a major application upgrade or change.
- Auditing and/or reporting on views of data at a specific time
  - Capture monthly data for compliance purposes.
  - Run end-of-day/month/quarter reports.
- Application testing
  - Test schema or application changes on data similar to that in production from a snapshot and then throw it away. For example: take a snapshot, create a new table from the snapshot content (schema plus data), and manipulate the new table by changing the schema, adding and removing rows, and so on. (The original table, the snapshot, and the new table remain mutually independent.)
- Offloading of work
  - Take a snapshot, export it to another cluster, and run your MapReduce jobs. Since ExportSnapshot operates at the HDFS level, you don’t slow down your main HBase cluster as much as CopyTable does.
What is a Snapshot?
A snapshot is a set of metadata that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back the previous “table schema” and your previous data, losing any changes made since the snapshot was taken. The snapshot operations are:
- Take a snapshot: This operation tries to take a snapshot on a specified table. The operation may fail if regions are moving around during balancing, split or merge.
- Clone a snapshot: This operation creates a new table using the same schema and with the same data present in the specified snapshot. The result of this operation is a new, fully functional table that can be modified with no impact on the original table or the snapshot.
- Restore a snapshot: This operation brings the table schema and data back to the snapshot state. (Note: this operation discards any changes made since the snapshot was taken.)
- Delete a snapshot: This operation removes a snapshot from the system, freeing unshared disk space, without affecting any clones or other snapshots.
- Export a snapshot: This operation copies the snapshot data and metadata to another cluster. The operation only involves HDFS so there’s no communication with the Master or the Region Servers, and thus the HBase cluster can be down.
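The semantics of take, clone, restore, and delete can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not how HBase is implemented: `ToyCluster`, `TableState`, and `Snapshot` are hypothetical names, a table is reduced to a schema plus a set of file names, and export is left out because it involves a second cluster.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableState:
    schema: tuple      # column family names
    files: frozenset   # names of immutable data files


@dataclass(frozen=True)
class Snapshot:
    source_table: str
    state: TableState  # metadata only: schema plus a list of file names


@dataclass
class ToyCluster:
    tables: dict = field(default_factory=dict)     # table name -> TableState
    snapshots: dict = field(default_factory=dict)  # snapshot name -> Snapshot

    def take_snapshot(self, table, name):
        # Records which files the table uses right now; nothing is copied.
        self.snapshots[name] = Snapshot(table, self.tables[table])

    def clone_snapshot(self, name, new_table):
        if new_table in self.tables:
            raise ValueError(f"table already exists: {new_table}")
        # The new table starts out pointing at the same immutable files.
        self.tables[new_table] = self.snapshots[name].state

    def restore_snapshot(self, name):
        snap = self.snapshots[name]
        # Schema and data go back to the snapshot state; later changes are lost.
        self.tables[snap.source_table] = snap.state

    def delete_snapshot(self, name):
        # Clones and other snapshots keep their own references.
        del self.snapshots[name]


cluster = ToyCluster()
cluster.tables["prod"] = TableState(("cf1",), frozenset({"hfile-a", "hfile-b"}))
cluster.take_snapshot("prod", "before_upgrade")

# Changes made after the snapshot: a new column family and a new file.
cluster.tables["prod"] = TableState(
    ("cf1", "cf2"), frozenset({"hfile-a", "hfile-b", "hfile-c"})
)

cluster.clone_snapshot("before_upgrade", "test_copy")
cluster.restore_snapshot("before_upgrade")
cluster.delete_snapshot("before_upgrade")
```

After these calls, both `prod` and `test_copy` hold the snapshot-time schema and file set, and deleting the snapshot has not changed either table.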
Zero-copy Snapshot, Restore, Clone
The main difference between a snapshot and a CopyTable/ExportTable is that the snapshot operations write only metadata. There are no massive data copies involved.
One of the main HBase design principles is that once a file is written it will never be modified. Having immutable files means that a snapshot only needs to keep track of the files in use at the moment of the snapshot operation. When a compaction later replaces those files, it is the responsibility of the snapshot to inform the system that they should not be deleted but archived instead.
The same principle applies to a Clone or Restore operation. Since the files are immutable, a new table is created with just “links” to the files referenced by the snapshot.
Export Snapshot is the only operation that requires a copy of the data, since the other cluster doesn’t have the data files.
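The interplay between immutable files, compaction, and snapshot references can be sketched as a toy model. It follows the description above rather than HBase internals: `ToyStore` and its methods are hypothetical, files are just names with sizes, and "archiving" is a move between two dictionaries.

```python
class ToyStore:
    """Toy model: immutable files, snapshots as sets of file names."""

    def __init__(self):
        self.live = {}       # file name -> size, files the table currently uses
        self.archive = {}    # file name -> size, files kept only for snapshots
        self.snapshots = {}  # snapshot name -> frozenset of file names

    def _referenced(self):
        refs = set()
        for names in self.snapshots.values():
            refs |= names
        return refs

    def take_snapshot(self, name):
        # Metadata only: remember which files are in use right now.
        self.snapshots[name] = frozenset(self.live)

    def compact(self, inputs, output, size):
        # Files are never modified: compaction writes a new file and retires
        # the inputs. Inputs still referenced by a snapshot are archived.
        refs = self._referenced()
        for f in inputs:
            old_size = self.live.pop(f)
            if f in refs:
                self.archive[f] = old_size
        self.live[output] = size

    def delete_snapshot(self, name):
        # Frees only archived files that no remaining snapshot references.
        del self.snapshots[name]
        refs = self._referenced()
        freed = 0
        for f in list(self.archive):
            if f not in refs:
                freed += self.archive.pop(f)
        return freed


store = ToyStore()
store.live = {"f1": 100, "f2": 200}
store.take_snapshot("s1")            # references f1, f2
store.live["f4"] = 50                # a later flush adds a new file
store.take_snapshot("s2")            # references f1, f2, f4
store.compact(["f1", "f2", "f4"], "f5", 330)

freed_s2 = store.delete_snapshot("s2")  # only f4 is unshared
freed_s1 = store.delete_snapshot("s1")  # f1 and f2 are now unreferenced
print(freed_s2, freed_s1)
```

Deleting `s2` releases only the file that `s1` does not share, which is the sense in which deleting a snapshot frees unshared disk space.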
Export Snapshot vs Copy/Export Table
Aside from the better consistency guarantees that a snapshot can provide compared to a Copy/Export job, the main difference between exporting a snapshot and copying/exporting a table is that ExportSnapshot operates at the HDFS level. This means that the Master and Region Servers are not involved in these operations. Consequently, no unnecessary caches for data are created and there is no triggering of additional GC pauses due to the number of objects created during the scan process. The performance impact on the HBase cluster stems from the extra network and disk workload experienced by the DataNodes.
HBase Shell: Snapshot Operations
Confirm that snapshot support is turned on by checking that the hbase.snapshot.enabled property in hbase-site.xml is set to true.
To take a snapshot of a specified table, use the snapshot command. (No file copies are performed.)
hbase> snapshot 'tableName', 'snapshotName'
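The hbase.snapshot.enabled check can also be scripted. The sketch below assumes the usual Hadoop-style layout of hbase-site.xml (a `configuration` root with `property` elements that hold `name` and `value`). The helper name `snapshots_enabled` is hypothetical, and treating a missing property as "not enabled" is an assumption, since the default may differ between releases.

```python
import xml.etree.ElementTree as ET


def snapshots_enabled(hbase_site_xml: str) -> bool:
    """Return True if hbase.snapshot.enabled is explicitly set to true."""
    root = ET.fromstring(hbase_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "hbase.snapshot.enabled":
            return prop.findtext("value", "").strip().lower() == "true"
    # Assumption: a missing property counts as "not enabled".
    return False


SAMPLE = """<configuration>
  <property>
    <name>hbase.snapshot.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

print(snapshots_enabled(SAMPLE))
```

In practice you would read the file contents from your HBase configuration directory and pass them to the function.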
To list all the snapshots, use the list_snapshots command. It displays the snapshot name, the source table, and the creation date and time.
hbase> list_snapshots
SNAPSHOT               TABLE + CREATION TIME
 TestSnapshot          TestTable (Mon Feb 25 21:13:49 +0000 2013)
To remove a snapshot, use the delete_snapshot command. Removing a snapshot doesn’t impact cloned tables or snapshots taken subsequently.
hbase> delete_snapshot 'snapshotName'
To create a new table from a specified snapshot (clone), use the clone_snapshot command. No data copies are performed, so you don’t end up using twice the space for the same data.
hbase> clone_snapshot 'snapshotName', 'newTableName'
To replace the current table schema/data with a specified snapshot content, use the restore_snapshot command.
hbase> restore_snapshot 'snapshotName'
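For the "snapshot before an upgrade, restore if it goes wrong" use case, the shell commands can be generated by a small script. This is a sketch with hypothetical helper names. It also assumes that the table must be disabled before restore_snapshot and re-enabled afterwards, which this article does not state, so check the requirement for your release.

```python
def _check_name(name: str) -> str:
    # Reject quotes and newlines so each generated shell line stays well-formed.
    if not name or "'" in name or "\n" in name:
        raise ValueError(f"unsuitable name: {name!r}")
    return name


def pre_upgrade_script(table: str, snapshot: str) -> str:
    """HBase shell line that records the table state before a risky change."""
    return f"snapshot '{_check_name(table)}', '{_check_name(snapshot)}'\n"


def rollback_script(table: str, snapshot: str) -> str:
    """HBase shell lines that roll the table back to the snapshot."""
    table, snapshot = _check_name(table), _check_name(snapshot)
    lines = [
        f"disable '{table}'",              # assumption: restore needs a disabled table
        f"restore_snapshot '{snapshot}'",  # discards changes made after the snapshot
        f"enable '{table}'",
    ]
    return "\n".join(lines) + "\n"


print(pre_upgrade_script("TestTable", "TestSnapshot"), end="")
print(rollback_script("TestTable", "TestSnapshot"), end="")
```

The generated text can be pasted into the HBase shell or saved to a file for later use.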
To export an existing snapshot to another cluster, use the ExportSnapshot tool. The export doesn’t impact the Region Servers’ workload: it works at the HDFS level, and you have to specify an HDFS location (the hbase.rootdir of the other cluster).
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot SnapshotName -copy-to hdfs://srv2:8082/hbase
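If exports are scheduled from a script, the command line can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The function name is hypothetical, and the destination URL is a placeholder for the other cluster’s hbase.rootdir. The optional `-mappers` flag is documented in the HBase reference guide but is not mentioned in this article, so treat it as an assumption and verify it against your release.

```python
import shlex


def export_snapshot_command(snapshot, dest_rootdir, mappers=None):
    """Build the argument list for the ExportSnapshot tool."""
    args = [
        "hbase",
        "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", dest_rootdir,
    ]
    if mappers is not None:
        # Assumption: -mappers sets the number of map tasks used for the copy.
        args += ["-mappers", str(mappers)]
    return args


cmd = export_snapshot_command("SnapshotName", "hdfs://srv2:8082/hbase", mappers=16)
print(shlex.join(cmd))
```

Returning a list rather than a single string means the arguments can be passed directly to `subprocess.run` without extra quoting.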
Snapshots rely on some assumptions, and currently there are a couple of tools that are not fully integrated with the new feature:
- Merging regions referenced by a snapshot causes data loss on the snapshot and on cloned tables.
- Restoring a table that has replication enabled leaves the two clusters out of sync. The table is not restored on the replica.
Currently the snapshot feature includes all the basic required functionality, but there’s still much work to do, including metrics, Web UI integration, disk usage optimizations and more.
To learn more about how to configure HBase and use snapshots, review the documentation.
Matteo Bertozzi is a Software Engineer on the Platform team, and an HBase committer.