Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.
DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown in popularity despite relatively slow performance.
In this post, we’ll provide a quick introduction to DistCp.