Analyzing Human Genomes with Apache Hadoop
Every day, we hear about people doing amazing things with Apache Hadoop. The variety of applications across industries is clear evidence that Hadoop is radically changing the way data is processed at scale. To drive that point home, we’re excited to host a guest blog post from the University of Maryland’s Michael Schatz. Michael and his team have built a system using Hadoop that drives the cost of analyzing a human genome below $100 — and there’s more to come! Like Michael, we’re excited about the power that Hadoop offers biotech researchers. Thanks, Michael! -Christophe
Ben Langmead and I are very pleased to announce the release of Crossbow, an open-source, Hadoop-enabled pipeline for quickly, accurately, and cheaply analyzing human genomes in the clouds. DNA sequencing has improved tremendously since the completion of the Human Genome Project in 2003, and it is now possible to sequence a genome in a few days for about $50,000. This more-than-thousand-fold improvement in throughput and cost is spurring a new era of biomedical research, in which genomes from many individuals are sequenced and studied over the course of a single project. Human genomes are about 99.9% identical, which explains our overall similarity, but discovering the differences between genomes is the key to understanding many diseases, including how to treat them.
While sequencing has undoubtedly become an important and ubiquitous tool, the rapid improvements in sequencing technology have created a “firehose” problem: how to store and analyze the huge volume of DNA sequence data being generated. The human genome is about 3 billion DNA nucleotides (characters), about the same size as the English portion of Wikipedia. Storing or searching one genome by itself is not too difficult, and standard tools can search it quite efficiently on a single computer. However, because of the limitations of DNA sequencing technology, we cannot simply read an entire genome end-to-end. Instead, the machine reports a very large number of tiny fragments called reads, each 25-500 letters long, collected from random locations in the genome. Then, much like raindrops eventually covering a whole sidewalk, we can sequence an entire genome by sequencing many billions of reads, with 20-fold to 30-fold oversampling to ensure each nucleotide is seen. At present, this process generates about 100GB of compressed data (read sequences and associated quality scores) for one human genome.
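The oversampling arithmetic works out as a quick back-of-the-envelope calculation (the read length below is an illustrative assumption; real reads range from 25 to 500 letters):

```python
# Back-of-the-envelope sequencing math, using figures from the post.
# READ_LENGTH is an assumed illustrative value within the 25-500 range.
GENOME_SIZE = 3_000_000_000  # ~3 billion nucleotides in the human genome
READ_LENGTH = 100            # assumed average read length (letters)
COVERAGE = 30                # 30-fold oversampling

reads_needed = GENOME_SIZE * COVERAGE // READ_LENGTH
print(f"{reads_needed:,} reads")  # 900,000,000 reads at 100 letters and 30x
```

At 100-letter reads and 30-fold coverage, that is on the order of a billion reads per genome, which is why the raw data runs to roughly 100GB even compressed.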
Once collected, we can map the billions of reads to the reference human genome using sequence alignment algorithms, and then scan the alignments to find differences between the newly sequenced genome and the reference. Again, mapping and scanning 100GB of data isn’t too onerous, especially for large sequencing centers with large compute grids, and recent studies of individual sequenced genomes have completed the analysis in about 1,000 CPU hours of computation. The “problem” is that sequencing technology continues to improve, and pretty soon a single sequencing machine will generate 100GB of data in a few hours. If our computational methods aren’t as efficient as our sequencing methods, we’ll only fall further and further behind as more and more data arrives. Clearly we need very efficient and scalable methods if we hope to keep up, especially as sequencing moves from large sequencing centers to smaller research centers, and perhaps eventually to hospitals and clinical labs.
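The map-then-scan idea can be illustrated with a toy sketch: a naive aligner that tolerates one mismatch, and a scanner that reports positions where the consensus of the piled-up reads disagrees with the reference. Real tools like Bowtie and SOAPsnp are vastly more sophisticated; the sequences and function names here are purely illustrative.

```python
from collections import Counter, defaultdict

def align(reference, read, max_mismatches=1):
    """Toy aligner: slide the read along the reference and report the
    first offset with at most max_mismatches mismatches, or -1."""
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if sum(a != b for a, b in zip(window, read)) <= max_mismatches:
            return pos
    return -1

def call_differences(reference, reads):
    """Map reads onto the reference, then scan the pileup for positions
    where the consensus base disagrees with the reference base."""
    pileup = defaultdict(list)  # position -> bases observed there
    for read in reads:
        pos = align(reference, read)
        if pos < 0:
            continue  # unalignable read; discard
        for offset, base in enumerate(read):
            pileup[pos + offset].append(base)
    differences = {}
    for pos, bases in pileup.items():
        consensus, _ = Counter(bases).most_common(1)[0]
        if consensus != reference[pos]:
            differences[pos] = (reference[pos], consensus)
    return differences

# Reads sampled from an individual carrying an A->T change at position 4.
reference = "ACGTACGTAC"
reads = ["ACGTT", "GTTCG", "TTCGT"]
print(call_differences(reference, reads))  # {4: ('A', 'T')}
```

The oversampling from the previous paragraph is what makes the scan reliable: with 20-30 reads covering each position, a genuine difference shows up consistently in the pileup, while a one-off sequencing error is outvoted.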
This is exactly the problem Crossbow aims to solve. Crossbow combines one of the fastest sequence alignment algorithms, Bowtie, with a very accurate genotyping algorithm, SOAPsnp, within Hadoop to distribute and accelerate the computation. The pipeline can accurately analyze an entire genome in one day on a 10-node local cluster, or in about three hours for less than $100 using a 40-node, 320-core cluster rented from Amazon’s EC2 utility computing service. Our evaluation against a “gold standard” of known differences within the individual shows that Crossbow is better than 99% accurate at identifying differences between human genomes. We set out to create a tool that could reproduce the analysis of a recent whole-genome study, and we did exactly that, only much, much faster, and running in the clouds. As such, any researcher in the world can reproduce our results or use our pipeline to analyze their own data. As sequencing reaches an ever wider audience and comes into use in small labs, Crossbow will enable the computational analysis without requiring researchers to own or maintain their own compute infrastructure.
This is a compelling result from both a user's and a systems perspective: an accurate, fast, and cheap way of squeezing 1,000 hours of computation into an afternoon, all made possible with MapReduce/Hadoop. It is also noteworthy that Crossbow uses Hadoop Streaming, which let us reuse existing tools written in C rather than reimplement their sophisticated algorithms in Hadoop’s native Java. In this way Hadoop was a good fit for our needs: it runs and monitors Bowtie and SOAPsnp in parallel on many nodes, adds fault tolerance, and takes care of the massive distributed sorts the analysis requires. Now that we are starting to think in MapReduce/Hadoop, several extensions to Crossbow are apparent, and we are considering how to apply these techniques to analyze copy number variations, RNA-seq data, Methyl-Seq, ChIP-seq, structural variations, and more. I’m also nearly done with a MapReduce/Hadoop-based de novo assembler that scales to assemble mammalian genomes from short reads.
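The appeal of Hadoop Streaming is that a mapper or reducer is just a program reading lines on stdin and writing tab-separated key/value lines on stdout, so an existing C binary can slot in unchanged. A minimal local simulation of that map, shuffle/sort, reduce flow, with stand-in functions where Crossbow would actually invoke Bowtie and SOAPsnp (all names and data here are illustrative):

```python
from itertools import groupby

def mapper(records):
    """Stand-in for the alignment step: emit 'position<TAB>base' for
    each aligned base. Crossbow's real mapper shells out to Bowtie."""
    for pos, base in records:
        yield f"{pos:09d}\t{base}"  # zero-pad so text sort orders numerically

def shuffle(pairs):
    """Hadoop's distributed sort, simulated locally: order by key so the
    reducer sees all values for one position consecutively."""
    return sorted(pairs)

def reducer(pairs):
    """Count the bases observed at each position (a crude stand-in for
    the genotyping step SOAPsnp performs)."""
    for key, group in groupby(pairs, key=lambda kv: kv.split("\t")[0]):
        bases = [kv.split("\t")[1] for kv in group]
        yield int(key), {b: bases.count(b) for b in set(bases)}

records = [(4, "T"), (4, "T"), (5, "C"), (4, "A")]
print(dict(reducer(shuffle(mapper(records)))))
```

In the real pipeline the sort in the middle is the part Hadoop does at massive scale across nodes, which is exactly the "massive distributed sorts" the analysis depends on.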
I’m really excited about Crossbow, and about the role of Hadoop in computational biology. Crossbow solves one of the biggest problems in personalized genomics research, and I hope it will someday be used to understand or cure diseases. Furthermore, Crossbow shows how Hadoop can be an enabling technology for computational biology, and I foresee widespread use of it in the future.
For more information see: http://bowtie-bio.sf.net/crossbow.