Guest Post by Michael Schatz and Jimmy Lin
Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research interests are in developing large-scale DNA sequence analysis methods to search for DNA sequence variations related to autism, cancer, and other human diseases, and also to assemble the genomes of new organisms. Given the recent tremendous advances of DNA sequencing technologies, Michael has pioneered the use of Hadoop and cloud computing for accelerating genomics, as described in a guest blog post last fall.
Jimmy Lin is an associate professor in the College of Information Studies at the University of Maryland. His research lies at the intersection of information retrieval and natural language processing, with an emphasis on large-scale distributed algorithms. Currently, Jimmy is spending his sabbatical at Twitter.
Part 1: Graphs and Hadoop
Question: What do PageRank, the Kevin Bacon game, and DNA sequencing all have in common?
As you might know, PageRank is one of the many features Google uses for computing the importance of a webpage based on the other pages that link to it. The intuition is that pages linked from many important pages are themselves important. In the Kevin Bacon game, we try to find the shortest path from Kevin Bacon to your favorite movie star based on who they were costars with. For example, there is a 2 hop path from Kevin Bacon to Jason Lee: Kevin Bacon starred in A Few Good Men with Tom Cruise, whom also starred in Vanilla Star with Jason Lee. In the case of DNA sequencing, we compute the full genome sequence of a person (~3 billion nucleotides) from many short DNA fragments (~100 nucleotides) by constructing and searching the genome assembly graph. The assembly graph connects fragments with the same or similar sequences, and thus long paths of a particular form can spell out entire genomes.
The common aspect for these and countless other important problems, including those in defense & intelligence, recommendation systems & machine learning, social networking analysis, and business intelligence, is the need to analyze enormous graphs: the Web consists of trillions of interconnected pages, IMDB has millions of movies and movie stars, and sequencing a single human genome requires searching for paths between billions of short DNA fragments. At this scale, searching or analyzing a graph on a single machine would be time-consuming at best and totally impossible at worst, especially when the graph cannot possibly be stored in memory on a single computer.
Fortunately, Hadoop and MapReduce can enable us to tackle the largest graphs around by scaling up many graph algorithms to run on entire clusters of commodity machines. The idea of using MapReduce for large-scale graph analysis is as old as MapReduce itself PageRank was one of the original applications for which Google developed MapReduce.
Formally, graphs are comprised of vertices (also called nodes) and edges (also called links). Edges may be directed (e.g., hyperlinks on Web) or undirected (e.g., costars in movies). For convenient processing in MapReduce, graphs are stored as key-value pairs, in which the key is the vertex id (URL, movie name, etc), and the value is a complex record called a tuple that contains the list of neighboring vertices and any other attributes of the graph vertices (text of the webpage, date of the movie, etc). The key point is that the graph will be distributed across the cluster so different portions of the graph, including direct neighbors, may be stored on physically different machines. Nevertheless, we can process the graph in parallel using Hadoop/MapReduce, to compute PageRank or solve the Kevin Bacon game without ever loading the entire graph on one machine.
Graph algorithms in Hadoop/MapReduce generally follow the same pattern of execution: (1) in the map phase, some computation is independently executed on all the vertices in parallel, (2) in the shuffle phase, the partial results of the map phase are passed along the edges to neighboring vertices, including when those vertices are located on physically different machines, and (3) in the reduce phase, the vertices compute a new value based on all the incoming values (once again in parallel). Generically, we can speak of vertices passing messages to their neighbors. For example, in PageRank the current PageRank value of each vertex is divided up and distributed to their neighbors in the map and shuffle phases, and in the reduce phase the destination vertices compute their updated PageRank value as the sum of the incoming values. If necessary, the algorithm can iterate and rerun the MapReduce code multiple times, each time updating a vertexs value based on the new values passed from its neighbors.
This algorithm design pattern fits the large class of graph algorithms that need to distribute messages between neighboring vertices. For search problems like the Kevin Bacon game, we can use this pattern to execute a frontier search that initially distributes the fact that there is a 1-hop path from Kevin Bacon to all of his immediate costars in the first MapReduce iteration. In the second MapReduce iteration the code extends these partial 1-hop paths to all of his 2-hop neighbors, and so forth until we find the shortest path to our favorite movie star. Be warned, though, that frontier search algorithms generally require space that is exponential in the search depth therefore a naïve frontier search in MapReduce is not appropriate for searching for very deep connections: you may exhaust the disk storage of your cluster or wait a long time waiting for the network to shuffle terabytes upon terabytes of intermediate data. In contrast, PageRank is computed using values just from immediate neighbors, and is therefore more suitable for parallelization with Hadoop/MapReduce.
The other main technical challenge of MapReduce graph algorithms is that the graph structure must be available at each iteration, but in the design above we only distribute the messages (partial PageRank values, partial search paths, etc). This challenge is normally resolved by passing along the graph structure from the mappers to the reducers. In more detail: the mapper reads in a vertex as input, emits messages for neighboring vertices using the neighboring vertex ids as the keys, and also reemits the vertex tuple with the current vertex id as the key. Then, as usual, the shuffle phase collects key-value pairs with the same key, which effectively collects together a vertex tuple with all the messages destined for that vertex (remember, this happens in parallel on multiple reducers). The reducer then processes each vertex tuple with associated messages, computes an updated value, and saves away the updated vertex with the complete graph structure for the next iteration. But wait, you might ask: doesnt this entail the mappers emitting two different types of values (messages destined for neighboring vertices and the graph structure)? Yes, this is handled by tagging each value to indicate which type it is, so that the reducer can process appropriately. For more details about such graph algorithms, you be interested in Jimmy Lin and Chris Dyers recent book on MapReduce algorithm design.
This basic design works and with it we can compute PageRank, solve the Kevin Bacon game, assemble together genomes, and attack many other large-scale graph problems. However, it has several inefficiencies that needlessly slow it down, such as the poor use of locality and substantial unnecessary computation. In part two we will explore the causes of those inefficiencies, and present a set of simple techniques called Schimmy that we developed that can dramatically improve the runtime of virtually all Hadoop/MapReduce graph algorithms without requiring any changes to the underlying Hadoop implementation.