Understanding MapReduce via Boggle, Part 2: Performance Optimization

In Part 1 of this series, you learned about MapReduce’s ability to process graphs via the example of Boggle*. The project’s full source code can be found on my GitHub account.

The example comprised a 4×4 matrix of letters, which doesn’t come close to the number of relationships in a large graph. To calculate the number of possible combinations, we turned off the Bloom Filter with "-D bloom=false". That puts the program into a brute-force mode in which all possible combinations in the graph are traversed. A 4×4 or 16-letter matrix contains 11,686,456 combinations, and a 5×5 or 25-letter matrix contains 9,810,468,798 combinations.

As previously discussed, increasing matrix size is an important part of scaling up. We also want to use the cluster effectively when processing the graph. In this post, I’ll describe some of the optimizations I used to improve performance and scalability.

Although Boggle is sold in only two versions, with 4×4 and 5×5 (“Big Boggle”) matrices, in our code you can create arbitrarily sized matrices with the rollversion parameter. For example, passing "-D rollversion=10" to the driver will create a random 10×10 matrix, with each die chosen at random from the Big Boggle dice.

Bloom Filter

A Bloom Filter is a compact, probabilistic data structure built on hash functions. Instead of taking up several MB of RAM to store the full dataset, a Bloom Filter can be constrained to whatever size or memory limit the programmer wants to use. A Bloom Filter is used for quick membership tests. This allows a program to quickly answer the question: “Could this object possibly be a member of this dataset?”

For the Boggle code, I use the Bloom Filter to avoid or filter out possible dead ends. Let’s take another look at the Boggle roll.

We humans are very good at throwing out dead ends. In this roll, for example, we wouldn’t spend much time on the “Z” at [2, 3]. Why not? Because there aren’t many words in the English language that start with or contain a “Z”. Our time is better spent on letters like “T” or “R”, which occur much more often in English.

To weed out these dead ends, we could build a complex statistical model based on letter occurrence, or we could use a simple Bloom Filter. Since we know all the words in the English language, we can pass them through a Bloom Filter, and it can then tell us whether the current node is a dead end.

Consider the code that creates the Bloom file. Note that, for brevity, the snippets in this post are condensed sketches with portions taken out; please consult the code on GitHub for the unedited version.

 
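This sketch uses Hadoop’s built-in org.apache.hadoop.util.bloom.BloomFilter; the file names, class name, vector size, and hash count are illustrative rather than the exact values in the repository.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFileWriter {
  public static void main(String[] args) throws IOException {
    // Vector size and hash count are illustrative; size them for your dictionary.
    BloomFilter bloomFilter = new BloomFilter(1048576, 10, Hash.MURMUR_HASH);

    BufferedReader reader = new BufferedReader(new FileReader("dictionary.txt"));
    String word;

    while ((word = reader.readLine()) != null) {
      // Add every prefix of the word ("c", "ca", "cat", ...) so the mapper can
      // test partial words as it walks the graph one letter at a time.
      for (int i = 1; i <= word.length(); i++) {
        bloomFilter.add(new Key(word.substring(0, i).getBytes()));
      }
    }
    reader.close();

    // Serialize the filter to HDFS so the graph-traversal mappers can load it.
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("bloom_filter"));
    bloomFilter.write(out);
    out.close();
  }
}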

This code goes through the dictionary file containing all the words in the English language. Each word is broken down into progressively longer prefixes, with each iteration adding one more character until the entire word has been added. This iteration is needed because it mirrors how the graph is traversed: a word is found one iteration at a time in the mapper.

Here is the relevant code from the mapper.

 
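In sketch form, the check looks like this. It is only a fragment: the bloomFilter field is loaded from HDFS in setup() (not shown), and getAdjacentLetters() is a hypothetical helper standing in for the real adjacency lookup.

// Illustrative excerpt from the graph-traversal mapper.
@Override
public void map(Text key, Text value, Context context)
    throws IOException, InterruptedException {
  String currentWord = key.toString();

  for (char adjacentLetter : getAdjacentLetters(value)) {  // hypothetical helper
    String newWord = currentWord + adjacentLetter;

    // Only emit the longer prefix if the Bloom Filter says it might become a word.
    if (bloomFilter.membershipTest(new Key(newWord.getBytes()))) {
      context.write(new Text(newWord), value);
    }
    // Otherwise this branch is a dead end and gets pruned right here.
  }
}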

The newly extended word is passed through the Bloom Filter to see if the word might exist.

Notice that I said “might”. That’s one of the compromises you make when using a Bloom Filter: you may get false positives, although you will not get false negatives. In other words, if the Bloom Filter says “no,” it really means no; if it says “yes,” that’s more like a “maybe.”

Now that we’ve learned a little about the Bloom Filter and its use in the Boggle algorithm, let’s talk about the benefits. Here is some output from the Boggle program:

 

This speaks to the savings in graph processing created by the Bloom Filter. By throwing out dead ends early, we saved a lot of processing time. Looking at our Boggle roll, the Bloom Filter allowed us to throw out prefixes like “ZR”, “ZT”, and “ZF”. This saved a significant amount of processing because the number of permutations grows factorially: throwing out a word early means it and all of its children, grandchildren, great-grandchildren, and further descendants are never even processed. Although this example says the Bloom Filter saved 9,835 nodes, it saved significantly more when you count all of the descendants.

To illustrate this, look at another set of outputs from two Boggle program runs on a 5×5 matrix with a three-node cluster:

 

There are two main data points in the outputs: the optimized Bloom took a lot less time to run and had a much lower false positive rate. The unoptimized Bloom took 1 hour, 26 minutes, and 31 seconds, while the optimized Bloom took 3 minutes, 24 seconds. The unoptimized Bloom took so much longer because that run pursued many more dead ends and couldn’t throw them out as early. By comparison, the unoptimized Bloom run found 9,733,548,413 potential words and the optimized Bloom run found 4,376. To put it another way, the unoptimized Bloom traversed 99.2% of the 9,810,468,798 possible combinations, while the optimized Bloom traversed only 0.00004%. There are nowhere near 9.7 billion real words in the roll; the vast majority of those potential words were false positives.

Although your use case may not use Bloom Filters, the lesson still applies. Try to throw out data in your mapper as early as possible, especially when you’re dealing with iterative MapReduce algorithms.

Compression

Depending on your data, adding compression can improve your job’s speed, not just its size on disk. For the Boggle data, it did both.

Consider this code from the driver.

 
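In sketch form, using the standard SequenceFileOutputFormat helpers (the exact calls in the repository may differ slightly):

// Compress the sequence files that each iteration writes for the next one.
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);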

With some relatively simple code, I added Snappy compression. The setOutputCompressionType call sets the compression to run at the block level. The size of a Snappy block will vary depending on the version of CDH, but the default is 32KB or 256KB. You can set this parameter with io.compression.codec.snappy.buffersize.
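For example, a single line in the driver’s Configuration would set the buffer to 256KB (the value here is just for illustration):

conf.setInt("io.compression.codec.snappy.buffersize", 256 * 1024);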

If you remember, the mapper that traverses the graph is pretty chatty. It outputs every previous or final node as well as all adjacent nodes, and it does this on every iteration until the graph is completely traversed. Each iteration’s mapper output is fed into the next iteration’s mapper as input.

By compressing the data on both sides of each iteration, we save disk space and improve speed. In this case, each block compresses down to a smaller size while retaining all of its information. On I/O-bound jobs with data that compresses well, this can overcome some of the speed limits of the hardware: it’s faster to compress, write to disk, and decompress again during the next iteration than to skip compression and simply write the raw data to disk.

Sequence Files

Sequence files are flat files of binary keys and values that give you automatic serialization and deserialization into and out of the mapper and reducer.

Before using sequence files, I was serializing the data out to plain text. That wasted extra time and space doing my own serialization and deserialization when the same functionality is built into Hadoop.

It helped clean up some of my code, too; you can look at the diff to see that it removed a lot of the parsing and error handling.

Using sequence files is easy. Here is the setup from the driver:

 
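In sketch form (job is the usual org.apache.hadoop.mapreduce.Job):

// Each graph-traversal iteration reads and writes sequence files.
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);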

This section sets the graph-traversal iterations to use sequence files as the input and output. To prime the pump, that is, to write out the adjacency matrix of the initial Boggle roll, we use this code:

 
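In sketch form, where roll, fs, conf, and inputDir stand for the roll’s letters, the FileSystem, the Configuration, and the job’s input directory, and the key/value layout is illustrative:

// Write one sequence file per row of the roll so the first traversal job
// starts with several input splits.
for (int row = 0; row < roll.length; row++) {
  Path rowFile = new Path(inputDir, "roll-" + row);
  SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, rowFile, Text.class, Text.class);

  for (int col = 0; col < roll[row].length; col++) {
    // Key: the letter at this node. Value: the node's coordinates.
    writer.append(new Text(roll[row][col]), new Text(row + "," + col));
  }
  writer.close();
}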

This code creates a file per row of the matrix. Because each row lives in its own file, the cluster can distribute the load across multiple mappers.

The final MapReduce job that finds the actual words needs to be changed, too.

 
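In sketch form:

// The word-finding job reads the sequence files written by the last traversal
// pass. No output format is set, so it falls back to the default, TextOutputFormat.
job.setInputFormatClass(SequenceFileInputFormat.class);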

The job is configured to use sequence files as the input. This input comes from the final graph-traversal mapper, whose output is in sequence file format. Note that the output file format is not set; we want the output to be a plain-text representation of the nodes.

I hope you enjoyed learning a little about Graph Theory using MapReduce. While your project may not be using graphs, some of the performance tricks might benefit your MapReduce programs. Most of these tricks require little extra coding and could give your program that added boost.

Jesse Anderson is a Curriculum Developer and Instructor for Cloudera University.

* Boggle is a trademark of Hasbro.
