Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

*Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.*

The first thing to do when you find yourself confronted with such a question is to **stay calm**. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a *simple* example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

### A Primer on Reservoir Sampling

For this problem, the simplest concrete example would be a stream that only contained a single item. In this case, our algorithm should return this single element with probability 1. Now let’s try a slightly harder problem, a stream with exactly two elements. We know that we have to hold on to the first element we see from this stream, because we don’t know if we’re in the case that the stream only has one element. When the second element comes along, we know that we want to return one of the two elements, each with probability 1/2. So let’s generate a random number *R* between 0 and 1, and return the first element if *R* is less than 0.5 and return the second element if *R* is greater than 0.5.

Now let’s try to generalize this approach to a stream with three elements. After we’ve seen the second element in the stream, we’re now holding on to either the first element or the second element, each with probability 1/2. When the third element arrives, what should we do? Well, if we know that there are only three elements in the stream, we need to return this third element with probability 1/3, which means that we’ll return the other element we’re holding with probability 1 – 1/3 = 2/3. That means that the probability of returning each element in the stream is as follows:

- First Element: (1/2) * (2/3) = 1/3
- Second Element: (1/2) * (2/3) = 1/3
- Third Element: 1/3

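By considering the stream of three elements, we see how to generalize this algorithm to any N: when the Nth element arrives, replace the element we are holding with the new one with probability 1/N. This means that we have an (N-1)/N probability of keeping the element we are currently holding on to. Since each of the earlier elements was the one being held with probability 1/(N-1), each of them survives this step with probability (1/(N-1)) * (N-1)/N = 1/N, so all N elements are equally likely to be the one we return.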
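The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than production code; the function name `sample_one` and the optional `rng` parameter are my own choices for the example.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def sample_one(stream: Iterable[T], rng: Optional[random.Random] = None) -> Optional[T]:
    """Return one item chosen uniformly at random in a single pass over `stream`.

    Returns None if the stream is empty.
    """
    rng = rng or random.Random()
    chosen = None
    for n, item in enumerate(stream, start=1):
        # Replace the held item with the n-th item with probability 1/n.
        # For n == 1 this always succeeds, so the first item is always held.
        if rng.random() < 1.0 / n:
            chosen = item
    return chosen
```

Only one item and one counter are held at any time, so memory use does not depend on the length of the stream.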

This general technique is called reservoir sampling, and it is useful in a number of applications that require us to analyze very large data sets. You can find an excellent overview of a set of algorithms for performing reservoir sampling in this blog post by Greg Grothaus. I’d like to focus on two of those algorithms in particular, and talk about how they are used in Cloudera ML, our open-source collection of data preparation and machine learning algorithms for Hadoop.

### Applied Reservoir Sampling in Cloudera ML

The first of the algorithms Greg describes is a *distributed* reservoir sampling algorithm. You’ll note that for the algorithm we described above to work, all of the elements in the stream must be read sequentially. To create a distributed reservoir sample of size K, we use a MapReduce analogue of the ORDER BY RAND() trick/anti-pattern from SQL: for each element in the set, we generate a random number *R* between 0 and 1, and keep the K elements that have the largest values of *R*. This trick is especially useful when we want to create stratified samples from a large dataset. Each stratum is a specific combination of categorical variables that is important for an analysis, such as gender, age, or geographical location. If there is significant skew in our input data set, it’s possible that a naive random sampling of observations will underrepresent certain strata in the dataset. Cloudera ML has a sample command that can be used to create stratified samples for text files and Hive tables (via the HCatalog interface to the Hive Metastore) such that N records will be selected for every combination of the categorical variables that define the strata.
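Here is a rough single-process sketch of the random-key idea, with the map and reduce phases simulated by two plain functions. The names `local_top_k` and `merge_top_k` are hypothetical and are not taken from Cloudera ML; a real job would run the first function once per input split.

```python
import heapq
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def local_top_k(partition: Iterable[T], k: int, rng: random.Random) -> List[Tuple[float, T]]:
    """'Map' side: tag each item with a random key R and keep the k largest keys."""
    tagged = ((rng.random(), item) for item in partition)
    # Using key= means only the random keys are compared, never the items.
    return heapq.nlargest(k, tagged, key=lambda pair: pair[0])


def merge_top_k(partials: Iterable[List[Tuple[float, T]]], k: int) -> List[T]:
    """'Reduce' side: merge the per-partition candidates and keep the global top k."""
    merged = (pair for partial in partials for pair in partial)
    return [item for _, item in heapq.nlargest(k, merged, key=lambda pair: pair[0])]


if __name__ == "__main__":
    rng = random.Random()
    partitions = [range(0, 100), range(100, 250), range(250, 300)]
    candidates = [local_top_k(p, 5, rng) for p in partitions]
    print(merge_top_k(candidates, 5))
```

Each partition only needs to ship K candidates to the reducer, because an item outside its partition's local top K cannot be in the global top K. For a stratified sample, the same two steps would be applied separately to the records of each stratum.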

The second algorithm is even more interesting: a *weighted* distributed reservoir sample, where every item in the set has an associated weight, and we want to sample such that the probability that an item is selected is proportional to its weight. It wasn’t clear whether this was even possible until Pavlos Efraimidis and Paul Spirakis figured out a way to do it and published it in the 2005 paper “Weighted Random Sampling with a Reservoir.” The solution is as simple as it is elegant, and it is based on the same idea as the distributed reservoir sampling algorithm described above. For each item in the stream, we compute a score as follows: first, generate a random number *R* between 0 and 1, and then take the *n*th root of *R* (that is, *R* raised to the power 1/*n*), where *n* is the weight of the current item. Return the K items with the highest scores as the sample. Items with higher weights will tend to have scores that are closer to 1, and are thus more likely to be picked than items with smaller weights.
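The scoring rule can be sketched as follows, keeping the K best scores in a min-heap so that only K items are held in memory. The function name `weighted_sample` and the choice to skip items with non-positive weights are assumptions made for this example, not details from the paper or from Cloudera ML.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_sample(
    stream: Iterable[Tuple[T, float]], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick up to k items from (item, weight) pairs in one pass, favoring heavy items."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap: list = []  # min-heap of (score, index, item); index breaks ties
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: non-positive weights are never sampled
        # Score is the weight-th root of a uniform random number R.
        score = rng.random() ** (1.0 / weight)
        entry = (score, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            # The new score beats the smallest score currently kept.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

With K = 1 and two items of weight 1 and 9, the heavier item should be returned about nine times out of ten.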

In Cloudera ML, we use the weighted reservoir sampling algorithm in order to cut down on the number of passes over the input data that the scalable k-means++ algorithm needs to perform. The ksketch command runs the k-means++ initialization procedure, performing a small number of iterations over the input data set to select points that form a representative sample (or *sketch*) of the overall data set. For each iteration, the probability that a given point should be added to the sketch is proportional to its squared distance from the closest point in the current sketch. By using the weighted reservoir sampling algorithm, we can select the points to add to the next sketch in a single pass over the input data, instead of one pass to compute the overall cost of the clustering and a second pass to select the points based on those cost calculations.
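To make the single-pass idea concrete, here is a small in-memory sketch of one such iteration. It is not the ksketch implementation: the function names are hypothetical, points are plain tuples of floats, and I am assuming each point's weight is its squared distance to the closest sketch point, as in the usual k-means++ formulation.

```python
import heapq
import random
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance_to_sketch(point: Point, sketch: Sequence[Point]) -> float:
    """Squared Euclidean distance from `point` to its closest point in `sketch`."""
    return min(sum((p - s) ** 2 for p, s in zip(point, center)) for center in sketch)


def pick_next_points(
    points: Iterable[Point], sketch: Sequence[Point], k: int, rng: random.Random
) -> List[Point]:
    """One pass over `points`: weighted reservoir sample of up to k new sketch points."""
    if k <= 0:
        return []
    heap: list = []  # min-heap of (score, index, point)
    for index, point in enumerate(points):
        weight = squared_distance_to_sketch(point, sketch)
        if weight <= 0.0:
            continue  # the point is already in the sketch
        # The weight is computed and used immediately, so no separate pass
        # is needed to total up the cost before sampling.
        score = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (score, index, point))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, index, point))
    return [point for _, _, point in heap]
```

Points far from the current sketch get large weights and are therefore much more likely to be chosen, which is what pushes the sketch to cover the whole data set.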

### These Books Behind Me Don’t Just Make The Office Look Good

Interesting algorithms aren’t just for the engineers building distributed file systems and search engines; they can also come in handy when you’re working on large-scale data analysis and statistical modeling problems. I’ll try to write some additional posts on algorithms that are interesting as well as useful for data scientists to learn, but in the meantime, it never hurts to brush up on your Knuth.

“Stay Calm” is probably the most important bit of advice we would-be data scientists could receive.

Great explanation of reservoir sampling, Josh. Very user-friendly. Thanks!

You can also have a look at the Vitter paper, which is a kind of state of the art:

http://www.cs.umd.edu/~samir/498/vitter.pdf

Why not sample uniformly on [0, 1] for each element, store the element that had the largest sampled value as we scan the list, and then return the element with the highest sampled value?