Introduction

We want to get a random sample of size \(n\) from a population of size \(N\). The thing is, we don't know \(N\).

I'll admit, at first I thought this was just a fun mathematical toy, but there are actually practical situations where this is useful. Suppose I want a random sample of 10,000 tweets from all of the tweets generated in the next 24 hours. I could store all of these tweets and then sample from that collection, but that approach takes far more time and memory than is really necessary. It's a two-pass algorithm: the first pass collects and the second pass samples. What we want is a single-pass algorithm which looks at each tweet only once.

Reservoir Sampling

A simple solution is to maintain a reservoir, a collection of \(n\) items, which contains your sample at any given time. When you encounter a new element you have to decide whether to add it to the reservoir or discard it. If you add the new element, then an old element must be discarded to maintain the correct size.

For a random sample we make these decisions randomly, in effect tossing a coin to decide whether the item should be in or out. However, we need to be careful to maintain uniform inclusion probabilities [1]. To illustrate, suppose we wanted just a single sample (\(n = 1\)) and flipped a fair coin for each new element, replacing the current sample on heads. Then the last element we see has a 50% chance of being selected, the penultimate 25%, the antepenultimate 12.5%, and the preantepenultimate 6.25%.
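To see the skew concretely, here's a quick simulation of that naive fair-coin scheme (a minimal sketch; coin_sample is a hypothetical helper, not the algorithm we're building):

function coin_sample(source)
    ## keep one element; each newcomer replaces it on a heads
    kept = nothing
    for item in source
        if kept === nothing || rand(Bool)
            kept = item
        end
    end
    return kept
end

counts = zeros(Int, 5)
for _ in 1:100_000
    counts[coin_sample(1:5)] += 1
end
counts ./ 100_000 # roughly [0.0625, 0.0625, 0.125, 0.25, 0.5]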

The solution, then, is to flip a biased coin with inclusion probability \(\frac{n}{i}\) for the \(i\)th element. If it comes up heads, we select uniformly at random one of the elements of the reservoir to discard.
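In code, the per-element decision looks something like this sketch (maybe_include! is a hypothetical helper to show the two random choices; the full implementation appears below):

## decide the fate of the ith element (i > n)
function maybe_include!(sample, item, i, n)
    if rand() < n / i             # biased coin: include with probability n/i
        sample[rand(1:n)] = item  # evict a uniformly chosen reservoir slot
    end
    return sample
end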

Quick proof of correctness

Now we should show that we've actually achieved the goal: for a random sample, the probability of inclusion, denoted \(\Pr(i \in S)\) where \(S\) is the set of included indices, should equal \(\frac{n}{N}\) for all elements.

The key to proving this is to notice that two events have to happen for \(i\) to be in the sample. First, we have to choose to include \(i\) in the sample. Then \(i\) has to avoid being replaced by any later element \(j > i\).

We'll only work out the math for the case where \(i > n\) [2].

\[ \Pr(i \in S) = \Pr(\text{$i$ is selected})\Pr(\text{$i$ is not replaced} \mid \text{$i$ is selected}) \]

We need to calculate the probability of selecting \(i\). By construction, this is just \(\frac{n}{i}\).

Now we need to calculate the probability that \(i\) is never replaced, which is the product

\[ \Pr(\text{$i$ is not replaced by $i + 1$}) \Pr(\text{$i$ is not replaced by $i + 2$} \mid \text{$i$ is not replaced before $i + 2$}) \dots \Pr(\text{$i$ is not replaced by $N$} \mid \text{$i$ is not replaced before $N$}) \]

where, since element \(j\) can replace \(i\) only by being selected (probability \(\frac{n}{j}\)) and then evicting \(i\)'s slot in particular (probability \(\frac{1}{n}\)), each term is just

\[ \Pr(\text{$i$ is not replaced by $j$} \mid \text{$i$ is not replaced before $j$}) = \frac{j-1}{j} \]

and thus the product telescopes to

\[ \frac{i}{i+1} \times \frac{i+1}{i+2} \times \dots \times \frac{N -1}{N} = \frac{i}{N} \]

Combining these two parts:

\[ \Pr(i \in S) = \Pr(\text{$i$ is selected})\Pr(\text{$i$ is not replaced} \mid \text{$i$ is selected}) = \left(\frac{n}{i}\right)\left(\frac{i}{N}\right) = \frac{n}{N} \]

Problem solved.

Julia Implementation

function reservoir_sample(source, n::Int)
    sample = Vector{Any}(undef, n) # can be specialized

    ## automatically add the first n items
    it = iterate(source)
    nseen = 0
    while nseen < n
        ## check for early exhaustion of the source
        it === nothing && error("source has fewer than $n items")
        item, state = it
        nseen += 1
        sample[nseen] = item
        it = iterate(source, state)
    end

    ## then continue until the source is exhausted
    while it !== nothing
        item, state = it
        nseen += 1

        r = rand(1:nseen)    # r is uniform on 1:nseen...
        if r <= n            # ...so r <= n happens with probability n/nseen
            sample[r] = item # and r itself is a uniform slot in 1:n
        end
        it = iterate(source, state)
    end

    return sample
end
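As a quick sanity check (a sketch using the function above; exact frequencies will vary from run to run), we can verify that each element's empirical inclusion frequency comes out near \(\frac{n}{N}\):

## every element of 1:N should land in the sample with
## probability n/N = 0.1
N, n, trials = 100, 10, 100_000
counts = zeros(Int, N)
for _ in 1:trials
    for x in reservoir_sample(1:N, n)
        counts[x] += 1
    end
end
extrema(counts ./ trials) # both extremes should be near 0.1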

A note on the second-order probabilities

One subtlety that might escape the notice of non-statisticians is that we care not only about the inclusion probability of a single element, but also about the second- and higher-order inclusion probabilities. The \(k\)th-order inclusion probabilities are simply the probabilities \(\Pr(z_{1} \in S, z_{2} \in S, \dots, z_{k} \in S)\) for any choice of \(k\) distinct elements \(z_{1}, \dots, z_{k}\).

To illustrate why we might care about these properties, suppose you have 10 Democrats and 10 Republicans in a room and you want to estimate the President's approval rating. Because you don't want to interview them all, you'll settle for a sample of 10. One method which gives you equal first-order selection probabilities is to flip a coin: heads you select all of the Democrats, tails you select all of the Republicans. Your estimate will be unbiased, but nobody would think this is a good sampling strategy because your variance is going to be much larger than necessary.
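Here's a toy simulation of that scenario (hypothetical data, with every Democrat approving and every Republican disapproving, so the true rate is 50%):

using Statistics

room = vcat(ones(10), zeros(10)) # 10 approvers, 10 disapprovers

## strategy 1: a coin flip selects one whole party
coin_est() = rand(Bool) ? 1.0 : 0.0

## strategy 2: a simple random sample of 10 via the reservoir
srs_est() = sum(reservoir_sample(room, 10)) / 10

trials = 10_000
var([coin_est() for _ in 1:trials]) # ≈ 0.25
var([srs_est() for _ in 1:trials])  # far smaller

Both estimators have mean 0.5, but the coin-flip strategy's variance is about 19 times larger (0.25 versus 0.25/19 for the simple random sample).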

Now, Wikipedia states that the second-order probabilities depend upon the order of the records (in the quote, Wikipedia's \(k\) and \(n\) play the roles of our \(n\) and \(N\)):

While the first-order selection probabilities are equal to k/n … the second order selection probabilities depend on the order in which the records are sorted in the original reservoir.

This is false, and we can show it by induction. If we're selecting \(n\) elements, our base case is the stage at which we have seen exactly \(n\) elements. Clearly the second-order inclusion probabilities are equal at that point, since they're all 1: everything we've seen so far is in our sample.

Now, if we have equal second-order inclusion probabilities after seeing \(k \ge n\) elements, we just have to show that they remain equal after we've seen \(k+1\) elements. We have two cases, the first where the pair includes the new element \(k+1\)

\[ \Pr_{k+1}(i \in S, k+1 \in S) = \Pr_{k}(i \in S) \times \Pr(\text{$k+1$ is selected but does not replace $i$}) = \frac{n}{k} \times \left( \frac{n}{k+1} \times \frac{n-1}{n} \right) = \frac{n}{k} \times \frac{n-1}{k+1} \]

and the second in which \(i\) and \(j\) are both less than \(k+1\).

\[ \Pr_{k+1}(i \in S, j \in S) = \Pr_{k}(i, j) \times \Pr(\text{neither $i$ nor $j$ is replaced}) = \Pr_{k}(i, j) \times \frac{(k+1)-2}{k+1}\]

and because the \(\Pr_{k}(i,j)\) are all equal by the induction hypothesis, each must equal \(\frac{n(n-1)}{k(k-1)}\) (of the \(\binom{k}{2}\) possible pairs, the sample contains \(\binom{n}{2}\)), so the above expression is just

\[ \Pr_{k+1}(i \in S, j \in S) = \frac{n(n-1)}{k(k-1)} \times \frac{(k+1)-2}{k+1} = \frac{n(n-1)}{k(k+1)} = \frac{n}{k} \times \frac{n-1}{k+1} \]

And thus we've shown that all of the second-order probabilities are equal at step \(k+1\): both cases reduce to \(\frac{n}{k} \times \frac{n-1}{k+1}\). Apply induction and it holds at every step, in particular at step \(N\). A similar argument could be made for all of the higher-order inclusion probabilities.
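We can check this empirically too (a sketch reusing reservoir_sample from above; counts will fluctuate a bit from run to run):

## every pair should be sampled together with probability
## n(n-1) / (N(N-1)) = 5*4 / (20*19) ≈ 0.0526
N, n, trials = 20, 5, 100_000
pair_counts = zeros(Int, N, N)
for _ in 1:trials
    s = reservoir_sample(1:N, n)
    for a in s, b in s
        a != b && (pair_counts[a, b] += 1)
    end
end
extrema(pair_counts[i, j] / trials for i in 1:N, j in 1:N if i != j)

If the Wikipedia claim were right, pairs of early indices would show systematically different frequencies than pairs involving late indices; instead both extremes should sit near 0.0526.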

Footnotes:

[1] Or not: if we're doing something like estimating a population mean, we can use the Horvitz-Thompson estimator, which takes the differing inclusion probabilities into account so that our estimate remains unbiased. It does mess with your variance, though.
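For reference, with inclusion probabilities \(\pi_{i}\) the Horvitz-Thompson estimator of the population total weights each sampled value by its inverse inclusion probability:

\[ \hat{Y}_{HT} = \sum_{i \in S} \frac{y_{i}}{\pi_{i}} \]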

[2] Proving the case where \(i \le n\) would be a good check of your understanding of the proof for \(i > n\).