I was optimizing some Poisson bootstrap code this week. Unsurprisingly, it turns out that most of the time was spent drawing Poisson random variables. When you've reached the point that your code is spending most of its time in the RNG you generally can call it quits. But what if you didn't? What if you could do better?

This isn't as hubristic as you might expect. I only need the \(\lambda=1\) case. Because the library Poisson RNG has to handle various cases of λ, there should be room to specialize and perhaps save some time. Let's find out.

The Competition

Let's start off by scoping out the competition. I'll use julia just because I don't stand a chance with R since I'm actually competing with C in both cases. Interestingly, I'm competing against the same C code since the current version of julia's Distributions package uses RMath library. Let's see how it does.

using Distributions
using BenchmarkTools

function baseline_version()
    return rand(Poisson(1))
end

@benchmark baseline_version()

Julia has a nice BenchmarkTools package, so I'll just paste the results verbatim. We're mostly interested in the last four rows with the minimum, median, mean, and maximum times. Obviously lower is better.

samples 10000
evals/sample 983
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 0.00 bytes
allocs estimate 0
minimum time 61.00 ns (0.00% GC)
median time 64.00 ns (0.00% GC)
mean time 63.74 ns (0.00% GC)
maximum time 194.00 ns (0.00% GC)

It is important to note the units for the timings: we're talking nano-seconds. I typically don't bother with optimizing something like this since I'll have to run it 6*1010 times to make up a minute of programming time. But this is for fun and glory, so I'll do it now.

Finally, as a side note, we can do slightly better by removing the initialization of the Poisson distribution.

const dist = Poisson(1)
function baseline_no_initialization()
    return rand(dist)
end

@benchmark baseline_no_initialization()
samples 10000
evals/sample 985
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 0.00 bytes
allocs estimate 0
minimum time 56.00 ns (0.00% GC)
median time 58.00 ns (0.00% GC)
mean time 57.89 ns (0.00% GC)
maximum time 107.00 ns (0.00% GC)

When something like variable initialization starts making a difference you're far too deep into the weeds. But let's gather up our machetes and press forward.

The First Attempt: What did you expect?

So let's try to do this ourselves. We'll start off with the obvious approach: we draw a uniform random variable and invert the cdf.

function inverse_cdf()
    return quantile(dist, rand())
end

@benchmark inverse_cdf()
samples 10000
evals/sample 161
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 0.00 bytes
allocs estimate 0
minimum time 643.00 ns (0.00% GC)
median time 706.00 ns (0.00% GC)
mean time 706.15 ns (0.00% GC)
maximum time 1.01 μs (0.00% GC)

This is obviously slower; but it'll form a good building block. Let's write our own version.

function loop_version()
    ii = 0
    u = rand()
    while u < ccdf(dist, ii)
        ii += 1
    end
    return ii
end

@benchmark loop_version()
samples 10000
evals/sample 195
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 0.00 bytes
allocs estimate 0
minimum time 559.00 ns (0.00% GC)
median time 659.00 ns (0.00% GC)
mean time 662.23 ns (0.00% GC)
maximum time 911.00 ns (0.00% GC)

This is smarter than it looks: this is actually performing almost binary search. Straightforward binary search works by taking a sorted list and repeatedly testing the middle element. If the target value is equal to the middle element you return; if it's less than you apply binary search to the left side of the array, if it's greater than you apply it to the right. The key insight is that we can split in the array in half and find the value in log(n) time rather than linear time.

Obviously we have a infinite list of values so something is going to have to be modified (unless you have a good way of determining finding element \(\infty/2\)). What we do instead is that we test the median element first and then recurse on that.

Because the tail probabilities decay exponentially, it turns out the binary search progression is exactly the same as linear search except for the beginning, when you should test 1 instead of 0 first. However, when I coded that version up it was slower, so we'll just go in the regular linear order.

Needless to say, it doesn't matter what order we go in since we're not even in the same ballpark as the baseline. Of course, there's far too much post left for me to have left it at that.

The Second Attempt: Success

Well, it's pretty obvious what's going wrong in the previous example: we're calculating the ccdf function far too often. It doesn't change, so we can move this outside the loop and put it in a lookup table. We have a nice mechanism for doing this that doesn't involve an ugly global constant: generated functions.

This also leads to the question of how far we should go; I say 100 is enough. Technically I'm drawing from a truncated Poisson distribution, but the events this far out have three digits in the exponent so I'm pretty safe.

@generated function loop_cached_version()
    N = 100
    ccdfs = [[ccdf(Poisson(1), ii) for ii in 0:N]; 1.0]

    eval = quote
        u = rand()
        ii = 1
        while u < $ccdfs[ii]
            ii += 1
        end
        return ii - 1
    end

    return eval
end

@benchmark loop_cached_version()
samples 10000
evals/sample 993
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 0.00 bytes
allocs estimate 0
minimum time 30.00 ns (0.00% GC)
median time 43.00 ns (0.00% GC)
mean time 42.77 ns (0.00% GC)
maximum time 79.00 ns (0.00% GC)

I'm pretty pleased with that benchmark.

Final Attempt: The Alias Method

Another approach is to use the Alias method. It turns the linear time loop into a constant time array lookup. Here's a quick implementation based on skimming the Wikipedia article.

@generated function alias()
    ## Hacky table generation; you could optimize this better
    N = 100

    U = [N * pdf(Poisson(1), ii) for ii = 0:(N-1)]
    K = [-1 for ii = 0:(N-1)]

    for ii = 1:(N-1)
        ii = indmax(U)
        jj = findfirst((U .< 1.0) & (K .== -1))
        K[jj] = ii - 1
        U[ii] = U[ii] - (1 - U[jj])
    end

    expr = quote
        u = rand()
        ii = floor(Int64, u * $N) + 1
        yy = $N*u + 1 - ii
        return yy < $U[ii] ? ii : $K[ii]
    end

    return expr
end

@benchmark alias()
samples 10000
evals/sample 977
time tolerance 5.00%
memory tolerance 1.00%
memory estimate 16.00 bytes
allocs estimate 1
minimum time 70.00 ns (0.00% GC)
median time 72.00 ns (0.00% GC)
mean time 73.91 ns (1.27% GC)
maximum time 1.66 μs (95.11% GC)

I'm actually fairly surprised that this is so slow. I imagine it's the floor function which requires a type conversion (which explains the memory allocation). It could probably be more competitive if you just worked with the bits yourself. So you'd make a table with 128 entries and index with the first byte and compare with the rest.

Further Directions

Having mentioned working with bits directly brings up another topic. One problem with all of these method is that it's rather wasteful with its bits. In fact the expected number of bits that we need is just the entropy: 1.882. So generating 64 bit numbers is doing a lot more work than is necessary.

To really take advantage of the bits you might try something like a cached implementation. Sketched out it would look something like this:

## This can be tuned for optimal performance
_optimal_buffer_length = ??

type CachedPoisson
    buffer::Vector{Int}
    counter::Int
end

function CachedPoisson()
    buffer = zeros(Int, _optimal_buffer_length)
    counter = _optimal_buffer_length

    return CachedPoisson(buffer, counter)
end

function rand(::CachedPoisson)
    dist.counter += 1
    if dist.counter == _optimal_buffer_length + 1
        u = rand()
        ## Code goes here
        dist.counter = 1
    end
    return dist.buffer[dist.counter]
end

This wrings the most entropy out of the RNG as it can.

Doing it this way should be really fast; when the buffer is non-empty you're just doing a buffer lookup and when the buffer is empty you're doing a single random number generation followed by the usual costs of your usual method.

Conclusion

So it seems like you can do a great deal better than the regular Poisson implementation, at least if you can specialize.

So which of these methods did I end up using? The base implementation of course!