March Madness Prediction
I generally don't invest much time exploring the connections between statistics and sports.¹ But I thought I'd take advantage of the focus on NCAA basketball to introduce some work I'd previously done.
Bradley-Terry Model
For a project in undergrad I modeled the MLB season using a Bradley-Terry model. The general idea is that each team is assigned a score, \(\lambda_{i} > 0\). Then the probability that team \(i\) beats team \(j\) is simply
\[ \frac{\lambda_{i}}{\lambda_{i} + \lambda_{j}} \]
That's it; the model is incredibly simple. Variants exist that allow for more complexity, but I'll stick with the basic version.
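In fact, the whole model fits in a line of Julia. A minimal sketch (the name `win_prob` is mine, not anything from my repository):

```julia
# Probability that team i beats team j, given their scores λi and λj.
win_prob(λi, λj) = λi / (λi + λj)

win_prob(2.0, 1.0)  # ≈ 0.667: a team with twice the score wins two-thirds of the time
```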
Fitting the Model
We now have the task of fitting the model. There are two main options here: you can find the maximum likelihood estimate using an iterative procedure, or you can take the Bayesian route. Since I was taking a Bayesian class at the time, it's clear which way I went.
Fitting this model gets a little tricky, though. The problem is that we don't have easy conjugacy with which to build a Gibbs sampler. No problem; we can always fall back on Metropolis-Hastings sampling. Unfortunately, that didn't work very well: I was writing poor R code, things were slow, and I couldn't get the chain to converge. With the deadline looming, I needed to try something different.
Then I found this paper: Efficient Bayesian Inference for Generalized Bradley-Terry Models. It turns out you can write a Gibbs sampler for this model; you just need to introduce the right latent variables. It reframes the contest as follows:
Let \(Y_{ki} \sim \text{Exp}(\lambda_{i})\) be the arrival time for team \(i\) in game \(k\). A game is then a race, won by whichever team arrives first. Using the properties of exponentials we see that
\[ \Pr(Y_{ki} < Y_{kj}) = \frac{\lambda_{i}}{\lambda_{i} + \lambda_{j}} \]
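To see this, integrate the chance that team \(i\) arrives at time \(t\) while team \(j\) has not yet arrived:
\[ \Pr(Y_{ki} < Y_{kj}) = \int_{0}^{\infty} \lambda_{i} e^{-\lambda_{i} t} \, e^{-\lambda_{j} t} \, dt = \frac{\lambda_{i}}{\lambda_{i} + \lambda_{j}} \]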
We then introduce the latent variable
\[ Z_{ij} = \sum_{k=1}^{n_{ij}} \min(Y_{kj}, Y_{ki}) \]
which is simply the sum of the winning (i.e., minimum) arrival times over the \(n_{ij}\) games between teams \(i\) and \(j\). Again by the properties of exponentials, the minimum of two independent exponentials is itself exponential with rate \(\lambda_{i} + \lambda_{j}\), and a sum of \(n_{ij}\) i.i.d. exponentials is Gamma, so \(Z_{ij}\) follows a \(\text{Gamma}(n_{ij}, \lambda_{i} + \lambda_{j})\) distribution.
Coupling this with a \(\text{Gamma}(a,b)\) prior on each \(\lambda_{i}\), we can build a nice Gibbs sampler with the following conditionals:
\[ Z_{ij} \mid D, \lambda \sim \text{Gamma}(n_{ij}, \lambda_{i} + \lambda_{j}) \]
\[ \lambda_{i} \mid D, Z \sim \text{Gamma}\left(a + w_{i},\; b + \sum_{j > i} Z_{ij} + \sum_{j < i} Z_{ji}\right) \]
where \(w_{i}\) is the number of games team \(i\) won.
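Since each conditional is a standard Gamma draw, a single sweep of the sampler is short. Here's a minimal sketch in Julia; the names (`gibbs_sweep!`, `n`, `w`) are mine and not necessarily those in the repository, and it assumes the game counts and win totals have already been tallied:

```julia
using Distributions

# One sweep of the latent-variable Gibbs sampler described above.
# n[i,j] (i < j): games played between teams i and j;
# w[i]: games won by team i; a, b: hyperparameters of the Gamma prior.
# Note: Distributions.jl parameterizes Gamma(shape, scale), so a
# rate of λ[i] + λ[j] becomes a scale of 1 / (λ[i] + λ[j]).
function gibbs_sweep!(λ::Vector{Float64}, Z::Matrix{Float64},
                      n::Matrix{Int}, w::Vector{Int}, a::Real, b::Real)
    T = length(λ)
    # Z_ij | D, λ ~ Gamma(n_ij, λ_i + λ_j)
    for i in 1:T-1, j in i+1:T
        n[i, j] > 0 && (Z[i, j] = rand(Gamma(n[i, j], 1 / (λ[i] + λ[j]))))
    end
    # λ_i | D, Z ~ Gamma(a + w_i, b + Σ_{j>i} Z_ij + Σ_{j<i} Z_ji)
    for i in 1:T
        rate = b
        for j in (i + 1):T
            rate += Z[i, j]
        end
        for j in 1:(i - 1)
            rate += Z[j, i]
        end
        λ[i] = rand(Gamma(a + w[i], 1 / rate))
    end
    return λ
end
```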
Somewhere along the way I lost the R code, so I decided to rewrite it using Julia. The code is hosted at this repository.
Using data from spreadsheet-sports.com, we can scrape all of the game scores for the current season, and the calculation of the summary stats \(n_{ij}\) and \(w_{i}\) is straightforward.
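The scraped format isn't shown here, but assuming each game can be reduced to a (winner, loser) pair of team names, the tallies might be computed with a hypothetical helper like this:

```julia
# Build the pairwise game counts n[i,j] and win totals w[i]
# from a list of (winner, loser) team-name pairs.
function summarize(games::Vector{Tuple{String,String}})
    teams = sort(unique(vcat(first.(games), last.(games))))
    idx = Dict(t => k for (k, t) in enumerate(teams))
    T = length(teams)
    n = zeros(Int, T, T)
    w = zeros(Int, T)
    for (winner, loser) in games
        i, j = idx[winner], idx[loser]
        n[min(i, j), max(i, j)] += 1  # store counts in the upper triangle
        w[i] += 1
    end
    return teams, n, w
end
```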
Results
In this season of March Madness, the obvious question is: which team does this model think is the best in the NCAA? We can draw from the posterior and see how often each team is ranked highest. The results are summarized in the following table:
| Team          | \(\Pr(\text{Team is best})\) |
|---------------|------------------------------|
| Kentucky      | 0.516                        |
| Villanova     | 0.115                        |
| Wisconsin     | 0.107                        |
| Virginia      | 0.073                        |
| Gonzaga       | 0.062                        |
| Duke          | 0.054                        |
| Arizona       | 0.049                        |
| Notre Dame    | 0.014                        |
| Northern Iowa | 0.007                        |
| Kansas        | 0.003                        |
Thus it's clear that Kentucky is heavily favored. This is an unavoidable consequence of using only win records in our model: since they're undefeated, they are going to have a large \(\lambda_{i}\); in fact, under maximum likelihood we would have \(\lambda_{i} = \infty\).
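For reference, the probabilities in the table are just a Monte Carlo tally: the fraction of posterior draws in which each team has the largest \(\lambda\). A sketch, assuming the draws (e.g., from repeated `gibbs_sweep!` calls) are stacked into an iterations-by-teams matrix:

```julia
# Estimate Pr(team is best) as the fraction of posterior draws
# in which that team has the largest λ.
function prob_best(draws::Matrix{Float64}, teams::Vector{String})
    counts = zeros(Int, length(teams))
    for s in eachrow(draws)
        counts[argmax(s)] += 1
    end
    Dict(t => c / size(draws, 1) for (t, c) in zip(teams, counts))
end
```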
Footnotes:
¹ And of course, I didn't even find the time to get this post out before the tournament started.