I generally don't follow the news. Thus, when I was asked this week how I thought Britain's vote to exit the European Union (aka Brexit) would turn out, I followed my usual procedure: I looked at the prediction markets.

Prediction markets are markets where you can speculate on the outcome of events. The way it works is that you can buy and sell shares of the event "The outcome of the British vote to exit the European Union will be Leave". If Brexit occurs, each share pays out $1. If no Brexit occurs, you get nothing. Clearly, if you think Brexit will occur with probability \(p\), you should only buy shares when the price is less than \(p\)1. Typically you interpret the current market price as the consensus probability that the event will occur.
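
To make the arithmetic concrete, here's a minimal sketch in Python; the prices and probabilities are invented purely for illustration:

```python
# Expected profit from buying one share of an event contract that pays
# $1 if the event occurs and $0 otherwise.
def expected_profit(p_event, price):
    """p_event: your subjective probability; price: market price in dollars."""
    return p_event * 1.0 - price  # expected payout minus cost

# If you believe Brexit has probability 0.30 and shares trade at $0.25,
# buying is positive expected value; at $0.35 it isn't.
print(expected_profit(0.30, 0.25))  # +0.05 per share -> buy
print(expected_profit(0.30, 0.35))  # -0.05 per share -> don't buy
```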

I don't participate in prediction markets (I don't follow the news, remember?), but I find them quite fascinating. The Brexit situation brought up an interesting question. At the time I checked, the odds on the aggregator Predictwise were around 75% that the vote would fail. The vote ended up succeeding. Should I interpret this as a sign that prediction market odds should not be trusted?

The short answer is no. If we take the market's 75% at face value, I shouldn't be surprised by the result; a 25% event happens one time in every four. But that's too easy; I can't always appeal to being unlucky. So what can I do?

Calibration

In The Signal and the Noise, Nate Silver argues that the best way to judge a forecaster's accuracy is calibration. Do the events that they predict will happen X% of the time actually occur about X% of the time?

"One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated." Nate Silver, Signal and the Noise

I don't find calibration nearly as compelling.

Consider forecasting the party that won the past few presidential elections; the sequence is [Rep, Rep, Dem, Dem, Rep, Rep, Dem, Dem].

Suppose I gave a 50% chance to a Republican victory in each election because that seems to be roughly the base rate. I'm perfectly calibrated on this data.

But now consider my non-paradox-inducing-time-traveller self who gives the following sequence of predictions for a Republican victory: [1, 1, 0, 0, 1, 1, 0, 0]. That prediction is a whole lot more useful, but no more calibrated than my uninformed one.

Now, consider that my time-traveller self can't outright tell me the future (for fear of a paradox), so he rolls a d20 and on a 1 he lies to me about the outcome. Say he rolled the 1 on the fifth election; my predictions then become [0.95, 0.95, 0.05, 0.05, 0.05, 0.95, 0.05, 0.05]. That's going to be uncalibrated, but it seems a lot better than my base rate prediction.

From this we see that calibration can't distinguish between useful (time-traveller) and non-useful (base rate) calibrated predictions, and it punishes useful but poorly calibrated (d20) predictions. So, while it's nice to be calibrated, we should look a little harder for a way to test our forecasts.
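
To see this concretely, here's a minimal sketch of a calibration check in Python, using the exact sequences from the example (Rep coded as 1, Dem as 0; calibration_table is an ad hoc helper):

```python
from collections import defaultdict

def calibration_table(predictions, outcomes):
    """For each stated probability, compare it to the observed frequency
    of the event among the forecasts that stated that probability."""
    buckets = defaultdict(list)
    for p, x in zip(predictions, outcomes):
        buckets[p].append(x)
    return {p: sum(xs) / len(xs) for p, xs in sorted(buckets.items())}

outcomes       = [1, 1, 0, 0, 1, 1, 0, 0]  # Rep = 1, Dem = 0
base_rate      = [0.5] * 8
time_traveller = [1, 1, 0, 0, 1, 1, 0, 0]
d20            = [0.95, 0.95, 0.05, 0.05, 0.05, 0.95, 0.05, 0.05]

print(calibration_table(base_rate, outcomes))       # {0.5: 0.5} -> calibrated
print(calibration_table(time_traveller, outcomes))  # {0: 0.0, 1: 1.0} -> calibrated
print(calibration_table(d20, outcomes))             # {0.05: 0.2, 0.95: 1.0} -> not
```

Both the uninformed and the time-traveller forecasts pass this check; only the more useful d20 forecast fails it.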

Scoring Rules

An alternative to calibration is to score predictions and compare numbers. The measure that I typically see for these types of comparisons is the Brier score.

\[ \text{Score} = \frac{1}{N} \sum_{t=1}^{N} (p_{t} - X_{t})^{2} \]

where \(p_{t}\) is your probability estimate and \(X_{t}\) is an indicator for the event. The Brier score is an example of a strictly proper scoring rule: your expected score is uniquely optimized by reporting your true probability \(\hat{p}_{t}\). I find the logarithmic rule,

\[ \frac{1}{N} \sum_{t=1}^{N} -\left(X_{t}\log(p_{t}) + (1 - X_{t})\log(1 - p_{t})\right) \], appealing, but I'll admit I don't quite understand why you would choose one rule over another.

If we apply these scores to the election prediction competition between my current and time-travelling selves, we see that the constant prediction of 0.5 earns a Brier score of 0.25 and a logarithmic score of \(\log(2) \approx 0.69\), while the time traveller scores 0 on both; lower is better in each case.
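
Here's a minimal sketch checking those numbers; brier_score and log_score are ad hoc implementations of the formulas above:

```python
import math

def brier_score(predictions, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - x) ** 2 for p, x in zip(predictions, outcomes)) / len(outcomes)

def log_score(predictions, outcomes):
    """Mean negative log-likelihood; 0 is a perfect score."""
    return sum(-(x * math.log(p) + (1 - x) * math.log(1 - p))
               for p, x in zip(predictions, outcomes)) / len(outcomes)

outcomes  = [1, 1, 0, 0, 1, 1, 0, 0]
base_rate = [0.5] * 8

print(brier_score(base_rate, outcomes))  # 0.25
print(log_score(base_rate, outcomes))    # log(2) ~= 0.693
# The truthful time traveller's Brier score is exactly 0. His log score is
# 0 too in the limit, though computing it directly would hit log(0), so
# implementations usually clamp probabilities away from 0 and 1.
```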

One thing still doesn't sit right, though. The probability estimate should be changing all the time as new information arrives. For instance, Nate Silver put single-digit odds on Trump's nomination until the primaries started. Now he's getting flak for it and even wrote a post apologizing for it. But, especially early on, that was an entirely reasonable position to hold given the available information. However, the Brier score is still going to penalize him relative to someone who started predicting much later in the primary season, when more information was available.

Efficient Market Hypothesis

So it seems that just by looking at predictions we're going to have a hard time judging forecasters. What if we take a different approach, then, and try to create a process that we trust to give us the best available probabilities? This might result in something like prediction markets.

Our hope is that the Efficient-Market Hypothesis (EMH) holds: the prices of assets in a market already reflect the available information. There are a couple of variants, but the most realistic formulation states that there's no way to consistently outperform the market's predictions without possessing new information. Indeed, a key feature of markets is that they give participants an incentive to make the market efficient: if you can outperform the market, your trades will eventually move its prices until your advantage disappears.

That's the theory, anyway. I'm not particularly sure how to quantify or test this. I certainly act as if the EMH is true for the stock market (and thus that I can't make money trading it), but I doubt that it's true in currently existing prediction markets. A big reason is that they're just far too small-stakes. Byzantine (and archaic, I might add) "gambling" regulations mean that US-based prediction markets like the Iowa Electronic Markets or PredictIt cap the amount of money you can wager. The other option is Betfair, but even there the volume is minuscule compared to markets like the currency markets.

There's also the issue of the usual set of cognitive biases. Given known phenomena like anchoring, availability bias, the bandwagon effect, and the ostrich effect, how can we expect markets, made up of irrational traders, not to exhibit the same effects? I'm less convinced by this objection; it seems that the logic of the EMH should still hold. If the market price is driven by irrational biases, somebody should be able to profit by correcting the market.

Conclusion

If it's this difficult to determine a prediction's merit, you might wonder what the point of evaluating these forecasts is in the first place. Why should we care?

Because otherwise we're incredibly unlikely to form true opinions. To quote George Orwell:

"Hence the contradictions and absurdities I have chronicled above, all finally traceable to a secret belief that one’s political opinions, unlike the weekly budget, will not have to be tested against solid reality." George Orwell

If you don't pay attention to whether your favorite pundit/blog/algorithm/palmreader makes good predictions, you're just going to hear what you want to hear. Occasionally you'll be greatly upset by something "unexpected", but it's quite easy to move on and continue as if nothing ever happened. Admittedly, for things like sports predictions, this isn't much of a problem. However, for something like Brexit, the cost of having unpredictive beliefs is somewhat higher.

Sometimes the costs are rather concrete. The person who asked my opinion about Brexit has the annoying habit of being "100% certain" of whatever event is under discussion. I derived an unvirtuously smug sense of satisfaction when offering a bet at 1:100 odds for Brexit revealed that "100% certain" is less than 99% certain2. Which is a shame; I could have had a rather nice dinner to celebrate.

Footnotes:

1

There are, inevitably, a bunch of technical caveats. For example, if I expect to lose money from a Brexit, I might buy Brexit shares as a hedge and thus accept a slightly higher price than my estimate of the probability suggests.

2

If you read my last post, you'll notice that I tend to take things a bit literally. I will argue, however, that I'm doing a public service by pointing out these imprecisions in language. To quote Orwell again, our language "becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts."