Monty Hall and the Two-Armed Bandit

Suppose, for a moment, that the Monty Hall problem isn’t one of the most heavily-trodden pieces of popular maths of the last three decades, and that – contrary to all good sense – you have no idea of what to do when faced with the same problem.

Since you’ve been living under a rock, I’ll recap the situation: in front of you are three doors. You know that behind one of them is a car (which you want) and behind the two others are goats (which you don’t. You have enough goats, thank you). You pick a door.

Behind at least one of the two other doors is a goat. The host – who knows where the car is – deliberately opens one of these goaty doors and asks “would you rather stick with your original choice or switch to the other closed door?”

We’re supposing you don’t know the answer, but that you’re allowed to play the game as many times as you like, with the goal of winning as many cars as possible. In this post, I’ll show one possible approach: the multi-armed bandit.

The multiarmed bandit (MAB)

The idea of the MAB is to find a balance between exploration and exploitation: you want to test all the possible strategies enough to make sure you know which is best, but you also want to use the best strategy as often as possible.

There are many ways to tackle the problem, but I’m going to jump straight into the beta distribution.

The beta distribution is a sort of binomial distribution in reverse: if you tell it how many successful and unsuccessful outcomes you’ve had from Bernoulli trials with an underlying probability $p$, it gives the probability of $p$ taking on any specific value between 0 and 1. (In fact, there’s a slight wrinkle: the input to the beta distribution is one more than the number of successful and unsuccessful trials - but that’s a small detail.)

(The probability density function of the beta distribution $\beta(a,b)$, in case you want it, is $f(p;a,b) = \frac{\Gamma(a+b)}{\Gamma{a}\Gamma{b}}p^{a-1}(1-p)^{b-1}$.)

So how does it help? We can use the beta distribution to pick our next strategy.

In this example, each time the game is played, one of the strategies will be correct and one will be wrong. It doesn’t matter which one we consider a success; let’s say we count a success any time “switch” would win the car.

We keep track of how many successes and failures we see (not how many times we win the car, here, but whether switching would win), then draw a random number from the appropriate beta distribution. If this number is greater than a half, we choose to switch; if it’s less than a half, we stick.

And what happens? This happens:

Probability graph

Very quickly – after 30 or so trials – it becomes clear that switching is most likely the better strategy (the blue curve approaches 1). More to the point, the green curve (how many times you win) is only very slightly lower than the perfect strategy (always picking switch). The beta distribution approach is not a perfect strategy, but it’s pretty good!

If you want to run your own simulation, I’ve put code here.

The multiarmed bandit (MAB)

A selection of other posts