First Bayes#
In the problem for the education minister we had a sample of fast-track-marked exams from 2019, and we found that the mean mark was about 58.74. We wondered what we could say about the eventual mean when we have the marks for all 8000 or so students.
For example, we might wonder how likely it is that the eventual mean will be near 65.25, as it was in 2018. Or we might wonder whether we could be say that the eventual mean for all the papers will around the sample mean — 58.74 — plus or minus a bit. If so, what value should we give to “a bit”?
This kind of problem can be called a problem of reverse probability.
We start with simple probabilities, where we ask questions like this: what is the probability of seeing a mean around 58.74 if the population mean is actually 65.25? Then we go in reverse to ask questions like: what is the probability that the population mean is around 65.25 given the sample mean of 58.74?
A reverse probability game#
Imagine I offer you one of two boxes.
One box has four red balls and one green ball. Call this BOX4.
The other box has two red balls and three green balls. Call this BOX2.
I haven’t told you which box I gave you, but I do tell you that there is a 30% chance that I gave you BOX4, and a 70% chance I gave you BOX2.
Now let’s say that you shake the box I gave you, to shuffle the balls, then close your eyes, and take out one ball. You open your eyes to find you have a red ball.
What is the chance that I gave you BOX4?
This is an example of a reverse probability problem. You are working back from what you see (the red ball) to what I gave you (the box).
In our exam mark problem, we are working back from what we saw (the sample mean of 54.51) to the eventual mean for all the exams.
How are we going to start on our solution to the BOX4, BOX2 reverse probability problem? Simulation!
import numpy as np
# Make random number generator.
rng = np.random.default_rng()
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Make a box with 4 red balls and 1 green ball
box4 = np.repeat(['red', 'green'], [4, 1])
box4
array(['red', 'red', 'red', 'red', 'green'], dtype='<U5')
# Make a box with 2 red balls and 3 green balls
box2 = np.repeat(['red', 'green'], [2, 3])
box2
array(['red', 'red', 'green', 'green', 'green'], dtype='<U5')
Now we make 10000 trials, where we:
Choose BOX4 or BOX2, with a 30% chance of BOX4.
Choose a ball at random from the resulting box.
n_iters = 10000
# The box for this trial.
box_nos = np.repeat([1], n_iters)
# The color of the ball we chose.
ball_colors = np.repeat(['green'], n_iters)
for i in np.arange(n_iters):
# Choose a box number with a 30% chance of BOX4
box_no = rng.choice([4, 2], p=[0.3, 0.7])
# Choose a ball at random from the box.
if box_no == 4:
# Choose a ball at random from BOX4.
ball_color = rng.choice(box4)
else: # box 4
# Choose a ball at random from BOX2.
ball_color = rng.choice(box2)
# Store the results.
box_nos[i] = box_no
ball_colors[i] = ball_color
Last we put the results into a data frame for convenience:
# Make these into a data frame.
trial_results = pd.DataFrame()
trial_results['box no'] = box_nos
trial_results['ball color'] = ball_colors
trial_results.head()
box no | ball color | |
---|---|---|
0 | 2 | red |
1 | 2 | red |
2 | 2 | red |
3 | 2 | green |
4 | 2 | red |
Now we can see the proportion of trials on which we drew a red ball, where the box we got was BOX4.
# Of the trials giving a red ball, what proportion came from box 4?
red_ball_trials = trial_results[trial_results['ball color'] == 'red']
p_box4 = np.count_nonzero(red_ball_trials['box no'] == 4) / len(red_ball_trials)
p_box4
0.4578937272204105
Of the trials giving a red ball about 46% came from BOX4. If we see a red ball, there is a 46% chance we have sampled from BOX4.
You have just solved your first problem in reverse probability. The problem will soon reveal a simple calculation in probability called Bayes theorem.
This is a fundamental building block, so let’s go back over the simulation, to think about why we got this number.
We can think of all these trials as coming about from a branching tree.
At the first branching point, we split into two branches, one for BOX4 and one for BOX2. The BOX4 branch is width 0.3 and the BOX2 branch is width 0.7, because the probability of BOX4 is 0.3 (30%).
The simulation is very unlikely to give these numbers exactly, because it took a random sample. So, the simulation proportions will be close to the probabilities we calculated above, but not exactly the same.
box4_trials = trial_results[trial_results['box no'] == 4]
box2_trials = trial_results[trial_results['box no'] == 2]
n_trials = len(trial_results)
print('Box4 proportion', len(box4_trials) / n_trials)
print('Box2 proportion', len(box2_trials) / n_trials)
Box4 proportion 0.2977
Box2 proportion 0.7023
At the second branching point, each branch splits into two.
The BOX4 branch splits into a “red” branch, which carries 4/5 (0.8, 80%) of the BOX4 trials, and a “green” branch, that carries 1/5 (0.2, 20%) of the BOX4 trials, because the probability of getting a red ball from BOX4 is 4 in 5.
The BOX2 branch splits into a “red” branch, which carries 2/5 (0.4, 40%) of the BOX2 trials, and a “green” branch, which carries 3/5 (0.6, 60%) of the BOX2 trials, because the probability of getting a red ball from BOX2 is 2 in 5.
Thus the proportion of trials that are both from BOX4 and give a red ball is 0.3 (the width of the BOX4 branch) * 0.8 (the proportion of BOX4 trials that give red) = 0.24.
box4_and_red = box4_trials[box4_trials['ball color'] == 'red']
prop_box4_and_red = len(box4_and_red) / n_trials
print('Box4 and red proportion', prop_box4_and_red)
Box4 and red proportion 0.2387
The proportion of trials that are both from BOX2 and give a red ball is 0.7 (the width of the BOX2 branch) * 0.4 (the proportion of BOX2 trials that give red) = 0.28.
box2_and_red = box2_trials[box2_trials['ball color'] == 'red']
prop_box2_and_red = len(box2_and_red) / n_trials
print('Box2 and red proportion', prop_box2_and_red)
Box2 and red proportion 0.2826
We get the overall proportion of red by adding the proportion that is BOX4 and red to the proportion that is BOX2 and red, because these are all the red trials. This is 0.24 + 0.28 = 0.52.
n_red = len(box4_and_red) + len(box2_and_red)
prop_red = n_red / n_trials
print('Overall proportion of red', prop_red)
Overall proportion of red 0.5213
We’ve already discovered about that 0.24 (24%) of all trials are BOX4 and red. So the proportion of all red trials, that are BOX4 and red, is 0.24 / 0.52 = 0.4615385.
print('Proportion of all red trials that are box4', (prop_box4_and_red / prop_red))
Proportion of all red trials that are box4 0.4578937272204105
To go over the logic again:
We want the proportion of “red” trials that came from BOX4.
To do this, we calculate the proportion of trials that are both BOX4 and red, and divide by the overall proportion of red trials.
The proportion of red trials that are both BOX4 and red is (the proportion of BOX4 trials) multiplied by (the proportion of BOX4 trials that are red).
We have just discovered Bayes theorem.