The magic of 2 by 2 tables#
# Our usual imports
import numpy as np
# Make random number generator.
rng = np.random.default_rng()
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
Show code cell content
# HIDDEN
# An extra tweak to make sure we always get the same random numbers.
# We need the same set of random numbers to match the illustrations below.
# Do not use this in your own code; you nearly always want an unpredictable
# stream of random numbers. Making them predictable in this way only makes
# sense for a very limited range of things, like tutorials and tests.
# As you can see, we replace the usual random number generator we
# made above, with another one with predictable random numbers.
rng = np.random.default_rng(seed=1966)
At the end of the noble politics page, you came across an enigma, from analyzing two by two tables of counts.
This page shows you another version of that enigma, and gives a few explanations for these strange effects.
To start with, we make a simple data frame, where each row corresponds to a person. Each person, and therefore, each row, has two labels. If you have read the Noble Politics page, you will see what inspired this example.
The first label for each person, called “Party” can be either “Yes”, or “No”.
The second label for each person, called “Respondent”, can be either “R” or “B”.
Here is our data frame from scratch:
Show code cell source
# Create a data frame from scratch with some data.
people = pd.DataFrame([["Yes", "R"],
["Yes", "R"],
["Yes","R"],
["No", "R"],
["No", "R"],
["Yes","B"],
["Yes","B"],
["No","B"]],
columns = ['Respondent', 'Party'])
# Show the result
people
Respondent | Party | |
---|---|---|
0 | Yes | R |
1 | Yes | R |
2 | Yes | R |
3 | No | R |
4 | No | R |
5 | Yes | B |
6 | Yes | B |
7 | No | B |
We can ask Pandas to make a 2 by 2 table of counts, for these labels. Each entry in the table gives the number of times the combination of labels occurs. For example, the cell in the “Yes” row, “R” column gives the number of rows in the original table that have a “Yes” label for “Party” and a “R” label for “Respondent”. Call this the Yes-with-R cell.
two_by_two = pd.crosstab(people['Party'], people['Respondent'])
two_by_two
Respondent | No | Yes |
---|---|---|
Party | ||
B | 1 | 2 |
R | 2 | 3 |
You might want to confirm these counts. You should see that there are in fact
3 rows which have both the “Yes” label and the “R” label, and this matches the
bottom right entry in the pd.crosstab
table.
We can add the row and column totals to this table, with some fancy Pandas code (unhide the code in this page, or see the notebook version for details):
Show code cell content
# Fancy code for adding row / column totals, giving them colors.
from matplotlib.colors import ListedColormap
PALE_GREY = ListedColormap([0.8, 0.8, 0.8])
MID_GREY = ListedColormap([0.6, 0.6, 0.6])
def with_margins(df):
""" Data frame displayed with row, column totals added, colored.
"""
# Add row totals to copy of data frame.
row_totals = pd.DataFrame([df.sum()], index=['Row totals'])
df_margins = pd.concat([df, row_totals])
# Add column totals.
df_o = df_margins.assign(**{'Col totals': df_margins.sum(axis='columns')})
# Set colors for row, column totals.
styler = df_o.style.background_gradient(subset='Col totals', cmap=PALE_GREY)
styler = styler.background_gradient(subset=('Row totals',), cmap=PALE_GREY)
styler = styler.background_gradient(subset=('Row totals', 'Col totals'),
cmap=MID_GREY)
return styler
# Display the table with colors, using fancy plotting function.
with_margins(two_by_two)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 1 | 2 | 3 |
R | 2 | 3 | 5 |
Row totals | 3 | 5 | 8 |
Now let’s say we want to create a random association between the “Party” (Yes/No) column and the “Respondent” (R/B) column. We can do that by shuffling (permuting) the Yes/No values, and recreating the table, like this:
parties = people['Party']
respondents = people['Respondent']
permuted_tab_1 = pd.crosstab(rng.permutation(parties), respondents)
with_margins(permuted_tab_1)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 3 | 0 | 3 |
R | 0 | 5 | 5 |
Row totals | 3 | 5 | 8 |
Note that I could have shuffled the “Respondent” labels instead, or both sets of labels - any of these operations creates what we want - which is a random association between the “Party” and “Respondent” labels.
Let’s make another few of those random tables:
permuted_tab_2 = pd.crosstab(rng.permutation(parties), respondents)
with_margins(permuted_tab_2)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 2 | 1 | 3 |
R | 1 | 4 | 5 |
Row totals | 3 | 5 | 8 |
permuted_tab_3 = pd.crosstab(rng.permutation(parties), respondents)
with_margins(permuted_tab_3)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 0 | 3 | 3 |
R | 3 | 2 | 5 |
Row totals | 3 | 5 | 8 |
You will see that, if the top left value goes up, or down, compared to the
original table, then the bottom right value goes up, or down, by the same
amount. Use c
to mean the change in the top-left value from the original.
The original value for that element was 1, so if the new value is 0, \(c = -1\).
If the new value was 2, then \(c = 1\).
You will also see that the bottom left value goes down by the same amount as the top left value goes up. If the top left value goes up by \(c\), then the bottom left value does down by \(c\).
All the values in the table appear to be linked, and move in lock-step.
Notice too that the column and row totals do not — and in fact cannot — change, whatever the association of the R/B and Yes/No labels, for reasons we go into below.
This is another version of the enigma you saw at the end of the Noble Politics page.
Why are the table values linked?#
Explanation 1: the row and column totals cannot change#
The first explanation for the relationship between the values is — the total counts over the rows and columns are fixed.
They are fixed because they depend only on the labels, not on the relationship of the labels. So, we cannot change the counts over rows or columns by shuffling the order of either or both set of labels.
For example, consider the sum of counts over columns. In our example, the columns are for “B” and “R”, and the column sums of the counts are the total number of “B” and “R” labels.
By shuffling the order of the labels, we can change the number in the B-with-Yes cell, but we can’t change the total number of “B” labels, so, if the B-with-Yes number increases, the B-with-No number has to decrease by the same amount.
The same argument applies for the row totals.
You may now see that, if you give me the value for any one of the four elements in the table, I can fill in the rest, because they follow from the fact that I know what the rows and columns have to add up to.
Explanation 2: what goes up, must come down#
Here we think about what happens as we permute the Yes/No labels. The same argument applies to permuting the R/B labels, and in fact, to permuting both the labels.
Consider the original table.
with_margins(two_by_two)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 1 | 2 | 3 |
R | 2 | 3 | 5 |
Row totals | 3 | 5 | 8 |
Now consider permutations in the order of the Yes/No column of the original table. We will just permute the first “Yes”, for now.
Imagine the first “Yes” gets swapped with the “Yes” in one of the other “R”, “Yes” rows. We still end up with the same two by two counts table.
Next swap the first “Yes” with the “No” in one of the “R”, “No” rows. The first row is “R”, “No” but the other row has now become “R”, “Yes”, we still have the same number of “R”, “Yes” and “R”, “No”, and the counts table doesn’t change.
Now swap the first “Yes” with the “Yes” in one of the “B”, “Yes” rows. We still have the same number of “R”, “Yes” and “B”, “Yes” rows, just in a different row order, and the counts table doesn’t change.
Last, we swap the first “Yes” with the “No” in one of the “B”, “No” rows. Now, for the first time, the counts table does change, like this:
Show code cell source
# The data frame with first "Yes" permuted to last row.
perm_people = pd.DataFrame([["No", "R"],
["Yes", "R"],
["Yes","R"],
["No", "R"],
["No", "R"],
["Yes","B"],
["Yes","B"],
["Yes","B"]],
columns = ['Respondent', 'Party'])
perm_people
Respondent | Party | |
---|---|---|
0 | No | R |
1 | Yes | R |
2 | Yes | R |
3 | No | R |
4 | No | R |
5 | Yes | B |
6 | Yes | B |
7 | Yes | B |
Here is the cross-tabulation for that case:
Show code cell source
perm_crosstab = pd.crosstab(perm_people['Party'], perm_people['Respondent'])
with_margins(perm_crosstab)
Respondent | No | Yes | Col totals |
---|---|---|---|
B | 0 | 3 | 3 |
R | 3 | 2 | 5 |
Row totals | 3 | 5 | 8 |
When we do this single change, we have:
One less “R”, “Yes” row (bottom-right count goes down by 1)
One less “B”, “No” row (top-left count down by 1)
One more “R”, “No” row (bottom-left count up by 1)
One more “B”, “Yes” row (top-right count up by 1)
As we permute the “Yes”, “No” labels, we either leave the total number of unique “R/B”, “Yes/No” pairings unchanged, or we change the number of all the pairings at once.