Cross-tabulation#
Show code cell content
import numpy as np
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
It is often useful to make tables that count rows in categories defined by columns.
Here we will use a dataset with data about passengers on the RMS Titanic.
See the Titanic dataset page for some more information about this dataset, and a link to download the data.
# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
| name | gender | age | class | embarked | country | fare | survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | Southampton | United States | 7.11 | no |
| 1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | Southampton | United States | 20.05 | no |
| 2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | Southampton | United States | 20.05 | no |
| 3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.05 | yes |
| 4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.13 | yes |
As you can see, we have various bits of information about the passengers on
the Titanic. Several columns give labels - categories. For example, the
gender column categorizes the passenger as male or female, and the
survived column categorizes the passenger by whether they survived the
sinking - yes or no.
A one-way table#
As you know, we can use the value_counts method of the data frame to see the
unique values for any column, with the counts of the number of times each
value occurs.
# Counts of each unique value in the 'survived' column.
df['survived'].value_counts()
survived
no 743
yes 470
Name: count, dtype: int64
Often we want to see proportions instead of counts. We can use the normalize keyword argument for this:
# Proportions of each unique value in the 'survived' column.
df['survived'].value_counts(normalize=True)
survived
no 0.612531
yes 0.387469
Name: proportion, dtype: float64
Two-way tables#
We often want to break down the rows by more than one category. Remember that
the Titanic passengers and crew tended to give preference to women and
children, when loading the lifeboats. So, we may want to see the counts of
passengers who survived, broken down by gender.
This is the role of the Pandas crosstab function. It is a Pandas function
because it is function inside the Pandas module; we can get this function with
pd.crosstab (assuming we have done the usual import pandas as pd).
The first argument to pd.crosstab is the category we want to see in the rows; the second argument is the category we want to see in the columns.
Here is a cross-tabulation of gender (in the rows) by survived (in the columns):
# Cross-tabulation of counts for 'gender' (rows) by 'survived' (columns).
pd.crosstab(df['gender'], df['survived'])
| survived | no | yes |
|---|---|---|
| gender | ||
| female | 120 | 321 |
| male | 623 | 149 |
We will often want to see these values as proportions rather than counts. For
example, we may be interested in the proportion of women and men that
survived. As for value_counts above, we use the normalize keyword to ask
for proportions. This time we have to specify the direction that Pandas
should use for the proportion. We could be interested in the proportion
across the column (proportions of male and female passengers within the
yes “survived” category, likewise for the no category). More likely, in
this case, we will be interested in proportions across the row (proportion who
survived within male category, proportion who survived within female
category). We give Pandas this information with the value for the
normalize keyword argument. Pandas uses the term index to refer to the
rows. Remember, Pandas also uses the term index for the row labels.
# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the row.
pd.crosstab(df['gender'], df['survived'], normalize='index')
| survived | no | yes |
|---|---|---|
| gender | ||
| female | 0.272109 | 0.727891 |
| male | 0.806995 | 0.193005 |
We could also ask for the proportions over the columns.
# Cross-tabulation of proportions for 'gender' (rows) by 'survived' (columns).
# Proportions are over the column.
pd.crosstab(df['gender'], df['survived'], normalize='columns')
| survived | no | yes |
|---|---|---|
| gender | ||
| female | 0.161507 | 0.682979 |
| male | 0.838493 | 0.317021 |
Of course, the arguments we are passing to pd.crosstab are Pandas Series. We can pass any Pandas series as our row or column specification.
For example, we might want to make a new Series that categorizes passengers as
male, female or child. We can use that Series as an argument to
pd.crosstab.
mwc = df['gender']
mwc[df['age'] <= 16] = 'child'
pd.crosstab(mwc, df['survived'], normalize='index')
| survived | no | yes |
|---|---|---|
| gender | ||
| child | 0.467153 | 0.532847 |
| female | 0.257979 | 0.742021 |
| male | 0.831429 | 0.168571 |
Three-way tables (or four …)#
Another factor that we know was in play was passenger class. For various reasons, first class passengers, in particular, got better access to the lifeboats than passengers in other classes. This adds a further factor that might explain the numbers and proportions that survived.
Now we have two factors (categories) that we think may predict survival - class and gender.
pd.crosstab takes this in its stride. We can just pass a list of Series as our row or column specification.
# Counts for passengers who survived, broken down by class and male, female,
# child.
categories = [df['class'], mwc]
pd.crosstab(categories, df['survived'])
| survived | no | yes | |
|---|---|---|---|
| class | gender | ||
| 1st | child | 1 | 8 |
| female | 4 | 124 | |
| male | 109 | 55 | |
| 2nd | child | 1 | 25 |
| female | 11 | 77 | |
| male | 136 | 13 | |
| 3rd | child | 62 | 40 |
| female | 82 | 78 | |
| male | 337 | 50 |
normalize over the rows will give proportions for each sub-category:
# Proportions surviving in sub-categories of class, male/female/child.
pd.crosstab(categories, df['survived'], normalize='index')
| survived | no | yes | |
|---|---|---|---|
| class | gender | ||
| 1st | child | 0.111111 | 0.888889 |
| female | 0.031250 | 0.968750 | |
| male | 0.664634 | 0.335366 | |
| 2nd | child | 0.038462 | 0.961538 |
| female | 0.125000 | 0.875000 | |
| male | 0.912752 | 0.087248 | |
| 3rd | child | 0.607843 | 0.392157 |
| female | 0.512500 | 0.487500 | |
| male | 0.870801 | 0.129199 |
You can specify multiple Series for the columns of the table. Here we just run the equivalent table but reversing the rows and columns:
# Proportions surviving in sub-categories of class, male/female/child,
# with 'survived' in the rows, and the subcategories in the columns.
pd.crosstab(df['survived'], categories, normalize='columns')
| class | 1st | 2nd | 3rd | ||||||
|---|---|---|---|---|---|---|---|---|---|
| gender | child | female | male | child | female | male | child | female | male |
| survived | |||||||||
| no | 0.111111 | 0.03125 | 0.664634 | 0.038462 | 0.125 | 0.912752 | 0.607843 | 0.5125 | 0.870801 |
| yes | 0.888889 | 0.96875 | 0.335366 | 0.961538 | 0.875 | 0.087248 | 0.392157 | 0.4875 | 0.129199 |
You can specify more than two columns in the row or column arguments, or both. Try experimenting with different arguments.