The power of groupby#
Show code cell content
import numpy as np
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
This page follows from the Crosstab page.
As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.
See the Titanic dataset page for some more information about this dataset, and a link to download the data.
# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | Southampton | United States | 7.11 | no |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | Southampton | United States | 20.05 | no |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | Southampton | United States | 20.05 | no |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.05 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.13 | yes |
In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.
Pandas also has a very powerful mechanism for grouping observations by
category, available by the Pandas data frame groupby
method.
The transformations available with groupby
are far more general than those
in pd.crosstab
. Among many other things, groupby
allows us to calculate
many different summaries of rows in different categories.
For much more detail on groupby
, see the Python Data Science Handbook
chapter,
by Jake Vanderplas.
Just for example, imagine we were interested in the average age of the
passengers in each class
.
Groups and GroupBy#
We could start by making a Pandas groupby
object, like this:
by_class = df.groupby('class')
by_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f5481d0>
The GroupBy object has categorized the rows according to the values in the
class
column. Therefore, the GroupBy object has three groups, one for
1st
, one of 2nd
and one for 3rd
.
We can see how many groups we have with Python’s len
function:
# The number of groups.
len(by_class)
3
If the fancy takes us, we can fetch the rows corresponding to the groups with
the get_group
method.
first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
20 | Allen, Miss. Elisabeth Walton | female | 29.000000 | 1st | Southampton | United States | 211.6009 | yes |
22 | Allison, Master. Hudson Trevor | male | 0.916667 | 1st | Southampton | Canada | 151.1600 | yes |
23 | Allison, Miss. Helen Loraine | female | 2.000000 | 1st | Southampton | Canada | 151.1600 | no |
24 | Allison, Mr. Hudson Joshua Creighton | male | 30.000000 | 1st | Southampton | Canada | 151.1600 | no |
25 | Allison, Mrs. Bessie Waldo | female | 25.000000 | 1st | Southampton | United States | 151.1600 | no |
This is exactly the same result we would get by selecting the first class passenger rows directly:
first_again = df[df['class'] == '1st']
first_again.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
20 | Allen, Miss. Elisabeth Walton | female | 29.000000 | 1st | Southampton | United States | 211.6009 | yes |
22 | Allison, Master. Hudson Trevor | male | 0.916667 | 1st | Southampton | Canada | 151.1600 | yes |
23 | Allison, Miss. Helen Loraine | female | 2.000000 | 1st | Southampton | Canada | 151.1600 | no |
24 | Allison, Mr. Hudson Joshua Creighton | male | 30.000000 | 1st | Southampton | Canada | 151.1600 | no |
25 | Allison, Mrs. Bessie Waldo | female | 25.000000 | 1st | Southampton | United States | 151.1600 | no |
Indexing with column labels#
As for DataFrames, you can index a GroupBy with column labels, to select the
given column(s) to work on later. As for DataFrames, this is direct indexing
with column labels (DICL). For example, here we index the GroupBy to select
the numerical columns age
and fare
:
by_class_fa = by_class[['age', 'fare']]
by_class_fa
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8278d2f3d0>
Notice that the GroupBy now only contains the age
and fare
columns.
by_class_fa.get_group('1st')
age | fare | |
---|---|---|
20 | 29.000000 | 211.6009 |
22 | 0.916667 | 151.1600 |
23 | 2.000000 | 151.1600 |
24 | 30.000000 | 151.1600 |
25 | 25.000000 | 151.1600 |
... | ... | ... |
1191 | 43.000000 | 35.0000 |
1194 | 21.000000 | 61.0707 |
1199 | 46.000000 | 35.1000 |
1201 | 62.000000 | 26.1100 |
1204 | 36.000000 | 135.1208 |
301 rows × 2 columns
You can also use index with a single column label, so the GroupBy only has a single column:
by_class_age = by_class['age']
by_class_age
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f55ce10>
Notice that the type of thing that came back has changed from a DataFrame GroupBy to a Series GroupBy, reflecting the fact we are working on a single column.
Aggregating#
The true power of GroupBy is in doing efficient operations on the groups, and returning the results.
Just for example, imagine that we wanted to get the mean age
and fare
for
each class.
Here is how we could do that for the whole DataFrame:
df[['age', 'fare']].mean()
age 29.698269
fare 32.346953
dtype: float64
We could do this for each class by going the long way round For each class we identify the matching rows, and columns of interest, then take their mean.
print('1st mean', df.loc[df['class'] == '1st', ['age', 'fare']].mean())
print('2nd mean', df.loc[df['class'] == '2nd', ['age', 'fare']].mean())
print('3rd mean', df.loc[df['class'] == '3rd', ['age', 'fare']].mean())
1st mean age 39.441584
fare 84.256920
dtype: float64
2nd mean age 30.193916
fare 21.543087
dtype: float64
3rd mean age 24.978557
fare 12.649753
dtype: float64
Notice that we have only asked the mean
function to work on the numerical
columns - it would not make sense to ask for a mean for name
.
The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.
df['country'].value_counts()
country
England 323
United States 257
Ireland 119
Sweden 104
Lebanon 70
Finland 54
Canada 33
Norway 26
France 23
Belgium 22
Bulgaria 17
Switzerland 17
Scotland 16
Croatia (Modern) 12
Croatia 11
Channel Islands 10
Spain 8
Wales 8
Hungary 7
Denmark 7
India 7
Argentina 7
Turkey 6
South Africa 6
Germany 5
Italy 4
Bosnia 4
Slovenia 4
Poland 3
Austria 2
Russia 2
Greece 2
Siam 2
Peru 2
Australia 2
Uruguay 2
Syria 1
Japan 1
Egypt 1
China/Hong Kong 1
Latvia 1
Yugoslavia 1
Slovakia (Modern day) 1
Cuba 1
Mexico 1
Name: count, dtype: int64
To your complete lack of surprise, enter GroupBy, to our rescue.
We can ask the GroupBy object to aggregate over the rows in each group. In
this case we want to use the mean
method to give us an aggregate value for
the rows in each group.
As before, we only want to take a mean on the numerical columns fare
and
age
.
Here is one way to do that:
# Select fare and age columns, apply np.mean aggregate function.
by_class_fa = by_class[['fare', 'age']]
class_means = by_class_fa.agg(np.mean)
class_means
/tmp/ipykernel_5825/2708520383.py:3: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
class_means = by_class_fa.agg(np.mean)
fare | age | |
---|---|---|
class | ||
1st | 84.256920 | 39.441584 |
2nd | 21.543087 | 30.193916 |
3rd | 12.649753 | 24.978557 |
Notice the benefits over doing this by hand, above. We have to do less typing.
We don’t have to know the unique values of class
beforehand. We get a Pandas
dataframe back, that displays nicely, and that we can index to get the values
we want:
# Label indexing of the groupby results table.
class_means.loc['1st', 'age']
39.441583610188374
In fact, through some particularly obscure and clever Python magic, we can use
mean
and other Pandas data frame methods as methods of the GroupBy object,
like this:
# Using mean function directly.
by_class_fa.mean()
fare | age | |
---|---|---|
class | ||
1st | 84.256920 | 39.441584 |
2nd | 21.543087 | 30.193916 |
3rd | 12.649753 | 24.978557 |
# Using median function directly.
by_class_fa.median()
fare | age | |
---|---|---|
class | ||
1st | 59.08 | 39.0 |
2nd | 15.01 | 29.0 |
3rd | 8.01 | 24.0 |
# Using sum function directly.
by_class_fa.sum()
fare | age | |
---|---|---|
class | ||
1st | 25361.3328 | 11871.916667 |
2nd | 5665.8318 | 7941.000000 |
3rd | 8209.6894 | 16211.083333 |
The function we are using can return more than one value, in which case, we get
a new column for each value. For example, the describe
method returns
various aggregate values for each column. Here’s describe
in action on the
main data frame:
df.describe()
age | fare | |
---|---|---|
count | 1213.000000 | 1213.000000 |
mean | 29.698269 | 32.346953 |
std | 14.036480 | 48.214773 |
min | 0.166667 | 3.030500 |
25% | 21.000000 | 7.180600 |
50% | 28.000000 | 14.100000 |
75% | 38.000000 | 31.050600 |
max | 74.000000 | 512.060700 |
When we apply describe
to the groups, we get a new column for each calculated
value that describe
returns.
by_class.describe()
age | fare | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
class | ||||||||||||||||
1st | 301.0 | 39.441584 | 13.881000 | 0.916667 | 29.0 | 39.0 | 49.0 | 71.0 | 301.0 | 84.256920 | 72.979218 | 5.0000 | 31.00 | 59.08 | 93.1000 | 512.0607 |
2nd | 263.0 | 30.193916 | 13.875730 | 0.583333 | 22.0 | 29.0 | 38.0 | 72.0 | 263.0 | 21.543087 | 13.555193 | 9.1309 | 13.00 | 15.01 | 26.0000 | 73.1000 |
3rd | 649.0 | 24.978557 | 11.616721 | 0.166667 | 19.0 | 24.0 | 31.0 | 74.0 | 649.0 | 12.649753 | 10.969757 | 3.0305 | 7.15 | 8.01 | 15.0411 | 69.1100 |
As you have already seen, we can use indexing on the GroupBy to select single
or multiple columns. Here is a new GroupBy object, that contains only the
age
column:
by_class['age']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f583d90>
This can be useful when we want to restrict our attention to the age
column,
and we are not interested in other columns, such as fare
:
by_class['age'].mean()
class
1st 39.441584
2nd 30.193916
3rd 24.978557
Name: age, dtype: float64
We can get even more specific by passing keyword argument to the agg
method.
This allows us to say what aggregate columns we want back, and how to make
them. Here we ask for:
An aggregate column
mean_age
, the result of applyingnp.mean
to theage
column of the groups.Another aggregate column
median_fare
, the result of applyingnp.median
to thefare
column of the groups.
by_class.agg(mean_age=('age', np.mean),
median_fare=('fare', np.median))
/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
by_class.agg(mean_age=('age', np.mean),
/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function median at 0x7f8278270ea0> is currently using SeriesGroupBy.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
by_class.agg(mean_age=('age', np.mean),
mean_age | median_fare | |
---|---|---|
class | ||
1st | 39.441584 | 59.08 |
2nd | 30.193916 | 15.01 |
3rd | 24.978557 | 8.01 |
If you prefer the Pandas versions of mean
and median
, you can use the names
of the function, as a string, rather than the functions themselves:
# Mean using Pandas mean instead of Numpy mean.
# Pandas mean ignores NaN values, for example.
by_class.agg(mean_age=('age', 'mean'),
median_fare=('fare', 'median'))
mean_age | median_fare | |
---|---|---|
class | ||
1st | 39.441584 | 59.08 |
2nd | 30.193916 | 15.01 |
3rd | 24.978557 | 8.01 |
Grouping by more than one column#
GroupBy can group by more than one set of categories.
For example, imagine we wanted to break down the rows of the data frame by
both gender
and class
. That is, we want to form groups that are all the
unique combinations of gender
and class. These will be:
female
and1st
female
and2nd
female
and3rd
male
and1st
male
and2nd
male
and3rd
We can do this by giving groupby
a list of the columns we want to categorize
by:
by_gender_class = df.groupby(['gender', 'class'])
by_gender_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f588390>
len(by_gender_class)
6
As before, GroupBy knows about each individual subgroup. For example, here we
fetch the rows corresponding to gender female
and class 3rd
:
female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.0500 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.1300 | yes |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | Southampton | Sweden | 9.0906 | no |
14 | Aks, Mrs. Leah | female | 18.0 | 3rd | Southampton | England | 9.0700 | yes |
28 | Andersen-Jensen, Miss. Carla Christine Nielsine | female | 19.0 | 3rd | Southampton | Denmark | 7.1701 | yes |
Note that the group specification above has to be a tuple (surrounded by
()
) rather than a list (surrounded by []
).
As before, this gives us the same result as if we had done the selection the long way round:
female_df = df[df['gender'] == 'female']
female_3rd = female_df[female_df['class'] == '3rd']
female_3rd.head()
name | gender | age | class | embarked | country | fare | survived | |
---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 20.0500 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 7.1300 | yes |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | Southampton | Sweden | 9.0906 | no |
14 | Aks, Mrs. Leah | female | 18.0 | 3rd | Southampton | England | 9.0700 | yes |
28 | Andersen-Jensen, Miss. Carla Christine Nielsine | female | 19.0 | 3rd | Southampton | Denmark | 7.1701 | yes |
Our aggregation functions operate on all the subgroups:
by_gender_class[['age', 'fare']].mean()
age | fare | ||
---|---|---|---|
gender | class | ||
female | 1st | 37.015152 | 104.078811 |
2nd | 28.870915 | 23.021025 | |
3rd | 23.593800 | 14.752567 | |
male | 1st | 41.336785 | 68.774733 |
2nd | 31.032091 | 20.606753 | |
3rd | 25.627074 | 11.664950 |
Please see the Python Data Science Handbook chapter for much more detail.