The power of groupby#

Hide code cell content
import numpy as np
import pandas as pd
# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

This page follows from the Crosstab page.

As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()
name gender age class embarked country fare survived
0 Abbing, Mr. Anthony male 42.0 3rd Southampton United States 7.11 no
1 Abbott, Mr. Eugene Joseph male 13.0 3rd Southampton United States 20.05 no
2 Abbott, Mr. Rossmore Edward male 16.0 3rd Southampton United States 20.05 no
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.05 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.13 yes

In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.

Pandas also has a very powerful mechanism for grouping observations by category, available by the Pandas data frame groupby method.

The transformations available with groupby are far more general than those in pd.crosstab. Among many other things, groupby allows us to calculate many different summaries of rows in different categories.

For much more detail on groupby, see the Python Data Science Handbook chapter, by Jake Vanderplas.

Just for example, imagine we were interested in the average age of the passengers in each class.

Groups and GroupBy#

We could start by making a Pandas groupby object, like this:

by_class = df.groupby('class')
by_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f5481d0>

The GroupBy object has categorized the rows according to the values in the class column. Therefore, the GroupBy object has three groups, one for 1st, one of 2nd and one for 3rd.

We can see how many groups we have with Python’s len function:

# The number of groups.
len(by_class)
3

If the fancy takes us, we can fetch the rows corresponding to the groups with the get_group method.

first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()
name gender age class embarked country fare survived
20 Allen, Miss. Elisabeth Walton female 29.000000 1st Southampton United States 211.6009 yes
22 Allison, Master. Hudson Trevor male 0.916667 1st Southampton Canada 151.1600 yes
23 Allison, Miss. Helen Loraine female 2.000000 1st Southampton Canada 151.1600 no
24 Allison, Mr. Hudson Joshua Creighton male 30.000000 1st Southampton Canada 151.1600 no
25 Allison, Mrs. Bessie Waldo female 25.000000 1st Southampton United States 151.1600 no

This is exactly the same result we would get by selecting the first class passenger rows directly:

first_again = df[df['class'] == '1st']
first_again.head()
name gender age class embarked country fare survived
20 Allen, Miss. Elisabeth Walton female 29.000000 1st Southampton United States 211.6009 yes
22 Allison, Master. Hudson Trevor male 0.916667 1st Southampton Canada 151.1600 yes
23 Allison, Miss. Helen Loraine female 2.000000 1st Southampton Canada 151.1600 no
24 Allison, Mr. Hudson Joshua Creighton male 30.000000 1st Southampton Canada 151.1600 no
25 Allison, Mrs. Bessie Waldo female 25.000000 1st Southampton United States 151.1600 no

Indexing with column labels#

As for DataFrames, you can index a GroupBy with column labels, to select the given column(s) to work on later. As for DataFrames, this is direct indexing with column labels (DICL). For example, here we index the GroupBy to select the numerical columns age and fare:

by_class_fa = by_class[['age', 'fare']]
by_class_fa
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8278d2f3d0>

Notice that the GroupBy now only contains the age and fare columns.

by_class_fa.get_group('1st')
age fare
20 29.000000 211.6009
22 0.916667 151.1600
23 2.000000 151.1600
24 30.000000 151.1600
25 25.000000 151.1600
... ... ...
1191 43.000000 35.0000
1194 21.000000 61.0707
1199 46.000000 35.1000
1201 62.000000 26.1100
1204 36.000000 135.1208

301 rows × 2 columns

You can also use index with a single column label, so the GroupBy only has a single column:

by_class_age = by_class['age']
by_class_age
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f55ce10>

Notice that the type of thing that came back has changed from a DataFrame GroupBy to a Series GroupBy, reflecting the fact we are working on a single column.

Aggregating#

The true power of GroupBy is in doing efficient operations on the groups, and returning the results.

Just for example, imagine that we wanted to get the mean age and fare for each class.

Here is how we could do that for the whole DataFrame:

df[['age', 'fare']].mean()
age     29.698269
fare    32.346953
dtype: float64

We could do this for each class by going the long way round For each class we identify the matching rows, and columns of interest, then take their mean.

print('1st mean', df.loc[df['class'] == '1st', ['age', 'fare']].mean())
print('2nd mean', df.loc[df['class'] == '2nd', ['age', 'fare']].mean())
print('3rd mean', df.loc[df['class'] == '3rd', ['age', 'fare']].mean())
1st mean age     39.441584
fare    84.256920
dtype: float64
2nd mean age     30.193916
fare    21.543087
dtype: float64
3rd mean age     24.978557
fare    12.649753
dtype: float64

Notice that we have only asked the mean function to work on the numerical columns - it would not make sense to ask for a mean for name.

The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.

df['country'].value_counts()
country
England                  323
United States            257
Ireland                  119
Sweden                   104
Lebanon                   70
Finland                   54
Canada                    33
Norway                    26
France                    23
Belgium                   22
Bulgaria                  17
Switzerland               17
Scotland                  16
Croatia (Modern)          12
Croatia                   11
Channel Islands           10
Spain                      8
Wales                      8
Hungary                    7
Denmark                    7
India                      7
Argentina                  7
Turkey                     6
South Africa               6
Germany                    5
Italy                      4
Bosnia                     4
Slovenia                   4
Poland                     3
Austria                    2
Russia                     2
Greece                     2
Siam                       2
Peru                       2
Australia                  2
Uruguay                    2
Syria                      1
Japan                      1
Egypt                      1
China/Hong Kong            1
Latvia                     1
Yugoslavia                 1
Slovakia (Modern day)      1
Cuba                       1
Mexico                     1
Name: count, dtype: int64

To your complete lack of surprise, enter GroupBy, to our rescue.

We can ask the GroupBy object to aggregate over the rows in each group. In this case we want to use the mean method to give us an aggregate value for the rows in each group.

As before, we only want to take a mean on the numerical columns fare and age.

Here is one way to do that:

# Select fare and age columns, apply np.mean aggregate function.
by_class_fa = by_class[['fare', 'age']]
class_means = by_class_fa.agg(np.mean)
class_means
/tmp/ipykernel_5825/2708520383.py:3: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  class_means = by_class_fa.agg(np.mean)
fare age
class
1st 84.256920 39.441584
2nd 21.543087 30.193916
3rd 12.649753 24.978557

Notice the benefits over doing this by hand, above. We have to do less typing. We don’t have to know the unique values of class beforehand. We get a Pandas dataframe back, that displays nicely, and that we can index to get the values we want:

# Label indexing of the groupby results table.
class_means.loc['1st', 'age']
39.441583610188374

In fact, through some particularly obscure and clever Python magic, we can use mean and other Pandas data frame methods as methods of the GroupBy object, like this:

# Using mean function directly.
by_class_fa.mean()
fare age
class
1st 84.256920 39.441584
2nd 21.543087 30.193916
3rd 12.649753 24.978557
# Using median function directly.
by_class_fa.median()
fare age
class
1st 59.08 39.0
2nd 15.01 29.0
3rd 8.01 24.0
# Using sum function directly.
by_class_fa.sum()
fare age
class
1st 25361.3328 11871.916667
2nd 5665.8318 7941.000000
3rd 8209.6894 16211.083333

The function we are using can return more than one value, in which case, we get a new column for each value. For example, the describe method returns various aggregate values for each column. Here’s describe in action on the main data frame:

df.describe()
age fare
count 1213.000000 1213.000000
mean 29.698269 32.346953
std 14.036480 48.214773
min 0.166667 3.030500
25% 21.000000 7.180600
50% 28.000000 14.100000
75% 38.000000 31.050600
max 74.000000 512.060700

When we apply describe to the groups, we get a new column for each calculated value that describe returns.

by_class.describe()
age fare
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
class
1st 301.0 39.441584 13.881000 0.916667 29.0 39.0 49.0 71.0 301.0 84.256920 72.979218 5.0000 31.00 59.08 93.1000 512.0607
2nd 263.0 30.193916 13.875730 0.583333 22.0 29.0 38.0 72.0 263.0 21.543087 13.555193 9.1309 13.00 15.01 26.0000 73.1000
3rd 649.0 24.978557 11.616721 0.166667 19.0 24.0 31.0 74.0 649.0 12.649753 10.969757 3.0305 7.15 8.01 15.0411 69.1100

As you have already seen, we can use indexing on the GroupBy to select single or multiple columns. Here is a new GroupBy object, that contains only the age column:

by_class['age']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f583d90>

This can be useful when we want to restrict our attention to the age column, and we are not interested in other columns, such as fare:

by_class['age'].mean()
class
1st    39.441584
2nd    30.193916
3rd    24.978557
Name: age, dtype: float64

We can get even more specific by passing keyword argument to the agg method. This allows us to say what aggregate columns we want back, and how to make them. Here we ask for:

  • An aggregate column mean_age, the result of applying np.mean to the age column of the groups.

  • Another aggregate column median_fare, the result of applying np.median to the fare column of the groups.

by_class.agg(mean_age=('age', np.mean),
             median_fare=('fare', np.median))
/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  by_class.agg(mean_age=('age', np.mean),
/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function median at 0x7f8278270ea0> is currently using SeriesGroupBy.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
  by_class.agg(mean_age=('age', np.mean),
mean_age median_fare
class
1st 39.441584 59.08
2nd 30.193916 15.01
3rd 24.978557 8.01

If you prefer the Pandas versions of mean and median, you can use the names of the function, as a string, rather than the functions themselves:

# Mean using Pandas mean instead of Numpy mean.
# Pandas mean ignores NaN values, for example.
by_class.agg(mean_age=('age', 'mean'),
             median_fare=('fare', 'median'))
mean_age median_fare
class
1st 39.441584 59.08
2nd 30.193916 15.01
3rd 24.978557 8.01

Grouping by more than one column#

GroupBy can group by more than one set of categories.

For example, imagine we wanted to break down the rows of the data frame by both gender and class. That is, we want to form groups that are all the unique combinations of gender and class. These will be:

  • female and 1st

  • female and 2nd

  • female and 3rd

  • male and 1st

  • male and 2nd

  • male and 3rd

We can do this by giving groupby a list of the columns we want to categorize by:

by_gender_class = df.groupby(['gender', 'class'])
by_gender_class
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f588390>
len(by_gender_class)
6

As before, GroupBy knows about each individual subgroup. For example, here we fetch the rows corresponding to gender female and class 3rd:

female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()
name gender age class embarked country fare survived
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.0500 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.1300 yes
12 Ahlin, Mrs. Johanna Persdotter female 40.0 3rd Southampton Sweden 9.0906 no
14 Aks, Mrs. Leah female 18.0 3rd Southampton England 9.0700 yes
28 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 3rd Southampton Denmark 7.1701 yes

Note that the group specification above has to be a tuple (surrounded by ()) rather than a list (surrounded by []).

As before, this gives us the same result as if we had done the selection the long way round:

female_df = df[df['gender'] == 'female']
female_3rd = female_df[female_df['class'] == '3rd']
female_3rd.head()
name gender age class embarked country fare survived
3 Abbott, Mrs. Rhoda Mary 'Rosa' female 39.0 3rd Southampton England 20.0500 yes
4 Abelseth, Miss. Karen Marie female 16.0 3rd Southampton Norway 7.1300 yes
12 Ahlin, Mrs. Johanna Persdotter female 40.0 3rd Southampton Sweden 9.0906 no
14 Aks, Mrs. Leah female 18.0 3rd Southampton England 9.0700 yes
28 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 3rd Southampton Denmark 7.1701 yes

Our aggregation functions operate on all the subgroups:

by_gender_class[['age', 'fare']].mean()
age fare
gender class
female 1st 37.015152 104.078811
2nd 28.870915 23.021025
3rd 23.593800 14.752567
male 1st 41.336785 68.774733
2nd 31.032091 20.606753
3rd 25.627074 11.664950

Please see the Python Data Science Handbook chapter for much more detail.