The power of groupby

The power of groupby#

This page follows from the Crosstab page.

As for the Crosstab page, we will use a dataset with data about passengers on the RMS Titanic.

See the Titanic dataset page for some more information about this dataset, and a link to download the data.

# Load the dataset
df = pd.read_csv('titanic_clean.csv')
df.head()

	name	gender	age	class	embarked	country	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	Southampton	United States	7.11	no
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	Southampton	United States	20.05	no
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	Southampton	United States	20.05	no
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.05	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.13	yes

In the Crosstab page, we calculated counts and proportions of rows falling into categories given by Pandas Series. These Series are often columns from a Pandas data frame.

Pandas also has a very powerful mechanism for grouping observations by category, available by the Pandas data frame groupby method.

The transformations available with groupby are far more general than those in pd.crosstab. Among many other things, groupby allows us to calculate many different summaries of rows in different categories.

For much more detail on groupby, see the Python Data Science Handbook chapter, by Jake Vanderplas.

Just for example, imagine we were interested in the average age of the passengers in each class.

Groups and GroupBy#

We could start by making a Pandas groupby object, like this:

by_class = df.groupby('class')
by_class

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f5481d0>

The GroupBy object has categorized the rows according to the values in the class column. Therefore, the GroupBy object has three groups, one for 1st, one of 2nd and one for 3rd.

We can see how many groups we have with Python’s len function:

# The number of groups.
len(by_class)

If the fancy takes us, we can fetch the rows corresponding to the groups with the get_group method.

first_class_passengers = by_class.get_group('1st')
first_class_passengers.head()

	name	gender	age	class	embarked	country	fare	survived
20	Allen, Miss. Elisabeth Walton	female	29.000000	1st	Southampton	United States	211.6009	yes
22	Allison, Master. Hudson Trevor	male	0.916667	1st	Southampton	Canada	151.1600	yes
23	Allison, Miss. Helen Loraine	female	2.000000	1st	Southampton	Canada	151.1600	no
24	Allison, Mr. Hudson Joshua Creighton	male	30.000000	1st	Southampton	Canada	151.1600	no
25	Allison, Mrs. Bessie Waldo	female	25.000000	1st	Southampton	United States	151.1600	no

This is exactly the same result we would get by selecting the first class passenger rows directly:

first_again = df[df['class'] == '1st']
first_again.head()

	name	gender	age	class	embarked	country	fare	survived
20	Allen, Miss. Elisabeth Walton	female	29.000000	1st	Southampton	United States	211.6009	yes
22	Allison, Master. Hudson Trevor	male	0.916667	1st	Southampton	Canada	151.1600	yes
23	Allison, Miss. Helen Loraine	female	2.000000	1st	Southampton	Canada	151.1600	no
24	Allison, Mr. Hudson Joshua Creighton	male	30.000000	1st	Southampton	Canada	151.1600	no
25	Allison, Mrs. Bessie Waldo	female	25.000000	1st	Southampton	United States	151.1600	no

Indexing with column labels#

As for DataFrames, you can index a GroupBy with column labels, to select the given column(s) to work on later. As for DataFrames, this is direct indexing with column labels (DICL). For example, here we index the GroupBy to select the numerical columns age and fare:

by_class_fa = by_class[['age', 'fare']]
by_class_fa

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8278d2f3d0>

Notice that the GroupBy now only contains the age and fare columns.

by_class_fa.get_group('1st')

	age	fare
20	29.000000	211.6009
22	0.916667	151.1600
23	2.000000	151.1600
24	30.000000	151.1600
25	25.000000	151.1600
...	...	...
1191	43.000000	35.0000
1194	21.000000	61.0707
1199	46.000000	35.1000
1201	62.000000	26.1100
1204	36.000000	135.1208

301 rows × 2 columns

You can also use index with a single column label, so the GroupBy only has a single column:

by_class_age = by_class['age']
by_class_age

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f55ce10>

Notice that the type of thing that came back has changed from a DataFrame GroupBy to a Series GroupBy, reflecting the fact we are working on a single column.

Aggregating#

The true power of GroupBy is in doing efficient operations on the groups, and returning the results.

Just for example, imagine that we wanted to get the mean age and fare for each class.

Here is how we could do that for the whole DataFrame:

df[['age', 'fare']].mean()

age     29.698269
fare    32.346953
dtype: float64

We could do this for each class by going the long way round For each class we identify the matching rows, and columns of interest, then take their mean.

print('1st mean', df.loc[df['class'] == '1st', ['age', 'fare']].mean())
print('2nd mean', df.loc[df['class'] == '2nd', ['age', 'fare']].mean())
print('3rd mean', df.loc[df['class'] == '3rd', ['age', 'fare']].mean())

1st mean age     39.441584
fare    84.256920
dtype: float64
2nd mean age     30.193916
fare    21.543087
dtype: float64
3rd mean age     24.978557
fare    12.649753
dtype: float64

Notice that we have only asked the mean function to work on the numerical columns - it would not make sense to ask for a mean for name.

The code above is a repetitive and ugly - but it could have been much worse. Imagine, for example, that we had been interested in the mean age for passengers for each country. There are lots of countries; it would not be practical to type these all out by hand as we have above for the classes.

df['country'].value_counts()

country
England                  323
United States            257
Ireland                  119
Sweden                   104
Lebanon                   70
Finland                   54
Canada                    33
Norway                    26
France                    23
Belgium                   22
Bulgaria                  17
Switzerland               17
Scotland                  16
Croatia (Modern)          12
Croatia                   11
Channel Islands           10
Spain                      8
Wales                      8
Hungary                    7
Denmark                    7
India                      7
Argentina                  7
Turkey                     6
South Africa               6
Germany                    5
Italy                      4
Bosnia                     4
Slovenia                   4
Poland                     3
Austria                    2
Russia                     2
Greece                     2
Siam                       2
Peru                       2
Australia                  2
Uruguay                    2
Syria                      1
Japan                      1
Egypt                      1
China/Hong Kong            1
Latvia                     1
Yugoslavia                 1
Slovakia (Modern day)      1
Cuba                       1
Mexico                     1
Name: count, dtype: int64

To your complete lack of surprise, enter GroupBy, to our rescue.

We can ask the GroupBy object to aggregate over the rows in each group. In this case we want to use the mean method to give us an aggregate value for the rows in each group.

As before, we only want to take a mean on the numerical columns fare and age.

Here is one way to do that:

# Select fare and age columns, apply np.mean aggregate function.
by_class_fa = by_class[['fare', 'age']]
class_means = by_class_fa.agg(np.mean)
class_means

/tmp/ipykernel_5825/2708520383.py:3: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  class_means = by_class_fa.agg(np.mean)

	fare	age
class
1st	84.256920	39.441584
2nd	21.543087	30.193916
3rd	12.649753	24.978557

Notice the benefits over doing this by hand, above. We have to do less typing. We don’t have to know the unique values of class beforehand. We get a Pandas dataframe back, that displays nicely, and that we can index to get the values we want:

# Label indexing of the groupby results table.
class_means.loc['1st', 'age']

39.441583610188374

In fact, through some particularly obscure and clever Python magic, we can use mean and other Pandas data frame methods as methods of the GroupBy object, like this:

# Using mean function directly.
by_class_fa.mean()

	fare	age
class
1st	84.256920	39.441584
2nd	21.543087	30.193916
3rd	12.649753	24.978557

# Using median function directly.
by_class_fa.median()

	fare	age
class
1st	59.08	39.0
2nd	15.01	29.0
3rd	8.01	24.0

# Using sum function directly.
by_class_fa.sum()

	fare	age
class
1st	25361.3328	11871.916667
2nd	5665.8318	7941.000000
3rd	8209.6894	16211.083333

The function we are using can return more than one value, in which case, we get a new column for each value. For example, the describe method returns various aggregate values for each column. Here’s describe in action on the main data frame:

df.describe()

	age	fare
count	1213.000000	1213.000000
mean	29.698269	32.346953
std	14.036480	48.214773
min	0.166667	3.030500
25%	21.000000	7.180600
50%	28.000000	14.100000
75%	38.000000	31.050600
max	74.000000	512.060700

When we apply describe to the groups, we get a new column for each calculated value that describe returns.

by_class.describe()

	age								fare
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
class
1st	301.0	39.441584	13.881000	0.916667	29.0	39.0	49.0	71.0	301.0	84.256920	72.979218	5.0000	31.00	59.08	93.1000	512.0607
2nd	263.0	30.193916	13.875730	0.583333	22.0	29.0	38.0	72.0	263.0	21.543087	13.555193	9.1309	13.00	15.01	26.0000	73.1000
3rd	649.0	24.978557	11.616721	0.166667	19.0	24.0	31.0	74.0	649.0	12.649753	10.969757	3.0305	7.15	8.01	15.0411	69.1100

As you have already seen, we can use indexing on the GroupBy to select single or multiple columns. Here is a new GroupBy object, that contains only the age column:

by_class['age']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f823f583d90>

This can be useful when we want to restrict our attention to the age column, and we are not interested in other columns, such as fare:

by_class['age'].mean()

class
1st    39.441584
2nd    30.193916
3rd    24.978557
Name: age, dtype: float64

We can get even more specific by passing keyword argument to the agg method. This allows us to say what aggregate columns we want back, and how to make them. Here we ask for:

An aggregate column mean_age, the result of applying np.mean to the age column of the groups.
Another aggregate column median_fare, the result of applying np.median to the fare column of the groups.

by_class.agg(mean_age=('age', np.mean),
             median_fare=('fare', np.median))

/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function mean at 0x7f8278334720> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  by_class.agg(mean_age=('age', np.mean),
/tmp/ipykernel_5825/3895194694.py:1: FutureWarning: The provided callable <function median at 0x7f8278270ea0> is currently using SeriesGroupBy.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
  by_class.agg(mean_age=('age', np.mean),

	mean_age	median_fare
class
1st	39.441584	59.08
2nd	30.193916	15.01
3rd	24.978557	8.01

If you prefer the Pandas versions of mean and median, you can use the names of the function, as a string, rather than the functions themselves:

# Mean using Pandas mean instead of Numpy mean.
# Pandas mean ignores NaN values, for example.
by_class.agg(mean_age=('age', 'mean'),
             median_fare=('fare', 'median'))

	mean_age	median_fare
class
1st	39.441584	59.08
2nd	30.193916	15.01
3rd	24.978557	8.01

Grouping by more than one column#

GroupBy can group by more than one set of categories.

For example, imagine we wanted to break down the rows of the data frame by both gender and class. That is, we want to form groups that are all the unique combinations of gender and class. These will be:

female and 1st
female and 2nd
female and 3rd
male and 1st
male and 2nd
male and 3rd

We can do this by giving groupby a list of the columns we want to categorize by:

by_gender_class = df.groupby(['gender', 'class'])
by_gender_class

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f823f588390>

len(by_gender_class)

As before, GroupBy knows about each individual subgroup. For example, here we fetch the rows corresponding to gender female and class 3rd:

female_third = by_gender_class.get_group(('female', '3rd'))
female_third.head()

	name	gender	age	class	embarked	country	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.0500	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.1300	yes
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	Southampton	Sweden	9.0906	no
14	Aks, Mrs. Leah	female	18.0	3rd	Southampton	England	9.0700	yes
28	Andersen-Jensen, Miss. Carla Christine Nielsine	female	19.0	3rd	Southampton	Denmark	7.1701	yes

Note that the group specification above has to be a tuple (surrounded by ()) rather than a list (surrounded by []).

As before, this gives us the same result as if we had done the selection the long way round:

female_df = df[df['gender'] == 'female']
female_3rd = female_df[female_df['class'] == '3rd']
female_3rd.head()

	name	gender	age	class	embarked	country	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	20.0500	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	7.1300	yes
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	Southampton	Sweden	9.0906	no
14	Aks, Mrs. Leah	female	18.0	3rd	Southampton	England	9.0700	yes
28	Andersen-Jensen, Miss. Carla Christine Nielsine	female	19.0	3rd	Southampton	Denmark	7.1701	yes

Our aggregation functions operate on all the subgroups:

by_gender_class[['age', 'fare']].mean()

		age	fare
gender	class
female	1st	37.015152	104.078811
	2nd	28.870915	23.021025
	3rd	23.593800	14.752567
male	1st	41.336785	68.774733
	2nd	31.032091	20.606753
	3rd	25.627074	11.664950

Please see the Python Data Science Handbook chapter for much more detail.