Apply functions#
Functions are named recipes. We often find that we want to apply such a recipe to each value in a Pandas Series.
This is the job of the `apply` method of a Series.
To start, we load the familiar Ratings data, giving the average ratings on various measures for all rated professors teaching a subject, such as English, mathematics, and so on.
import numpy as np
import pandas as pd
# Safe setting for using Pandas.
pd.set_option('mode.copy_on_write', True)
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
There are 75 rows in this dataset, one row for each of 75 subjects:
n = len(ratings)
n
75
Now let us say that we are interested in the `Easiness` ratings:
easiness = ratings['Easiness']
easiness
0 3.162754
1 3.063322
2 2.710459
3 3.316210
4 3.053803
...
70 2.863504
71 3.106727
72 3.309636
73 2.799135
74 3.109118
Name: Easiness, Length: 75, dtype: float64
We decide we want to classify each subject into one of three groups:
“Easy” for courses that have `Easiness` scores above the 75% percentile for `Easiness`.
“Hard” for courses below the 25% percentile.
“Medium” for courses between the 25% and 75% percentiles.
You might first wonder how to get the percentiles. One way to do it is to use the `percentile` function from NumPy. For example, the median is the value at the 50% percentile, meaning that half the values are below the median and half are above (it’s a little more complicated than that, but that’s right to a first approximation).
Here’s the median:
np.median(easiness)
3.19430041152263
This is, by definition, the 50% percentile:
np.percentile(easiness, 50)
3.19430041152263
Here are the 25% and 75% percentiles:
easy_25 = np.percentile(easiness, 25)
print('25% percentile is', easy_25)
easy_75 = np.percentile(easiness, 75)
print('75% percentile is', easy_75)
25% percentile is 3.0283298724604153
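As a cross-check, Pandas Series also have their own `quantile` method, which takes a proportion between 0 and 1 rather than a percentage. The sketch below uses a small made-up Series (the values are illustrative, not taken from the Ratings data) to suggest that the two approaches agree when both use their default settings.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
values = pd.Series([2.7, 3.0, 3.1, 3.2, 3.4])

# NumPy takes a percentage between 0 and 100.
np_25 = np.percentile(values, 25)
# The Series method takes a proportion between 0 and 1.
pd_25 = values.quantile(0.25)
print(np_25, pd_25)

# The median and the 50% percentile should be very close to each other.
print(np.isclose(np.median(values), np.percentile(values, 50)))
```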
75% percentile is 3.34694063174731
We can then write a function to which we send a value (call the value `v`), and which returns the classification, using the `easy_25` and `easy_75` values.
Notice that the function can see the `easy_25` and `easy_75` values in the top-level workspace:
def classify_easy(v):
    if v < easy_25:
        return 'Hard'
    if v > easy_75:
        return 'Easy'
    return 'Medium'
We expect this function to return “Hard” for a value of 3 (it’s below the 25% percentile):
classify_easy(3)
'Hard'
It should return “Easy” for a value of 3.4 (it’s above the 75% percentile):
classify_easy(3.4)
'Easy'
An intermediate value should give “Medium”:
classify_easy(3.2)
'Medium'
Now let us imagine we want to apply this function to all the values in the `easiness` series.
We could do this laboriously, by making an array to store the values, like this:
classified = np.repeat(['Unknown'], n)
classified[:10]
array(['Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
'Unknown', 'Unknown', 'Unknown', 'Unknown'], dtype='<U7')
This is an array with 75 values of the string “Unknown”. We are going to replace the “Unknown” values with the classifications from calling `classify_easy` on each of the `easiness` scores in turn, like this:
# The long way to apply classify_easy to easiness.
for i in np.arange(n):
    # Get the easiness value at this position.
    value = easiness.iloc[i]
    # Call `classify_easy` on this value, put it into
    # the `classified` array.
    classified[i] = classify_easy(value)
classified[:10]
array(['Medium', 'Medium', 'Hard', 'Medium', 'Medium', 'Hard', 'Easy',
'Medium', 'Medium', 'Hard'], dtype='<U7')
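One thing to be aware of with the array approach is the `dtype='<U7'` in the output above: the array holds strings of at most 7 characters. That is fine for “Hard”, “Easy” and “Medium”, but a longer label would be cut short. The sketch below uses a made-up longer label to show this, along with one possible way round it (an object array).

```python
import numpy as np

# An array of 7-character strings, built the same way as in the text.
labels = np.repeat(['Unknown'], 3)
print(labels.dtype)

# 'Medium' has 6 characters, so it fits.
labels[0] = 'Medium'
# This made-up label has more than 7 characters; NumPy keeps only the first 7.
labels[1] = 'Very hard'
print(labels)

# One option: an object array, which can hold strings of any length.
flexible = np.repeat(['Unknown'], 3).astype(object)
flexible[1] = 'Very hard'
print(flexible)
```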
That is the long way to do that task, because Pandas Series have an `apply` method to do just that. We send the `apply` method the function we want to apply to each value in the Series; it calls the function on each value and returns the results as a new Series:
classified_series = easiness.apply(classify_easy)
classified_series
0 Medium
1 Medium
2 Hard
3 Medium
4 Medium
...
70 Hard
71 Medium
72 Medium
73 Hard
74 Medium
Name: Easiness, Length: 75, dtype: object
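The `classify_easy` function relies on `easy_25` and `easy_75` being defined in the top-level workspace. If you would rather pass the limits in explicitly, `apply` forwards any extra keyword arguments to the function. The following is a minimal sketch of that idea; the function name `classify_with_limits`, the scores and the limits are all made up for illustration.

```python
import pandas as pd

def classify_with_limits(v, low, high):
    """Classify v as 'Hard', 'Easy' or 'Medium', given the two limits."""
    if v < low:
        return 'Hard'
    if v > high:
        return 'Easy'
    return 'Medium'

# Made-up scores and limits, for illustration only.
scores = pd.Series([2.7, 3.1, 3.5])

# Extra keyword arguments to `apply` are passed on to the function.
groups = scores.apply(classify_with_limits, low=3.0, high=3.4)
print(groups)
```

This keeps the function independent of any particular workspace, at the cost of a slightly longer call.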
To see the results, we might make a new DataFrame to show the original `Easiness` scores and the classification side-by-side.
df = pd.DataFrame()
df['Easiness'] = easiness
df['Easiness group'] = classified_series
df
 | Easiness | Easiness group |
---|---|---|
0 | 3.162754 | Medium |
1 | 3.063322 | Medium |
2 | 2.710459 | Hard |
3 | 3.316210 | Medium |
4 | 3.053803 | Medium |
... | ... | ... |
70 | 2.863504 | Hard |
71 | 3.106727 | Medium |
72 | 3.309636 | Medium |
73 | 2.799135 | Hard |
74 | 3.109118 | Medium |
75 rows × 2 columns
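For a simple classification like this one, it is also possible to work on the whole Series at once, without calling a function on each value. The sketch below uses NumPy's `select` function on made-up scores and limits; it is one alternative among several, not a replacement for `apply`, which remains more flexible when the logic is harder to express with comparisons.

```python
import numpy as np
import pandas as pd

# Made-up scores and limits, for illustration only.
scores = pd.Series([2.7, 3.1, 3.5, 3.0])
low, high = 3.0, 3.4

# Each condition is a Boolean Series; for each value, the first
# condition that is True gives the label, otherwise we get the default.
groups = np.select([scores < low, scores > high],
                   ['Hard', 'Easy'],
                   default='Medium')
# np.select returns an array; put it back into a Series with the same index.
groups = pd.Series(groups, index=scores.index)
print(groups)
```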
`apply` and DataFrames#
In fact, DataFrames also have an `apply` method, which does a similar thing to the `apply` method of a Series, but on a DataFrame.
Remember, the `apply` method of a Series calls the supplied function on each value in the Series.
The `apply` method of a DataFrame calls the supplied function on each row or each column of the DataFrame.
You can specify whether you want to apply the function to each row, or to each column, with the `axis` keyword argument to the `apply` method.
This is easier to see in practice than to describe. We will practice DataFrame `.apply` on a famous dataset collected by Sir Francis Galton.
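Before turning to that dataset, a tiny made-up DataFrame can give a first feel for the two values of `axis`. In the sketch below, the DataFrame `small` and the function `biggest_minus_smallest` are invented for illustration; the function receives a Series each time, either one column or one row.

```python
import pandas as pd

# A tiny made-up DataFrame, for illustration only.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def biggest_minus_smallest(values):
    # `values` is a Series: one column (axis=0) or one row (axis=1).
    return values.max() - values.min()

# axis=0 (the default): the function gets each column in turn.
per_column = small.apply(biggest_minus_smallest, axis=0)
print(per_column)

# axis=1: the function gets each row in turn.
per_row = small.apply(biggest_minus_smallest, axis=1)
print(per_row)
```

With `axis=0` the result has one value per column, labeled by the column names; with `axis=1` it has one value per row, labeled by the row index.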
Galton’s dataset#
The data we will use relates to a famous paper by Francis Galton, published in 1886. Galton was a versatile scientist who laid the groundwork for early statistics, and particularly regression and correlation. The paper we are interested in here is:
Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute, 15, 246-263.
In fact, this paper is the origin of the term regression for fitting prediction lines to data.
Galton was a keen eugenicist, and was very interested in inheritance. In this case he studied the relationship of children’s heights to the heights of their parents.
Galton asked families to give him data about:
The father’s height
The mother’s height
The height and gender of each adult child in the family.
You can read more about the data files at the Galton heights datasets page.
The `galton_combined.csv` file has the data Galton used in his paper:
galton = pd.read_csv('galton_combined.csv')
galton.head()
 | family | father | mother | midparentHeight | children | childNum | gender | childHeight |
---|---|---|---|---|---|---|---|---|
0 | 001 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 |
1 | 001 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 |
2 | 001 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 |
3 | 001 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 |
4 | 002 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
Each row is one child. For each child we have their father’s height, their mother’s height, the child’s gender, and the child’s height, among other values.
All heights are in inches.
DataFrame `apply`#
Like Galton, we are interested in the heritability of height. For example, we may be interested in the difference between the height of the parents and the height of the children. To do this, we may want to subtract a parent’s height from the child’s height, to get a height difference.
One factor we have to take into account is that males are taller, on average, than females. A very crude way to adjust for this is to subtract the mother’s height from the height of the female children, and the father’s height from the height of the male children.
For example, here is the first row:
first_row = galton.iloc[0]
first_row
family 001
father 78.5
mother 67.0
midparentHeight 75.43
children 4
childNum 1
gender male
childHeight 73.2
Name: 0, dtype: object
This is a male child, so the difference we want is:
first_difference = first_row['childHeight'] - first_row['father']
first_difference
-5.299999999999997
Here is the second row:
second_row = galton.iloc[1]
second_row
family 001
father 78.5
mother 67.0
midparentHeight 75.43
children 4
childNum 2
gender female
childHeight 69.2
Name: 1, dtype: object
The difference we want for this female child is:
second_difference = second_row['childHeight'] - second_row['mother']
second_difference
2.200000000000003
To do this calculation for any given row, we could make a function that accepts a row as its argument, and does the calculation. It might look like this:
def apply_on_row(row):
    if row['gender'] == 'female':
        return row['childHeight'] - row['mother']
    elif row['gender'] == 'male':
        return row['childHeight'] - row['father']
    # If neither female nor male, we get here.
    return None
Here is that function applied to the first row. As expected, it gives the same value we calculated above:
apply_on_row(galton.iloc[0])
-5.299999999999997
Here is the function applied to the second row:
apply_on_row(galton.iloc[1])
2.200000000000003
We can apply this function to every row, returning a Series, by using the DataFrame `apply` method with `axis=1`. `axis=1` means apply the function to each value selected across the second axis, that is, across the columns. When we ask for a value across the columns, the function will get one row at a time; one row consists of all the column values for that row.
When we `.apply` the function, it returns a new Series, where each value is the result of applying the `apply_on_row` function to one row:
subtracted_height = galton.apply(apply_on_row, axis=1)
subtracted_height
0 -5.3
1 2.2
2 2.0
3 2.0
4 -2.0
...
929 2.0
930 -4.0
931 -5.0
932 4.0
933 -6.0
Length: 934, dtype: float64
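Row-wise `apply` calls the function once per row, which is easy to read but can become slow on large DataFrames. For a calculation as simple as this one, a whole-column version is possible. The sketch below uses a few made-up rows with the same column names as the Galton data; unlike `apply_on_row`, it assumes that `gender` is always either 'female' or 'male'.

```python
import numpy as np
import pandas as pd

# A few made-up rows with the same column names as the Galton data.
heights = pd.DataFrame({
    'father': [70.0, 70.0, 68.0],
    'mother': [65.0, 65.0, 63.0],
    'gender': ['male', 'female', 'male'],
    'childHeight': [71.0, 64.0, 66.5],
})

# Choose the mother's height for female children, the father's otherwise.
# Assumption: there are no other values in the gender column.
parent = np.where(heights['gender'] == 'female',
                  heights['mother'], heights['father'])
diff = heights['childHeight'] - parent
print(diff)
```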
We could put this calculation back into the original DataFrame as a new column:
galton['height_diff'] = subtracted_height
galton.head()
 | family | father | mother | midparentHeight | children | childNum | gender | childHeight | height_diff |
---|---|---|---|---|---|---|---|---|---|
0 | 001 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 | -5.3 |
1 | 001 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 | 2.2 |
2 | 001 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 | 2.0 |
3 | 001 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 | 2.0 |
4 | 002 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 | -2.0 |