Apply functions

Functions are named recipes. We often find that we want to apply such a recipe to each value in a Pandas Series.

This is the job of the apply method of a Series.

To start, we load the familiar Ratings data, giving the average ratings on various measures for all rated professors teaching a subject, such as English, mathematics, and so on.

import numpy as np
import pandas as pd

# Safe setting for using Pandas.
pd.set_option('mode.copy_on_write', True)
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
    Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
0      English                 23343  3.756147     3.821866         3.791364  3.162754
1  Mathematics                 22394  3.487379     3.641526         3.566867  3.063322
2      Biology                 11774  3.608331     3.701530         3.657641  2.710459
3   Psychology                 11179  3.909520     3.887536         3.900949  3.316210
4      History                 11145  3.788818     3.753642         3.773746  3.053803

There are 75 rows in this dataset, one row for each of 75 subjects:

n = len(ratings)
n
75

Now let us say that we are interested in the Easiness ratings:

easiness = ratings['Easiness']
easiness
0     3.162754
1     3.063322
2     2.710459
3     3.316210
4     3.053803
        ...   
70    2.863504
71    3.106727
72    3.309636
73    2.799135
74    3.109118
Name: Easiness, Length: 75, dtype: float64

We decide we want to classify each subject into one of three groups:

  • “Easy” for courses that have Easiness scores above the 75% percentile for Easiness.

  • “Hard” for courses below the 25% percentile.

  • “Medium” for courses between the 25% and 75% percentiles.

You might first wonder how to get the percentiles. One way is to use the percentile function from NumPy. For example, the median is the value at the 50% percentile, meaning that half the values are below the median and half are above (it’s a little more complicated than that, but that’s right as a first pass).

Here’s the median:

np.median(easiness)
3.19430041152263

This is, by definition, the 50% percentile:

np.percentile(easiness, 50)
3.19430041152263

Here are the 25% and 75% percentiles:

easy_25 = np.percentile(easiness, 25)
print('25% percentile is', easy_25)
easy_75 = np.percentile(easiness, 75)
print('75% percentile is', easy_75)
25% percentile is 3.0283298724604153
75% percentile is 3.34694063174731
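
As a cross-check, Pandas Series have a quantile method that should give the same answers as np.percentile, because both use linear interpolation by default. The sketch below uses a small made-up Series (not the real Easiness values); notice that np.percentile takes values on a 0 to 100 scale, whereas quantile takes proportions between 0 and 1.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; these are not the real Easiness values.
scores = pd.Series([2.5, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.6])

# NumPy takes percentiles on a 0 to 100 scale.
np_25 = np.percentile(scores, 25)
np_75 = np.percentile(scores, 75)

# The Pandas `quantile` method takes proportions on a 0 to 1 scale.
pd_25 = scores.quantile(0.25)
pd_75 = scores.quantile(0.75)

print('NumPy: ', np_25, np_75)
print('Pandas:', pd_25, pd_75)
```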

We can then write a function that accepts a value (call the value v) and returns the classification, using the easy_25 and easy_75 values.

Notice that the function can see the easy_25 and easy_75 values in the top-level workspace:

def classify_easy(v):
    if v < easy_25:
        return 'Hard'
    if v > easy_75:
        return 'Easy'
    return 'Medium'

We expect this function to return “Hard” for a value of 3 (it’s below the 25% percentile):

classify_easy(3)
'Hard'

It should return “Easy” for a value of 3.4 (it’s above the 75% percentile):

classify_easy(3.4)
'Easy'

An intermediate value should give “Medium”:

classify_easy(3.2)
'Medium'

Now let us imagine we want to apply this function to all the values in the easiness series.

We could do this laboriously, by making an array to store the values, like this:

classified = np.repeat(['Unknown'], n)
classified[:10]
array(['Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown'], dtype='<U7')

This is an array with 75 values of the string “Unknown”. We are going to replace the “Unknown” values with the classifications from calling classify_easy on each of the easiness scores in turn, like this:

# The long way to apply classify_easy to easiness.
for i in np.arange(n):
    # Get the easiness value at this position.
    value = easiness.iloc[i]
    # Call `classify_easy` on this value, put it into
    # the `classified` array.
    classified[i] = classify_easy(value)
classified[:10]
array(['Medium', 'Medium', 'Hard', 'Medium', 'Medium', 'Hard', 'Easy',
       'Medium', 'Medium', 'Hard'], dtype='<U7')

That is the long way to do the task. Pandas Series have an apply method that does exactly this. We pass the apply method the function we want to call on each value in the Series; it calls the function on each value, and returns the results as a new Series:

classified_series = easiness.apply(classify_easy)
classified_series
0     Medium
1     Medium
2       Hard
3     Medium
4     Medium
       ...  
70      Hard
71    Medium
72    Medium
73      Hard
74    Medium
Name: Easiness, Length: 75, dtype: object
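
The classify_easy function reads easy_25 and easy_75 from the top-level workspace. If we prefer to pass the thresholds in explicitly, one option is to give the function extra parameters, because apply passes any extra keyword arguments on to the function. The following is a minimal sketch on made-up scores; the function name classify_with_limits and the parameter names low and high are our own inventions for illustration.

```python
import numpy as np
import pandas as pd

def classify_with_limits(v, low, high):
    # `low` and `high` are passed in as arguments, rather than
    # read from the top-level workspace.
    if v < low:
        return 'Hard'
    if v > high:
        return 'Easy'
    return 'Medium'

# Made-up scores for illustration; these are not the real Easiness values.
scores = pd.Series([2.7, 2.9, 3.1, 3.2, 3.4, 3.6])
low = np.percentile(scores, 25)
high = np.percentile(scores, 75)

# Extra keyword arguments to `apply` are passed on to the function.
groups = scores.apply(classify_with_limits, low=low, high=high)
print(groups)
```

This makes the function easier to reuse on another Series with different thresholds.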

To see the results, we might make a new DataFrame to show the original Easiness scores and the classification side-by-side.

df = pd.DataFrame()
df['Easiness'] = easiness
df['Easiness group'] = classified_series
df
    Easiness  Easiness group
0   3.162754          Medium
1   3.063322          Medium
2   2.710459            Hard
3   3.316210          Medium
4   3.053803          Medium
..       ...             ...
70  2.863504            Hard
71  3.106727          Medium
72  3.309636          Medium
73  2.799135            Hard
74  3.109118          Medium

75 rows × 2 columns
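
For a simple classification like this one, we could also get the same groups without calling a function on each value, by using Boolean indexing. This is a sketch of one alternative on made-up scores, not a replacement for apply; apply remains the more general tool when the recipe is too complicated to express this way.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; these are not the real Easiness values.
scores = pd.Series([2.7, 2.9, 3.1, 3.2, 3.4, 3.6])
low = np.percentile(scores, 25)
high = np.percentile(scores, 75)

# Start with every value classified as 'Medium'.
groups = pd.Series('Medium', index=scores.index)
# Overwrite the classification where the score is below or above
# the thresholds.
groups[scores < low] = 'Hard'
groups[scores > high] = 'Easy'
print(groups)
```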

apply and DataFrames

In fact, DataFrames also have an apply method that does a similar thing to the apply method of a Series, but on a DataFrame.

Remember, the apply method of the Series calls the supplied function on each value in the Series.

The apply method of a DataFrame calls the supplied function on each row or each column of the DataFrame.

You can specify whether you want to apply the function to each row, or to each column, with the axis keyword argument to the apply method.

This is easier to see in practice than to describe. We will practice DataFrame .apply on a famous dataset collected by Sir Francis Galton.
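
Before turning to the Galton data, here is a minimal sketch of the axis argument on a tiny made-up DataFrame. The DataFrame and the value_range function are invented for illustration. With axis=0, the function gets one column at a time; with axis=1, it gets one row at a time. In both cases, the function receives a Series.

```python
import pandas as pd

# A tiny made-up DataFrame for illustration.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def value_range(values):
    # `values` is a Series: either one column or one row of `df`.
    return values.max() - values.min()

# axis=0: call the function once for each column.
per_column = df.apply(value_range, axis=0)
print(per_column)

# axis=1: call the function once for each row.
per_row = df.apply(value_range, axis=1)
print(per_row)
```

The first result has one value per column (labeled a and b); the second has one value per row (labeled 0, 1 and 2).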

Galton’s dataset

The data we will use relates to a famous paper by Francis Galton, published in 1886. Galton was a versatile scientist who laid the groundwork for early statistics, and particularly regression and correlation. The paper we are interested in here is:

Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute, 15, 246-263.

In fact, this paper is the origin of the term regression for fitting prediction lines to data.

Galton was a keen eugenicist, and was very interested in inheritance. In this case he studied the relationship of children’s heights to the heights of their parents.

Galton asked families to give him data about:

  • The father’s height

  • The mother’s height

  • The height and gender of each adult child in the family.

You can read more about the data files at the Galton heights datasets page.

The galton_combined.csv file has the data Galton used in his paper:

galton = pd.read_csv('galton_combined.csv')
galton.head()
   family  father  mother  midparentHeight  children  childNum  gender  childHeight
0     001    78.5    67.0            75.43         4         1    male         73.2
1     001    78.5    67.0            75.43         4         2  female         69.2
2     001    78.5    67.0            75.43         4         3  female         69.0
3     001    78.5    67.0            75.43         4         4  female         69.0
4     002    75.5    66.5            73.66         4         1    male         73.5

Each row is one child. For each child we have their father’s height, their mother’s height, the child’s gender, and the child’s height, among other values.

All heights are in inches.

DataFrame apply

Like Galton, we are interested in the heritability of height. For example, we may be interested in the difference between the height of the parents and the height of the children. To find it, we can subtract a parent’s height from the child’s height, to get a height difference.

One factor we have to take into account is that males are taller, on average, than females. A very crude way to adjust for this is to subtract the mother’s height from the height of the female children, and the father’s height from the height of the male children.

For example, here is the first row:

first_row = galton.iloc[0]
first_row
family               001
father              78.5
mother              67.0
midparentHeight    75.43
children               4
childNum               1
gender              male
childHeight         73.2
Name: 0, dtype: object

This is a male child, so the difference we want is:

first_difference = first_row['childHeight'] - first_row['father']
first_difference
-5.299999999999997

Here is the second row:

second_row = galton.iloc[1]
second_row
family                001
father               78.5
mother               67.0
midparentHeight     75.43
children                4
childNum                2
gender             female
childHeight          69.2
Name: 1, dtype: object

The difference we want for this female child is:

second_difference = second_row['childHeight'] - second_row['mother']
second_difference
2.200000000000003

To do this calculation for any given row, we could make a function that accepts a row as its argument, and does the calculation. It might look like this:

def apply_on_row(row):
    if row['gender'] == 'female':
        return row['childHeight'] - row['mother']
    elif row['gender'] == 'male':
        return row['childHeight'] - row['father']
    # If neither female nor male, we get here.
    return None

Here is that function applied to the first row. As expected, it gives the same value we calculated above:

apply_on_row(galton.iloc[0])
-5.299999999999997

Here is the function applied to the second row:

apply_on_row(galton.iloc[1])
2.200000000000003
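
The function only needs something it can index with the labels gender, childHeight, mother and father, so we can also try it on rows we build by hand. The sketch below repeats the function from above and calls it on made-up rows (the heights are invented, not from Galton’s data), including a row where the gender is neither female nor male.

```python
import pandas as pd

def apply_on_row(row):
    if row['gender'] == 'female':
        return row['childHeight'] - row['mother']
    elif row['gender'] == 'male':
        return row['childHeight'] - row['father']
    # If neither female nor male, we get here.
    return None

# Made-up rows for illustration; these are not from Galton's data.
female_row = pd.Series({'father': 70.0, 'mother': 65.0,
                        'gender': 'female', 'childHeight': 66.5})
male_row = pd.Series({'father': 70.0, 'mother': 65.0,
                      'gender': 'male', 'childHeight': 66.5})
other_row = pd.Series({'father': 70.0, 'mother': 65.0,
                       'gender': 'unknown', 'childHeight': 66.5})

female_diff = apply_on_row(female_row)  # Child minus mother.
male_diff = apply_on_row(male_row)  # Child minus father.
other_result = apply_on_row(other_row)  # Falls through to None.
print(female_diff, male_diff, other_result)
```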

We can apply this function to every row by using the DataFrame apply method with axis=1. axis=1 means that apply works across the second axis, that is, across the columns. Working across the columns, the function gets one row at a time; each row holds the values from all the columns for that row.

When we .apply the function, it returns a new Series, where each value is the result of applying the apply_on_row function to one row:

subtracted_height = galton.apply(apply_on_row, axis=1)
subtracted_height
0     -5.3
1      2.2
2      2.0
3      2.0
4     -2.0
      ... 
929    2.0
930   -4.0
931   -5.0
932    4.0
933   -6.0
Length: 934, dtype: float64
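
For comparison, a calculation this simple can also be done on whole columns at once, without calling a function on each row. The sketch below is one possible alternative, using NumPy’s select function on a small made-up DataFrame (the heights are invented, and the name heights is our own). Rows where the gender is neither female nor male get a missing value (NaN), to mirror the None that apply_on_row returns.

```python
import numpy as np
import pandas as pd

# A small made-up DataFrame with the same column names as `galton`.
heights = pd.DataFrame({
    'father': [70.0, 70.0, 68.0, 68.0],
    'mother': [65.0, 65.0, 63.0, 63.0],
    'gender': ['male', 'female', 'female', 'unknown'],
    'childHeight': [71.0, 64.0, 66.5, 67.0],
})

is_female = heights['gender'] == 'female'
is_male = heights['gender'] == 'male'

# For each row, use the first choice whose condition is True;
# use NaN when no condition is True.
diff_values = np.select(
    [is_female, is_male],
    [heights['childHeight'] - heights['mother'],
     heights['childHeight'] - heights['father']],
    default=np.nan)

height_diff = pd.Series(diff_values, index=heights.index)
print(height_diff)
```

The apply version is often easier to read and to extend; the column-wise version is usually faster on large DataFrames.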

We could put this calculation back into the original DataFrame as a new column:

galton['height_diff'] = subtracted_height
galton.head()
   family  father  mother  midparentHeight  children  childNum  gender  childHeight  height_diff
0     001    78.5    67.0            75.43         4         1    male         73.2         -5.3
1     001    78.5    67.0            75.43         4         2  female         69.2          2.2
2     001    78.5    67.0            75.43         4         3  female         69.0          2.0
3     001    78.5    67.0            75.43         4         4  female         69.0          2.0
4     002    75.5    66.5            73.66         4         1    male         73.5         -2.0