Apply functions#

Functions are named recipes. We often find that we want to apply such a recipe to each value in a Pandas Series.

This is the job of the apply method of a Series.

To start, we load the familiar Ratings data, giving the average ratings on various measures for all rated professors teaching a subject, such as English, mathematics, and so on.

import numpy as np
import pandas as pd

# Safe setting for using Pandas.
pd.set_option('mode.copy_on_write', True)

ratings = pd.read_csv('rate_my_course.csv')
ratings.head()

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
0	English	23343	3.756147	3.821866	3.791364	3.162754
1	Mathematics	22394	3.487379	3.641526	3.566867	3.063322
2	Biology	11774	3.608331	3.701530	3.657641	2.710459
3	Psychology	11179	3.909520	3.887536	3.900949	3.316210
4	History	11145	3.788818	3.753642	3.773746	3.053803

There are 75 rows in this dataset, one row for each of 75 subjects:

n = len(ratings)
n

Now let us say that we are interested in the Easiness ratings:

easiness = ratings['Easiness']
easiness

0     3.162754
1     3.063322
2     2.710459
3     3.316210
4     3.053803
        ...   
70    2.863504
71    3.106727
72    3.309636
73    2.799135
74    3.109118
Name: Easiness, Length: 75, dtype: float64

We decide we want to classify each subject into one of three groups:

“Easy” for courses that have Easiness scores above the 75th percentile for Easiness.
“Hard” for courses below the 25th percentile.
“Medium” for courses between the 25th and 75th percentiles.

You might first wonder how to get the percentiles. One way is to use the percentile function from Numpy. For example, the median is the value at the 50th percentile, meaning that half the values are below the median and half are above (it’s a little more complicated than that, but that’s right as a first pass).

Here’s the median:

np.median(easiness)

3.19430041152263

This is, by definition, the 50th percentile:

np.percentile(easiness, 50)

3.19430041152263

Here are the 25th and 75th percentiles:

easy_25 = np.percentile(easiness, 25)
print('25% percentile is', easy_25)
easy_75 = np.percentile(easiness, 75)
print('75% percentile is', easy_75)

25% percentile is 3.0283298724604153
75% percentile is 3.34694063174731
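
As an aside, np.percentile also accepts a sequence of percentiles, so both cut-offs can come from a single call. The sketch below uses a short made-up Series (the `scores` values are hypothetical, not the ratings data). It also shows the quantile method of a Series, which takes fractions between 0 and 1 rather than percentages.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, standing in for the Easiness column.
scores = pd.Series([2.7, 3.0, 3.1, 3.2, 3.4])

# Ask for both percentiles in one call; the result is an array
# with one value per requested percentile.
low, high = np.percentile(scores, [25, 75])
print('25th percentile is', low)
print('75th percentile is', high)

# The Series quantile method uses fractions (0.25), not percentages (25).
q_low = scores.quantile(0.25)
q_high = scores.quantile(0.75)
print(q_low, q_high)
```

With their default settings, both routes should give the same values for the same data.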

We can then write a function that accepts a value (call the value v) and returns the classification, using the easy_25 and easy_75 values.

Notice that the function can see the easy_25 and easy_75 values in the top-level workspace:

def classify_easy(v):
    if v < easy_25:
        return 'Hard'
    if v > easy_75:
        return 'Easy'
    return 'Medium'

We expect this function to return “Hard” for a value of 3 (it’s below the 25th percentile):

classify_easy(3)

'Hard'

It should return “Easy” for a value of 3.4 (it’s above the 75th percentile):

classify_easy(3.4)

'Easy'

An intermediate value should give “Medium”:

classify_easy(3.2)

'Medium'

Now let us imagine we want to apply this function to all the values in the easiness series.

We could do this laboriously, by making an array to store the values, like this:

classified = np.repeat(['Unknown'], n)
classified[:10]

array(['Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown'], dtype='<U7')

This is an array with 75 values of the string “Unknown”. We are going to replace the “Unknown” values with the classifications from calling classify_easy on each of the easiness scores in turn, like this:

# The long way to apply classify_easy to easiness.
for i in np.arange(n):
    # Get the easiness value at this position.
    value = easiness.iloc[i]
    # Call `classify_easy` on this value, put it into
    # the `classified` array.
    classified[i] = classify_easy(value)
classified[:10]

array(['Medium', 'Medium', 'Hard', 'Medium', 'Medium', 'Hard', 'Easy',
       'Medium', 'Medium', 'Hard'], dtype='<U7')
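
One thing to watch with this approach: the array made by np.repeat has a fixed-width string dtype ('<U7', meaning at most 7 characters), so a longer label would be silently cut short. The three labels used here all fit, so the loop above is fine. The sketch below uses a made-up longer label to show the truncation, and one possible way around it, assuming an object dtype array is acceptable.

```python
import numpy as np

# Same construction as above: a fixed-width string array.
labels = np.repeat(['Unknown'], 3)
labels[0] = 'Very hard'  # 9 characters, more than the 7 allowed.
print(labels[0])  # 'Very ha', cut to 7 characters.

# An object array stores ordinary Python strings of any length.
flexible = np.repeat(np.array(['Unknown'], dtype=object), 3)
flexible[0] = 'Very hard'
print(flexible[0])  # 'Very hard'
```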

That is the long way to do the task, because Pandas Series have an apply method to do just that. We send the apply method the function we want to apply to each value in the Series. It calls the function on each value, and returns the results as a new Series:

classified_series = easiness.apply(classify_easy)
classified_series

0     Medium
1     Medium
2       Hard
3     Medium
4     Medium
       ...  
70      Hard
71    Medium
72    Medium
73      Hard
74    Medium
Name: Easiness, Length: 75, dtype: object
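
The apply method can also pass extra keyword arguments through to the function, which avoids relying on variables in the top-level workspace. The following is a minimal sketch on made-up data. The names `classify_with_limits`, `low` and `high` are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def classify_with_limits(v, low, high):
    """Classify v as 'Hard', 'Easy' or 'Medium' using explicit limits."""
    if v < low:
        return 'Hard'
    if v > high:
        return 'Easy'
    return 'Medium'

# Hypothetical scores and limits, not taken from the ratings data.
scores = pd.Series([2.7, 3.1, 3.5])

# Keyword arguments that apply does not use itself are passed on
# to the function for each value.
groups = scores.apply(classify_with_limits, low=3.0, high=3.4)
print(groups)
```

The function now states everything it depends on in its signature, which can make it easier to reuse with different cut-offs.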

To see the results, we might make a new DataFrame to show the original Easiness scores and the classification side-by-side.

df = pd.DataFrame()
df['Easiness'] = easiness
df['Easiness group'] = classified_series
df

	Easiness	Easiness group
0	3.162754	Medium
1	3.063322	Medium
2	2.710459	Hard
3	3.316210	Medium
4	3.053803	Medium
...	...	...
70	2.863504	Hard
71	3.106727	Medium
72	3.309636	Medium
73	2.799135	Hard
74	3.109118	Medium

75 rows × 2 columns

`apply` and DataFrames#

In fact, DataFrames also have an apply method that does a similar thing to the apply method of a Series, but on a DataFrame.

Remember, the apply method of the Series calls the supplied function on each value in the Series.

The apply method of a DataFrame calls the supplied function on each row or each column of the DataFrame.

You can specify whether you want to apply the function to each row, or to each column, with the axis keyword argument to the apply method.

This is easier to see in practice than to describe. We will practice DataFrame .apply on a famous dataset collected by Sir Francis Galton.
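
Before turning to that dataset, a tiny made-up DataFrame can make the axis argument concrete. In this sketch the names `small` and `total` are hypothetical. The function receives a whole Series each time: one column when axis=0, one row when axis=1.

```python
import pandas as pd

# A small made-up DataFrame with two columns and three rows.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def total(values):
    # `values` is a Series: one column (axis=0) or one row (axis=1).
    return values.sum()

per_column = small.apply(total, axis=0)  # One result per column.
per_row = small.apply(total, axis=1)     # One result per row.
print(per_column)
print(per_row)
```

The result for axis=0 is labeled by column name, and the result for axis=1 is labeled by the row index.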

Galton’s dataset#

The data we will use relates to a famous paper by Francis Galton, published in 1886. Galton was a versatile scientist who laid the groundwork for early statistics, and particularly regression and correlation. The paper we are interested in here is:

Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute, 15, 246-263.

In fact, this paper is the origin of the term regression for fitting prediction lines to data.

Galton was a keen eugenicist, and was very interested in inheritance. In this case he studied the relationship of children’s heights to the heights of their parents.

Galton asked families to give him data about:

The father’s height
The mother’s height
The height and gender of each adult child in the family.

You can read more about the data files at the Galton heights datasets page.

The galton_combined.csv file has the data Galton used in his paper:

galton = pd.read_csv('galton_combined.csv')
galton.head()

	family	father	mother	midparentHeight	children	childNum	gender	childHeight
0	001	78.5	67.0	75.43	4	1	male	73.2
1	001	78.5	67.0	75.43	4	2	female	69.2
2	001	78.5	67.0	75.43	4	3	female	69.0
3	001	78.5	67.0	75.43	4	4	female	69.0
4	002	75.5	66.5	73.66	4	1	male	73.5

Each row is one child. For each child we have their father’s height, their mother’s height, the child’s gender, and the child’s height, among other values.

All heights are in inches.

DataFrame `apply`#

Like Galton, we are interested in the heritability of height. For example, we may be interested in the difference between the height of the parents and the height of the children. To do this, we may want to subtract a parent’s height from the child’s height, to get a height difference.

One factor we have to take into account is that males are taller, on average, than females. A very crude way to adjust for this is to subtract the mother’s height from the height of the female children, and the father’s height from the height of the male children.

For example, here is the first row:

first_row = galton.iloc[0]
first_row

family               001
father              78.5
mother              67.0
midparentHeight    75.43
children               4
childNum               1
gender              male
childHeight         73.2
Name: 0, dtype: object

This is a male child, so the difference we want is:

first_difference = first_row['childHeight'] - first_row['father']
first_difference

-5.299999999999997

Here is the second row:

second_row = galton.iloc[1]
second_row

family                001
father               78.5
mother               67.0
midparentHeight     75.43
children                4
childNum                2
gender             female
childHeight          69.2
Name: 1, dtype: object

The difference we want for this female child is:

second_difference = second_row['childHeight'] - second_row['mother']
second_difference

2.200000000000003

To do this calculation for any given row, we could make a function that accepts a row as its argument, and does the calculation. It might look like this:

def apply_on_row(row):
    if row['gender'] == 'female':
        return row['childHeight'] - row['mother']
    elif row['gender'] == 'male':
        return row['childHeight'] - row['father']
    # If neither female nor male, we get here.
    return None

Here is that function applied to the first row. As expected, it gives the same value we calculated above:

apply_on_row(galton.iloc[0])

-5.299999999999997

Here is the function applied to the second row:

apply_on_row(galton.iloc[1])

2.200000000000003

We can apply this function to every row, returning a Series, by using the DataFrame apply method with axis=1. axis=1 means the function is applied along the second axis, that is, across the columns. Going across the columns, the function gets one row at a time, where a row holds the values from every column for that row.

When we .apply the function, it returns a new Series, where each value is the result of applying the apply_on_row function to one row:

subtracted_height = galton.apply(apply_on_row, axis=1)
subtracted_height

0     -5.3
1      2.2
2      2.0
3      2.0
4     -2.0
      ... 
929    2.0
930   -4.0
931   -5.0
932    4.0
933   -6.0
Length: 934, dtype: float64
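
To see how the None fallback in the row function behaves, here is a sketch on a tiny made-up DataFrame with the same column names. The heights and the 'unknown' label are invented for illustration, and the row function is repeated so the block runs on its own. The row with an unrecognized gender ends up as a missing value in the result.

```python
import pandas as pd

def apply_on_row(row):
    if row['gender'] == 'female':
        return row['childHeight'] - row['mother']
    elif row['gender'] == 'male':
        return row['childHeight'] - row['father']
    # If neither female nor male, we get here.
    return None

# Hypothetical rows; the third has a gender the function does not handle.
mini = pd.DataFrame({
    'father': [70.0, 70.0, 70.0],
    'mother': [64.0, 64.0, 64.0],
    'gender': ['male', 'female', 'unknown'],
    'childHeight': [71.0, 65.5, 68.0],
})

diffs = mini.apply(apply_on_row, axis=1)
print(diffs)
# True where the function returned None.
print(diffs.isna())
```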

We could put this calculation back into the original DataFrame as a new column:

galton['height_diff'] = subtracted_height
galton.head()

	family	father	mother	midparentHeight	children	childNum	gender	childHeight	height_diff
0	001	78.5	67.0	75.43	4	1	male	73.2	-5.3
1	001	78.5	67.0	75.43	4	2	female	69.2	2.2
2	001	78.5	67.0	75.43	4	3	female	69.0	2.0
3	001	78.5	67.0	75.43	4	4	female	69.0	2.0
4	002	75.5	66.5	73.66	4	1	male	73.5	-2.0
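
For a simple rule like this one, the same kind of column can also be computed without a row function, by choosing the parent's height column according to gender. This is an alternative sketch on made-up data, not the approach used above. It assumes every row is either 'male' or 'female', because np.where here treats anything that is not 'male' as female.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as the Galton data.
mini = pd.DataFrame({
    'father': [70.0, 72.0],
    'mother': [64.0, 66.0],
    'gender': ['male', 'female'],
    'childHeight': [71.0, 65.5],
})

# Pick the father's height for male children, the mother's otherwise.
same_sex_parent = np.where(mini['gender'] == 'male',
                           mini['father'], mini['mother'])
mini['height_diff'] = mini['childHeight'] - same_sex_parent
print(mini)
```

Working on whole columns like this is often faster than apply on large DataFrames, but apply with a row function remains the more general tool when the rule is hard to express in column operations.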
