Kernel density estimation

import numpy as np
import pandas as pd
import scipy.stats as sps
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('mode.copy_on_write', True)

Also see: Kernel density estimation in the Python Data Science Handbook.

Let us go back to the Penguin dataset:

penguins = pd.read_csv('data/penguins.csv').dropna()
penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

333 rows × 8 columns

In fact, let’s imagine we’re getting ready to do some prediction task, so we’ve split the data into a test and train set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    penguins[['bill_depth_mm']],
    penguins['species'])
X_train
bill_depth_mm
98 16.1
203 14.1
286 17.8
192 13.7
199 15.9
... ...
127 18.3
294 18.6
188 13.7
285 19.9
14 21.1

249 rows × 1 columns
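An aside: train_test_split picks the split at random, so the exact rows you see above (and the plots below) will differ from run to run. If you want a reproducible split, pass the random_state argument, for example:

X_train, X_test, y_train, y_test = train_test_split(
    penguins[['bill_depth_mm']],
    penguins['species'],
    random_state=42)  # Any fixed integer makes the split repeatable.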

For the moment, we’re only going to look at the Chinstrap penguins.

# Select the Chinstrap penguins from the training set.
chinstrap_train = X_train[y_train == 'Chinstrap']
chinstrap_train
bill_depth_mm
286 17.8
341 18.2
322 17.9
312 18.3
303 19.0
280 19.8
276 17.9
284 18.9
328 17.3
298 16.6
283 18.2
281 17.8
305 20.0
277 19.5
337 16.5
282 18.2
291 19.6
335 19.4
296 17.3
317 17.5
336 19.5
338 17.0
323 19.6
316 19.5
299 19.4
334 18.8
313 20.7
330 17.3
329 19.7
326 16.4
314 16.6
300 17.9
320 17.9
332 16.6
293 17.8
292 20.0
333 19.9
311 16.8
340 18.1
339 19.8
331 18.8
287 20.3
327 19.0
307 20.8
306 16.6
321 18.5
294 18.6
285 19.9
len(chinstrap_train)
48

We’re particularly interested in the histogram of the bill depth. We’ll use the density version of the histogram, normalized so that the total area of the bars is 1:

bill_depth = chinstrap_train['bill_depth_mm']
plt.hist(bill_depth, density=True);
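With density=True, Matplotlib scales the bar heights so the total bar area is 1. We can check that by repeating the calculation with np.histogram, which returns the bar heights and bin edges:

# Heights and bin edges of the density histogram.
heights, edges = np.histogram(bill_depth, density=True)
# Total bar area: sum over bars of height * width.
np.sum(heights * np.diff(edges))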

We can also mark where the actual points fall on the x-axis, like this:

plt.hist(bill_depth, density=True);
tick_y = np.zeros(len(bill_depth)) - 0.01
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.title('Density histogram with 10 bins, point markers');

Notice this histogram is rather blocky. We could try increasing the number of bins, but we’ve only got 48 values to play with:

plt.hist(bill_depth, bins=20, density=True);
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.title('Density histogram with 20 bins');

Oops - that’s even worse. We can see that small variations in the bill depths are going to push the blocks back and forth, changing our estimate of the densities.

We’d like some estimate of what this density histogram would look like if we had a lot more data.

This estimate is called a kernel density estimate.

We are going to blur the histogram by replacing each point that stacks up into the bars with a little normal distribution. Let’s give these normal distributions a standard deviation of (arbitrarily) 0.25.

dist_sd = 0.25
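This standard deviation - often called the bandwidth - is the main tuning knob for a kernel density estimate, and 0.25 is just our guess. One common rule of thumb, Scott’s rule, scales the sample standard deviation by n ** (-1/5); here that would suggest a somewhat wider kernel:

# Scott's rule-of-thumb bandwidth for a 1D Gaussian KDE.
bill_depth.std() * len(bill_depth) ** (-1 / 5)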

First consider the maximum point:

bd_max = np.max(bill_depth)
bd_max
np.float64(20.8)

We’ll make a little normal distribution centered on that point:

# Distribution with mean bd_max, standard deviation dist_sd.
max_nd = sps.norm(bd_max, dist_sd)
# x axis values for which to calculate y-axis values.
nd_x = np.linspace(
    bill_depth.min() - 4 * dist_sd,
    bill_depth.max() + 4 * dist_sd,
    1000
)

We can plot that on the same plot as before.

# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.fill_between(nd_x, max_nd.pdf(nd_x), alpha=0.5, color='r')
plt.title('Gaussian corresponding to maximum value');

Here we’ve replaced the single point (the maximum) with a spread-out version of that point.
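The probability density function of a normal distribution integrates to 1, so the area under this spread-out point is still (very nearly) 1. We can check numerically, by adding up height times width over our x values:

# Approximate area under the curve: sum of heights * gap between x values.
np.sum(max_nd.pdf(nd_x)) * (nd_x[1] - nd_x[0])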

Now let’s do the same for all the points.

# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
# A normal distribution for each point.
for point in bill_depth:
    dist_for_pt = sps.norm(point, dist_sd)
    plt.fill_between(nd_x, dist_for_pt.pdf(nd_x), alpha=0.1,
                     color='r')
plt.title('Gaussians comprising KDE estimate');

Now add up all these normal distributions:

y_values = np.zeros(len(nd_x))
for point in bill_depth:
    dist_for_pt = sps.norm(point, dist_sd)
    y_values += dist_for_pt.pdf(nd_x)

This is the shape of our kernel density estimate - we still need to get the y-axis scaling right; we will do that a few cells further down.

plt.fill_between(nd_x, y_values, alpha=0.5, color='r');
plt.title('KDE before normalizing area to 1');
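Each of the 48 normal distributions has area (very nearly) 1, so the area under this summed curve is close to 48 - that is the scaling we now have to undo:

# Each Gaussian contributes area ~1, so total area is ~len(bill_depth).
np.sum(y_values) * (nd_x[1] - nd_x[0])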

Finally, divide this curve by the area under the curve to get the kernel density estimate.

# The gap between the x values
delta_x = np.diff(nd_x)[0]
# Divide by sum and by gap
kde_y = y_values / np.sum(y_values) / delta_x
plt.hist(bill_depth, bins=20, density=True);
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
# Show the kernel density estimate
plt.fill_between(nd_x, kde_y, alpha=0.5, color='r')
plt.title('KDE and raw density histogram');

This is our kernel density estimate.
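By construction the curve now has area 1, like the density histogram. And because the summed curve had area close to the number of points, dividing by the numerical area is almost exactly the same as dividing by len(bill_depth):

# Area under the KDE is 1, as for any probability density.
print(np.sum(kde_y) * delta_x)
# Dividing the summed curve by n gives (almost) the same values.
print(np.allclose(kde_y, y_values / len(bill_depth), rtol=1e-3))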

We get the same estimate from Scikit-learn.

from sklearn.neighbors import KernelDensity

# Instantiate and fit the KDE model.
skl_kde = KernelDensity(bandwidth=dist_sd, kernel='gaussian')
# Scikit-learn expects a 2D (n_samples, n_features) input; the
# one-column chinstrap_train DataFrame has that shape.
skl_kde.fit(chinstrap_train)

# score_samples returns the log of the probability density.
logprob = skl_kde.score_samples(nd_x[:, None])
# Convert to probability density.
skl_kde_y = np.exp(logprob)

# This gives the same plot as our original calculation.
assert np.allclose(kde_y, skl_kde_y)  # Values the same.
plt.fill_between(nd_x, skl_kde_y, alpha=0.5, color='r');
plt.title('KDE from Scikit-learn');
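Scipy has its own implementation, sps.gaussian_kde, as does Seaborn (via kdeplot). Both specify the bandwidth as a multiple of the sample standard deviation of the data, so - assuming that scaling convention - we divide dist_sd by the standard deviation to match our kernels:

# Scipy scales its bandwidth factor by the sample standard deviation,
# so divide that out to get kernels with standard deviation dist_sd.
scipy_kde = sps.gaussian_kde(bill_depth, bw_method=dist_sd / bill_depth.std())
assert np.allclose(kde_y, scipy_kde(nd_x), rtol=1e-3)
# Seaborn's kdeplot wraps the same scipy machinery.
sns.kdeplot(x=bill_depth, bw_method=dist_sd / bill_depth.std());
plt.title('KDE from Seaborn');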