Kernel density estimation

import numpy as np
import pandas as pd
import scipy.stats as sps
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('mode.copy_on_write', True)

Also see: Kernel density estimation in the Python Data Science Handbook.

Let us go back to the Penguin dataset:

penguins = pd.read_csv('data/penguins.csv').dropna()
penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

333 rows × 8 columns

In fact, let’s imagine we’re getting ready to do some prediction task, so we’ve split the data into a test and train set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    penguins[['bill_depth_mm']],
    penguins['species'])
X_train
bill_depth_mm
98 16.1
203 14.1
286 17.8
192 13.7
199 15.9
... ...
127 18.3
294 18.6
188 13.7
285 19.9
14 21.1

249 rows × 1 columns
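An aside: train_test_split picks the split at random, so the exact rows you see above (and the plots below) will differ from run to run. If you want a reproducible split, pass the random_state argument, for example:

X_train, X_test, y_train, y_test = train_test_split(
    penguins[['bill_depth_mm']],
    penguins['species'],
    random_state=42)  # Any fixed integer makes the split repeatable.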

For the moment, we’re only going to look at the Chinstrap penguins.

# Select the Chinstrap penguins from the training set.
chinstrap_train = X_train[y_train == 'Chinstrap']
chinstrap_train
bill_depth_mm
286 17.8
341 18.2
322 17.9
312 18.3
303 19.0
280 19.8
276 17.9
284 18.9
328 17.3
298 16.6
283 18.2
281 17.8
305 20.0
277 19.5
337 16.5
282 18.2
291 19.6
335 19.4
296 17.3
317 17.5
336 19.5
338 17.0
323 19.6
316 19.5
299 19.4
334 18.8
313 20.7
330 17.3
329 19.7
326 16.4
314 16.6
300 17.9
320 17.9
332 16.6
293 17.8
292 20.0
333 19.9
311 16.8
340 18.1
339 19.8
331 18.8
287 20.3
327 19.0
307 20.8
306 16.6
321 18.5
294 18.6
285 19.9
len(chinstrap_train)
48

We’re particularly interested in the histogram of the bill depth. We’ll use the density version of the histogram, normalized so that the total area of the bars is 1:

bill_depth = chinstrap_train['bill_depth_mm']
plt.hist(bill_depth, density=True);
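With density=True, Matplotlib scales the bar heights so the total bar area is 1. We can check that by repeating the calculation with np.histogram, which returns the bar heights and bin edges:

# Heights and bin edges of the density histogram.
heights, edges = np.histogram(bill_depth, density=True)
# Total bar area: sum over bars of height * width.
np.sum(heights * np.diff(edges))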

We can also mark where the actual points fall on the x-axis, like this:

plt.hist(bill_depth, density=True);
tick_y = np.zeros(len(bill_depth)) - 0.01
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.title('Density histogram with 10 bins, point markers');

Notice this histogram is rather blocky. We could try increasing the number of bins, but we’ve only got 48 values to play with:

plt.hist(bill_depth, bins=20, density=True);
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.title('Density histogram with 20 bins');

Oops - that’s even worse. We can see that small variations in the bill depths are going to push the blocks back and forth, changing our estimate of the densities.

We’d like some estimate of what this density histogram would look like if we had a lot more data.

This estimate is called a kernel density estimate.

We are going to blur the histogram by replacing each point that stacks up into the bars with a little normal distribution. Let’s give these normal distributions a standard deviation of (arbitrarily) 0.25.

dist_sd = 0.25
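This standard deviation - often called the bandwidth - is the main tuning knob for a kernel density estimate, and 0.25 is just our guess. One common rule of thumb, Scott’s rule, scales the sample standard deviation by n ** (-1/5); here that would suggest a somewhat wider kernel:

# Scott's rule-of-thumb bandwidth for a 1D Gaussian KDE.
bill_depth.std() * len(bill_depth) ** (-1 / 5)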

First consider the maximum point:

bd_max = np.max(bill_depth)
bd_max
np.float64(20.8)

We’ll make a little normal distribution centered on that point:

# Distribution with mean bd_max, standard deviation dist_sd.
max_nd = sps.norm(bd_max, dist_sd)
# x axis values for which to calculate y-axis values.
nd_x = np.linspace(
    bill_depth.min() - 4 * dist_sd,
    bill_depth.max() + 4 * dist_sd,
    1000
)

We can plot that on the same plot as before.

# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
plt.fill_between(nd_x, max_nd.pdf(nd_x), alpha=0.5, color='r')
plt.title('Gaussian corresponding to maximum value');

Here we’ve replaced the single point (the maximum) with a spread-out version of that point.
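The probability density function of a normal distribution integrates to 1, so the area under this spread-out point is still (very nearly) 1. We can check numerically, by adding up height times width over our x values:

# Approximate area under the curve: sum of heights * gap between x values.
np.sum(max_nd.pdf(nd_x)) * (nd_x[1] - nd_x[0])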

Now let’s do the same for all the points.

# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
# A normal distribution for each point.
for point in bill_depth:
    dist_for_pt = sps.norm(point, dist_sd)
    plt.fill_between(nd_x, dist_for_pt.pdf(nd_x), alpha=0.1,
                     color='r')
plt.title('Gaussians comprising KDE estimate');

Now add up all these normal distributions:

y_values = np.zeros(len(nd_x))
for point in bill_depth:
    dist_for_pt = sps.norm(point, dist_sd)
    y_values += dist_for_pt.pdf(nd_x)

This is the shape of our kernel density estimate - we still need to get the y-axis scaling right; we will do that a few cells further down.

plt.fill_between(nd_x, y_values, alpha=0.5, color='r');
plt.title('KDE before normalizing area to 1');
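Each of the 48 normal distributions has area (very nearly) 1, so the area under this summed curve is close to 48 - that is the scaling we now have to undo:

# Each Gaussian contributes area ~1, so total area is ~len(bill_depth).
np.sum(y_values) * (nd_x[1] - nd_x[0])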

Finally, divide this curve by the area under the curve to get the kernel density estimate.

# The gap between the x values
delta_x = np.diff(nd_x)[0]
# Divide by sum and by gap
kde_y = y_values / np.sum(y_values) / delta_x
plt.hist(bill_depth, bins=20, density=True);
# Black marks at the actual values.
plt.plot(bill_depth, tick_y, '|k', markeredgewidth=1);
# Show the kernel density estimate
plt.fill_between(nd_x, kde_y, alpha=0.5, color='r')
plt.title('KDE and raw density histogram');

This is our kernel density estimate.
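By construction the curve now has area 1, like the density histogram. And because the summed curve had area close to the number of points, dividing by the numerical area is almost exactly the same as dividing by len(bill_depth):

# Area under the KDE is 1, as for any probability density.
print(np.sum(kde_y) * delta_x)
# Dividing the summed curve by n gives (almost) the same values.
print(np.allclose(kde_y, y_values / len(bill_depth), rtol=1e-3))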

We get the same estimate from Scikit-learn.

from sklearn.neighbors import KernelDensity

# Instantiate and fit the KDE model.
skl_kde = KernelDensity(bandwidth=dist_sd, kernel='gaussian')
# Scikit-learn expects a 2D (n_samples, n_features) input; the
# one-column chinstrap_train DataFrame has that shape.
skl_kde.fit(chinstrap_train)

# score_samples returns the log of the probability density.
logprob = skl_kde.score_samples(nd_x[:, None])
# Convert to probability density.
skl_kde_y = np.exp(logprob)

# This gives the same plot as our original calculation.
assert np.allclose(kde_y, skl_kde_y)  # Values the same.
plt.fill_between(nd_x, skl_kde_y, alpha=0.5, color='r');
plt.title('KDE from Scikit-learn');
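Scipy has its own implementation, sps.gaussian_kde, as does Seaborn (via kdeplot). Both specify the bandwidth as a multiple of the sample standard deviation of the data, so - assuming that scaling convention - we divide dist_sd by the standard deviation to match our kernels:

# Scipy scales its bandwidth factor by the sample standard deviation,
# so divide that out to get kernels with standard deviation dist_sd.
scipy_kde = sps.gaussian_kde(bill_depth, bw_method=dist_sd / bill_depth.std())
assert np.allclose(kde_y, scipy_kde(nd_x), rtol=1e-3)
# Seaborn's kdeplot wraps the same scipy machinery.
sns.kdeplot(x=bill_depth, bw_method=dist_sd / bill_depth.std());
plt.title('KDE from Seaborn');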