k-means clustering

Also see: k-means clustering in the Python Data Science Handbook. Much of this page draws on that one.

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
import seaborn as sns

We will use the famous iris data set. Quoting from its Wikipedia page:

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

iris = pd.read_csv('data/iris.csv')
iris
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns
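
As a quick check on the description above, we can count the rows for each species name; there should be 50 of each.

# Number of rows for each species.
iris['Name'].value_counts()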

sns.pairplot(iris, hue='Name')
[Pair plot of the four iris measurements, colored by species name.]
features = ['PetalWidth', 'PetalLength']
measures = iris[features]
measures
PetalWidth PetalLength
0 0.2 1.4
1 0.2 1.4
2 0.2 1.3
3 0.2 1.5
4 0.2 1.4
... ... ...
145 2.3 5.2
146 1.9 5.0
147 2.0 5.2
148 2.3 5.4
149 1.8 5.1

150 rows × 2 columns

sns.scatterplot(iris,
                x=features[0],
                y=features[1],
                hue='Name')
[Scatter plot of PetalLength against PetalWidth, colored by species name.]

K-means is a technique for splitting up (partitioning) the data into clusters — groups of nearby points. Here is Scikit-learn’s k-means implementation. We use it to identify clusters in the data automatically, without using the species labels. We define the clusters by their centers. Notice though that the clusters it finds are very similar to the groupings by species.

from sklearn.cluster import KMeans

# n_init is the number of different starting states to try.
# The algorithm depends to some extent on starting state.
kmeans_model = KMeans(n_clusters=3, n_init=10)
kmeans_model.fit(measures)
cluster_nos = kmeans_model.predict(measures)
# The measures with Scikit-learn's cluster labels.
labeled_measures = measures.copy()
labeled_measures['cluster'] = cluster_nos
labeled_measures
PetalWidth PetalLength cluster
0 0.2 1.4 0
1 0.2 1.4 0
2 0.2 1.3 0
3 0.2 1.5 0
4 0.2 1.4 0
... ... ... ...
145 2.3 5.2 1
146 1.9 5.0 1
147 2.0 5.2 1
148 2.3 5.4 1
149 1.8 5.1 1

150 rows × 3 columns

These are Scikit-learn’s cluster centers:

kmeans_model.cluster_centers_
array([[0.244     , 1.464     ],
       [2.0375    , 5.59583333],
       [1.34230769, 4.26923077]])

Here are the clusters displayed graphically. Notice how similar they are to the actual species groupings, which we have not used here.

sns.scatterplot(labeled_measures,
                x=features[0],
                y=features[1],
                hue='cluster')
plt.scatter(kmeans_model.cluster_centers_[:, 0],
            kmeans_model.cluster_centers_[:, 1],
            color='r', s=100, alpha=0.5);
[Scatter plot of PetalLength against PetalWidth, colored by Scikit-learn's cluster labels, with the cluster centers marked as large red points.]

The algorithm of k-means is as follows. We specify we want \(k\) clusters, then:

  1. Select \(k\) points at random from the set to be the starting estimates of the cluster centers.

  2. Repeat the following until the cluster center estimates do not change:

    A. Calculate the distance of each point in the dataset to each of the \(k\) cluster centers.
    B. Allocate each point to one of the clusters by choosing the cluster center with the shortest distance to the point.
    C. With these allocations, calculate new estimated cluster centers by averaging the positions of all points in each cluster.
    D. Check whether the new cluster center estimates have changed. If not, we are finished.

Remember the loop here as:

A. Calculate the distance.
B. Allocate each point to one of the centers.
C. Calculate new centers.
D. Check.
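
Before we work through these steps one at a time with pandas, here is a compact NumPy sketch of the whole loop, just to show the algorithm in one place. This is only a sketch under simple assumptions: the kmeans_sketch function name is ours, it assumes no cluster ever ends up empty, and because the starting points are random, the cluster numbering can differ from the runs below.

def kmeans_sketch(points, k, rng):
    """ Minimal k-means loop: calculate, allocate, calculate, check. """
    # Step 1: select k points at random as starting center estimates.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2A: distance of each point to each center.
        diffs = points[:, None, :] - centers[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # Step 2B: allocate each point to its nearest center.
        point_labels = dists.argmin(axis=1)
        # Step 2C: new centers are the means of the allocated points
        # (assumes each cluster keeps at least one point).
        new_centers = np.array([points[point_labels == i].mean(axis=0)
                                for i in range(k)])
        # Step 2D: check - stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            return new_centers, point_labels
        centers = new_centers

# Run the sketch on the same two measures (the order of the centers is arbitrary).
sketch_centers, sketch_labels = kmeans_sketch(
    measures.to_numpy(), 3, np.random.default_rng())
sketch_centers

Now let's carry out the same steps one at a time with pandas.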

# Step 1: Select k points at random as starting estimates.
n_clusters = 3
cluster_names = np.array(['c0', 'c1', 'c2'])
df = iris[features]
# .sample does sampling without replacement by default.
centers = df.sample(n_clusters).set_index(cluster_names)
centers
PetalWidth PetalLength
c0 1.0 3.7
c1 0.2 1.9
c2 1.9 5.1

This is step 2A - calculate the distance of each point to each center.

def distance(pt1, pt2):
    return np.sqrt(np.sum((pt1 - pt2) ** 2))
# An example distance.
distance(df.iloc[0], centers.iloc[0])
np.float64(2.4351591323771844)

Calculate the distances between all cluster centers and all points:

distances = pd.DataFrame()
# Distances of all points to cluster center 0
distances['c0'] = df.apply(
    distance,
    pt2=centers.iloc[0],
    axis=1)
# Distances of all points to cluster center 1
distances['c1'] = df.apply(
    distance,
    pt2=centers.iloc[1],
    axis=1)
# Distances of all points to cluster center 2
distances['c2'] = df.apply(
    distance,
    pt2=centers.iloc[2],
    axis=1)
distances
c0 c1 c2
0 2.435159 0.500000 4.071855
1 2.435159 0.500000 4.071855
2 2.529822 0.600000 4.162932
3 2.340940 0.400000 3.981206
4 2.435159 0.500000 4.071855
... ... ... ...
145 1.984943 3.911521 0.412311
146 1.581139 3.535534 0.100000
147 1.802776 3.758989 0.141421
148 2.140093 4.081666 0.500000
149 1.612452 3.577709 0.100000

150 rows × 3 columns

# We can write the code above in a more compact way.
distances = pd.DataFrame()
for center_no, cluster_name in enumerate(cluster_names):
    distances[cluster_name] = df.apply(
        distance,
        pt2=centers.iloc[center_no],
        axis=1)
distances
c0 c1 c2
0 2.435159 0.500000 4.071855
1 2.435159 0.500000 4.071855
2 2.529822 0.600000 4.162932
3 2.340940 0.400000 3.981206
4 2.435159 0.500000 4.071855
... ... ... ...
145 1.984943 3.911521 0.412311
146 1.581139 3.535534 0.100000
147 1.802776 3.758989 0.141421
148 2.140093 4.081666 0.500000
149 1.612452 3.577709 0.100000

150 rows × 3 columns

Here are the points and the distances of the points from the three current center estimates:

pd.concat([df, distances], axis=1).round(4)
PetalWidth PetalLength c0 c1 c2
0 0.2 1.4 2.4352 0.5000 4.0719
1 0.2 1.4 2.4352 0.5000 4.0719
2 0.2 1.3 2.5298 0.6000 4.1629
3 0.2 1.5 2.3409 0.4000 3.9812
4 0.2 1.4 2.4352 0.5000 4.0719
... ... ... ... ... ...
145 2.3 5.2 1.9849 3.9115 0.4123
146 1.9 5.0 1.5811 3.5355 0.1000
147 2.0 5.2 1.8028 3.7590 0.1414
148 2.3 5.4 2.1401 4.0817 0.5000
149 1.8 5.1 1.6125 3.5777 0.1000

150 rows × 5 columns

Step 2B - allocate each point to a cluster. We choose a cluster for each point by finding the cluster center with the shortest distance to that point.

# Step 2B - allocate each point to a cluster by choosing nearest center
labels = distances.idxmin(axis=1)
labels
0      c1
1      c1
2      c1
3      c1
4      c1
       ..
145    c2
146    c2
147    c2
148    c2
149    c2
Length: 150, dtype: object
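
As a quick check on the allocation, we can count how many points ended up in each cluster. The exact counts depend on the random starting centers.

# Number of points allocated to each cluster.
labels.value_counts()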

Step 2C - calculate new estimated cluster centers. We estimate the new center of each cluster by averaging the coordinates of the points we have just allocated to that cluster.

new_centers = df.groupby(labels).mean().set_index(cluster_names)
new_centers
PetalWidth PetalLength
c0 1.210345 3.955172
c1 0.244000 1.464000
c2 1.866197 5.294366

The next step (2D) is to check whether centers and new_centers are the same - in which case, we have finished the search, and we can stop.
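
For example, the check can use the same comparison that appears in the full loop below. Here it will almost certainly give False, because the new centers have just moved away from the random starting points.

# Step 2D: check whether the center estimates have stopped changing.
np.all(centers == new_centers)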

Now let’s run the whole search procedure.

# Make a new random number generator with known state.
# We do this to make sure we get the optimum k-means
# "by accident" on the first run.
rng2 = np.random.default_rng(42)
# Choose random points from set.
centers = df.sample(n_clusters,
                    random_state=rng2  # Use predictable rng
                   ).set_index(cluster_names)

# Repeat for a long time, if necessary.
for i in range(1000):

    # Find distances of each point to each center.
    distances = pd.DataFrame()
    for center_no, cluster_name in enumerate(cluster_names):
        distances[cluster_name] = df.apply(
            distance,
            pt2=centers.iloc[center_no],
            axis=1)
    # Allocate each point to one of the cluster (centers) by
    # choosing the closest center.
    labels = distances.idxmin(axis=1)

    # Make new centers with mean of points in cluster.
    new_centers = df.groupby(labels).mean().set_index(cluster_names)

    # See if we have the same centers as before.
    if np.all(centers == new_centers):
        break  # If same, then stop, we've finished.
    # Otherwise continue
    centers = new_centers
    # Show the current estimated centers.
    print(f'Centers after iteration {i}')
    display(centers)
Centers after iteration 0
PetalWidth PetalLength
c0 1.192308 3.903846
c1 0.244000 1.464000
c2 1.845946 5.258108
Centers after iteration 1
PetalWidth PetalLength
c0 1.266667 4.105128
c1 0.244000 1.464000
c2 1.937705 5.418033
Centers after iteration 2
PetalWidth PetalLength
c0 1.302174 4.191304
c1 0.244000 1.464000
c2 1.994444 5.514815
Centers after iteration 3
PetalWidth PetalLength
c0 1.310417 4.220833
c1 0.244000 1.464000
c2 2.013462 5.538462
Centers after iteration 4
PetalWidth PetalLength
c0 1.339216 4.254902
c1 0.244000 1.464000
c2 2.026531 5.583673
Centers after iteration 5
PetalWidth PetalLength
c0 1.342308 4.269231
c1 0.244000 1.464000
c2 2.037500 5.595833
# Plot the points with their cluster labels.
sns.scatterplot(
    df,
    x=features[0],
    y=features[1],
    hue=labels)
# Plot the centers.
sns.scatterplot(
    centers,
    x=features[0],
    y=features[1],
    color='r',
    s=100,
    alpha=0.5);
[Scatter plot of PetalLength against PetalWidth, colored by our cluster labels, with our estimated cluster centers marked as large red points.]
centers
PetalWidth PetalLength
c0 1.342308 4.269231
c1 0.244000 1.464000
c2 2.037500 5.595833

Notice that the centers we found are the same as those that Scikit-learn found. Scikit-learn uses some tricks, such as trying several different starting states (the n_init argument above), to make sure it finds the right centers and does not get stuck with some bad starting points. We cheated to make sure of this above, by seeding the random number generator.

kmeans_model.cluster_centers_
array([[0.244     , 1.464     ],
       [2.0375    , 5.59583333],
       [1.34230769, 4.26923077]])
# Scikit-learn clusters are (near as dammit) the same as
# the ones we found.  The order of the centers is arbitrary,
# solve by sorting.
assert np.allclose(
    np.sort(kmeans_model.cluster_centers_, axis=0),
    np.sort(centers, axis=0))

One measure of how well the points match the clusters is inertia - the sum of the squared distances between each point and its corresponding cluster center.

def center_difference_sq(sub_df):
    """ Squared difference between each point and matching center
    """
    return (sub_df - sub_df.mean(axis=0)) ** 2
# .transform applies the function to each sub-dataframe.
df.groupby(labels).transform(center_difference_sq).sum().sum()
np.float64(31.387758974358974)

Scikit-learn has already done the same calculation, and stores the result as the inertia_ attribute of the fitted model.

kmeans_model.inertia_
31.38775897435897
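
We can also get the same number from the distances we calculated in the final pass of our own loop. When the loop stopped, distances held the distances from every point to the final centers, so the inertia is the sum of the squared minimum distances. This is a quick sketch; it should match the two values above, up to small floating point differences.

# Sum of squared distances from each point to its nearest (final) center.
(distances.min(axis=1) ** 2).sum()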