Dimension reduction with Principal Component Analysis

See the PCA introduction for background.

Start by loading the libraries we need, and doing some configuration:

import numpy as np
import numpy.linalg as npl
# Display array values to 6 digits of precision
np.set_printoptions(precision=6, suppress=True)

import pandas as pd
pd.set_option('mode.copy_on_write', True)

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

As in the PCA introduction page, we will use the English Premier League (EPL) table data for this example. See the EPL dataset page for more on this dataset.

df = pd.read_csv('data/premier_league_2021.csv')
df.head()
   rank               team  played  won  drawn  lost  for  against  goal_difference  points  wages_year  keeper  defense  midfield  forward
0     1    Manchester City      38   29      6     3   99       26               73      93      168572    8892    60320     24500    74860
1     2          Liverpool      38   28      8     2   94       26               68      92      148772   14560    46540     47320    40352
2     3            Chelsea      38   21     11     6   76       33               43      74      187340   12480    51220     51100    72540
3     4  Tottenham Hotspur      38   22      5    11   69       40               29      71      110416    6760    29516     30680    43460
4     5            Arsenal      38   22      3    13   61       48               13      69      118074    8400    37024     27300    45350

We want to predict goal difference from spending on the various position types.

We have spending data for the keeper, defense, midfield and forward positions.

y = df['goal_difference']
# Divide spending by 10,000 GBP to make numbers easier to read.
X = df[['keeper', 'defense', 'midfield', 'forward']] / 10_000
X.head()
   keeper  defense  midfield  forward
0  0.8892   6.0320     2.450   7.4860
1  1.4560   4.6540     4.732   4.0352
2  1.2480   5.1220     5.110   7.2540
3  0.6760   2.9516     3.068   4.3460
4  0.8400   3.7024     2.730   4.5350
basic_reg = LinearRegression().fit(X, y)
basic_reg.coef_
array([-15.502165,  24.888585,   6.337194,  -5.453179])

How good is the model fit?

fitted = basic_reg.predict(X)
fitted
array([ 58.702936,  48.899144,  48.61422 ,   6.380567,  19.351939,
        15.741829,  -7.189155,  -0.542271, -14.657475, -10.357978,
         5.580663, -11.703874, -33.987123, -10.908234, -38.285969,
        12.001697, -30.212118, -22.902239, -14.680438, -19.84612 ])
full_r2 = r2_score(y, fitted)
full_r2
0.6593261703105764

It turns out the variables (features) are highly correlated with one another.

X.corr()
            keeper   defense  midfield   forward
keeper    1.000000  0.770622  0.653607  0.836858
defense   0.770622  1.000000  0.792146  0.918498
midfield  0.653607  0.792146  1.000000  0.718424
forward   0.836858  0.918498  0.718424  1.000000

We want to reduce these four variables down to fewer variables, while still capturing the same information.

Let us first run PCA on the four columns:

pca = PCA()
pca = pca.fit(X)

# Scikit-learn's components.
pca.components_
array([[ 0.162258,  0.467464,  0.330791,  0.803571],
       [-0.044097,  0.155829,  0.881052, -0.444433],
       [-0.229876,  0.859756, -0.324967, -0.319959],
       [ 0.958585,  0.134218, -0.093391, -0.233192]])
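
As an aside, the rows of components_ form an orthonormal set: each row has length 1 and is orthogonal to the others. A quick check (not part of the original analysis):

# The matrix of dot products between component rows should be the
# identity matrix, confirming the rows are orthogonal unit vectors.
np.allclose(pca.components_ @ pca.components_.T, np.eye(4))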

Review the scree-plot of the singular values, showing the relative contribution of each component:

plt.plot(pca.singular_values_)
[Scree plot: line plot of the PCA singular values against component number.]
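
Another common way to read a scree-plot is in terms of the proportion of variance each component explains; Scikit-learn exposes this as the explained_variance_ratio_ attribute (shown here as a sketch, output omitted):

# Proportion of the total variance explained by each component.
pca.explained_variance_ratio_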

We can calculate the component scores for each row — the projection of each row onto each component — by using the fit_transform method of the PCA object:

comp_vals = pca.fit_transform(X)
comp_vals
array([[ 4.885102, -0.949893,  1.316577, -0.44375 ],
       [ 2.314803,  2.354588,  0.364081,  0.506204],
       [ 5.211402,  1.339186, -0.33846 , -0.416269],
       [ 1.091745,  0.5195  , -0.478962, -0.387058],
       [ 1.509396,  0.247471,  0.178209, -0.141586],
       [ 8.670343, -1.625328, -0.7396  ,  0.447318],
       [-0.853819, -0.589592,  0.539047,  0.666987],
       [-0.789195,  0.171897,  0.305088,  0.445527],
       [-2.274588, -0.086226,  0.035262, -0.101522],
       [-1.31331 , -0.407973, -0.022812, -0.397283],
       [-1.053835,  0.15501 ,  0.434515, -0.011451],
       [-1.112012,  0.134205, -0.260818, -0.058728],
       [-3.420161, -0.259338, -0.349634, -0.116371],
       [-0.140709, -0.665132, -0.05001 ,  0.032668],
       [-1.937029, -0.105797, -0.84301 ,  0.271394],
       [ 0.893833,  0.785698, -0.225254, -0.144854],
       [-2.666776, -0.942176, -0.100787, -0.215631],
       [-2.68459 , -0.439447,  0.004598, -0.078675],
       [-2.915953,  0.472992, -0.047468, -0.043406],
       [-3.414648, -0.109644,  0.279437,  0.186487]])

If you have followed the PCA introduction page, you will know you can also get these scores by matrix multiplication:

# Getting component values by using matrix multiplication.
demeaned_X = X - np.mean(X, axis=0)
demeaned_X.dot(pca.components_.T)
           0         1         2         3
0   4.885102 -0.949893  1.316577 -0.443750
1   2.314803  2.354588  0.364081  0.506204
2   5.211402  1.339186 -0.338460 -0.416269
3   1.091745  0.519500 -0.478962 -0.387058
4   1.509396  0.247471  0.178209 -0.141586
5   8.670343 -1.625328 -0.739600  0.447318
6  -0.853819 -0.589592  0.539047  0.666987
7  -0.789195  0.171897  0.305088  0.445527
8  -2.274588 -0.086226  0.035262 -0.101522
9  -1.313310 -0.407973 -0.022812 -0.397283
10 -1.053835  0.155010  0.434515 -0.011451
11 -1.112012  0.134205 -0.260818 -0.058728
12 -3.420161 -0.259338 -0.349634 -0.116371
13 -0.140709 -0.665132 -0.050010  0.032668
14 -1.937029 -0.105797 -0.843010  0.271394
15  0.893833  0.785698 -0.225254 -0.144854
16 -2.666776 -0.942176 -0.100787 -0.215631
17 -2.684590 -0.439447  0.004598 -0.078675
18 -2.915953  0.472992 -0.047468 -0.043406
19 -3.414648 -0.109644  0.279437  0.186487
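
As a quick check (not in the original analysis), the matrix multiplication above should reproduce the fit_transform scores to floating-point precision:

# The matrix-multiplication scores should match fit_transform's output.
# This should evaluate to True.
np.allclose(demeaned_X.dot(pca.components_.T), comp_vals)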

We can use these four columns of component scores instead of the original four columns as regressors. Think of the component scores (values) as a re-expression of the original four columns that keeps all the same information; they are just a rotation of the demeaned original columns. When we fit this model, the component scores will get their own coefficients, but, because those columns contain the same information as the original columns, the resulting fitted values will be the same as for the original model (within the precision of the calculations).

comp_reg = LinearRegression().fit(comp_vals, y)
comp_fitted = comp_reg.predict(comp_vals)
# The columns and parameters are different, but the fitted values
# are the same.
np.allclose(comp_fitted, fitted)
True
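
In fact the component scores retain all of the original information. As a sketch (using the mean_ attribute of the fitted PCA object), we can rebuild the original columns exactly from the scores and the components:

# Project the scores back through the components to recover the
# demeaned data, then add the column means back in.
reconstructed = comp_vals @ pca.components_ + pca.mean_
# This should evaluate to True.
np.allclose(reconstructed, X)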

Now that we’ve seen the scree-plot, we might conclude that only the first component is important, and that we can discard the others. We then include just the first component values in a new model.

comp1_vals = comp_vals[:, :1]  # A 2D array with one column
comp1_vals
array([[ 4.885102],
       [ 2.314803],
       [ 5.211402],
       [ 1.091745],
       [ 1.509396],
       [ 8.670343],
       [-0.853819],
       [-0.789195],
       [-2.274588],
       [-1.31331 ],
       [-1.053835],
       [-1.112012],
       [-3.420161],
       [-0.140709],
       [-1.937029],
       [ 0.893833],
       [-2.666776],
       [-2.68459 ],
       [-2.915953],
       [-3.414648]])
comp1_reg = LinearRegression().fit(comp1_vals, y)
comp1_reg.coef_
array([6.833448])

The model fit, as measured by \(R^2\) for example, is not quite as good, but we have recovered most of the \(R^2\) using this single regressor instead of the original four.

comp1_fitted = comp1_reg.predict(comp1_vals)
r2_score(y, comp1_fitted)
0.42171885321184366
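
To get a feel for this single-regressor model, we might plot goal difference against the first component score, along with the fitted line (a sketch, not part of the original analysis):

# Goal difference as a function of the first component score.
plt.scatter(comp1_vals, y, label='teams')
fine_x = np.linspace(comp1_vals.min(), comp1_vals.max(), 100)
plt.plot(fine_x, comp1_reg.predict(fine_x[:, None]), ':', label='fitted line')
plt.xlabel('First component score')
plt.ylabel('Goal difference')
plt.legend()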

We could improve the fit a little by adding the second component:

comp12_reg = LinearRegression().fit(comp_vals[:, :2], y)
r2_score(y, comp12_reg.predict(comp_vals[:, :2]))
0.525326169266535

And a little more with the third:

comp123_reg = LinearRegression().fit(comp_vals[:, :3], y)
r2_score(y, comp123_reg.predict(comp_vals[:, :3]))
0.6484639642971903

Until we get the full \(R^2\) with all four components (which gives the same underlying model as using the original four regressors):

comp_reg = LinearRegression().fit(comp_vals, y)
r2_score(y, comp_reg.predict(comp_vals))
0.6593261703105762
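
As a summary of the steps above, here is a sketch of a loop that refits the model using the first k component scores and prints the corresponding \(R^2\) values; the numbers should match those shown above:

# R^2 for models using the first k component scores, k = 1 through 4.
for k in range(1, 5):
    sub_vals = comp_vals[:, :k]
    reg = LinearRegression().fit(sub_vals, y)
    print(k, r2_score(y, reg.predict(sub_vals)))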