Dimension reduction with Principal Component Analysis

See the PCA introduction for background.

Start by loading the libraries we need, and doing some configuration:

import numpy as np
import numpy.linalg as npl
# Display array values to 6 digits of precision
np.set_printoptions(precision=6, suppress=True)

import pandas as pd
pd.set_option('mode.copy_on_write', True)

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

As in the PCA introduction page, we will use the English Premier League (EPL) table data for this example. See the EPL dataset page for more on this dataset.

df = pd.read_csv('data/premier_league_2021.csv')
df.head()
   rank               team  played  won  drawn  lost  for  against  goal_difference  points  wages_year  keeper  defense  midfield  forward
0     1    Manchester City      38   29      6     3   99       26               73      93      168572    8892    60320     24500    74860
1     2          Liverpool      38   28      8     2   94       26               68      92      148772   14560    46540     47320    40352
2     3            Chelsea      38   21     11     6   76       33               43      74      187340   12480    51220     51100    72540
3     4  Tottenham Hotspur      38   22      5    11   69       40               29      71      110416    6760    29516     30680    43460
4     5            Arsenal      38   22      3    13   61       48               13      69      118074    8400    37024     27300    45350

We want to predict goal difference from spending on the various position types.

We have spending data for the keeper, defense, midfield and forward positions.

y = df['goal_difference']
# Divide spending by 10,000 GBP to make numbers easier to read.
X = df[['keeper', 'defense', 'midfield', 'forward']] / 10_000
X.head()
   keeper  defense  midfield  forward
0  0.8892   6.0320     2.450   7.4860
1  1.4560   4.6540     4.732   4.0352
2  1.2480   5.1220     5.110   7.2540
3  0.6760   2.9516     3.068   4.3460
4  0.8400   3.7024     2.730   4.5350
basic_reg = LinearRegression().fit(X, y)
basic_reg.coef_
array([-15.502165,  24.888585,   6.337194,  -5.453179])

How good is the model fit?

fitted = basic_reg.predict(X)
fitted
array([ 58.702936,  48.899144,  48.61422 ,   6.380567,  19.351939,
        15.741829,  -7.189155,  -0.542271, -14.657475, -10.357978,
         5.580663, -11.703874, -33.987123, -10.908234, -38.285969,
        12.001697, -30.212118, -22.902239, -14.680438, -19.84612 ])
full_r2 = r2_score(y, fitted)
full_r2
0.6593261703105764

It turns out the variables (features) are highly correlated with one another.

X.corr()
            keeper   defense  midfield   forward
keeper    1.000000  0.770622  0.653607  0.836858
defense   0.770622  1.000000  0.792146  0.918498
midfield  0.653607  0.792146  1.000000  0.718424
forward   0.836858  0.918498  0.718424  1.000000

We want to reduce these four variables down to fewer variables, while still capturing the same information.

Let us first run PCA on the four columns:

pca = PCA()
pca = pca.fit(X)

# Scikit-learn's components.
pca.components_
array([[ 0.162258,  0.467464,  0.330791,  0.803571],
       [-0.044097,  0.155829,  0.881052, -0.444433],
       [-0.229876,  0.859756, -0.324967, -0.319959],
       [ 0.958585,  0.134218, -0.093391, -0.233192]])
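
As an aside, the rows of components_ form an orthonormal set: each row has length 1 and is orthogonal to the others. A quick check (not part of the original analysis):

# The matrix of dot products between component rows should be the
# identity matrix, confirming the rows are orthogonal unit vectors.
np.allclose(pca.components_ @ pca.components_.T, np.eye(4))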

Review the scree-plot of the singular values, showing the relative contribution of each component:

plt.plot(pca.singular_values_)
[Scree plot: line plot of the PCA singular values against component number.]
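
Another common way to read a scree-plot is in terms of the proportion of variance each component explains; Scikit-learn exposes this as the explained_variance_ratio_ attribute (shown here as a sketch, output omitted):

# Proportion of the total variance explained by each component.
pca.explained_variance_ratio_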

We can calculate the component scores for each row — the projection of each row onto each component — by using the fit_transform method of the PCA object:

comp_vals = pca.fit_transform(X)
comp_vals
array([[ 4.885102, -0.949893,  1.316577, -0.44375 ],
       [ 2.314803,  2.354588,  0.364081,  0.506204],
       [ 5.211402,  1.339186, -0.33846 , -0.416269],
       [ 1.091745,  0.5195  , -0.478962, -0.387058],
       [ 1.509396,  0.247471,  0.178209, -0.141586],
       [ 8.670343, -1.625328, -0.7396  ,  0.447318],
       [-0.853819, -0.589592,  0.539047,  0.666987],
       [-0.789195,  0.171897,  0.305088,  0.445527],
       [-2.274588, -0.086226,  0.035262, -0.101522],
       [-1.31331 , -0.407973, -0.022812, -0.397283],
       [-1.053835,  0.15501 ,  0.434515, -0.011451],
       [-1.112012,  0.134205, -0.260818, -0.058728],
       [-3.420161, -0.259338, -0.349634, -0.116371],
       [-0.140709, -0.665132, -0.05001 ,  0.032668],
       [-1.937029, -0.105797, -0.84301 ,  0.271394],
       [ 0.893833,  0.785698, -0.225254, -0.144854],
       [-2.666776, -0.942176, -0.100787, -0.215631],
       [-2.68459 , -0.439447,  0.004598, -0.078675],
       [-2.915953,  0.472992, -0.047468, -0.043406],
       [-3.414648, -0.109644,  0.279437,  0.186487]])

If you have followed the PCA introduction page, you will know you can also get these scores by matrix multiplication:

# Getting component values by using matrix multiplication.
demeaned_X = X - np.mean(X, axis=0)
demeaned_X.dot(pca.components_.T)
           0         1         2         3
0   4.885102 -0.949893  1.316577 -0.443750
1   2.314803  2.354588  0.364081  0.506204
2   5.211402  1.339186 -0.338460 -0.416269
3   1.091745  0.519500 -0.478962 -0.387058
4   1.509396  0.247471  0.178209 -0.141586
5   8.670343 -1.625328 -0.739600  0.447318
6  -0.853819 -0.589592  0.539047  0.666987
7  -0.789195  0.171897  0.305088  0.445527
8  -2.274588 -0.086226  0.035262 -0.101522
9  -1.313310 -0.407973 -0.022812 -0.397283
10 -1.053835  0.155010  0.434515 -0.011451
11 -1.112012  0.134205 -0.260818 -0.058728
12 -3.420161 -0.259338 -0.349634 -0.116371
13 -0.140709 -0.665132 -0.050010  0.032668
14 -1.937029 -0.105797 -0.843010  0.271394
15  0.893833  0.785698 -0.225254 -0.144854
16 -2.666776 -0.942176 -0.100787 -0.215631
17 -2.684590 -0.439447  0.004598 -0.078675
18 -2.915953  0.472992 -0.047468 -0.043406
19 -3.414648 -0.109644  0.279437  0.186487
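
As a quick check (not in the original analysis), the matrix multiplication above should reproduce the fit_transform scores to floating-point precision:

# The matrix-multiplication scores should match fit_transform's output.
# This should evaluate to True.
np.allclose(demeaned_X.dot(pca.components_.T), comp_vals)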

We can use these four columns of component scores instead of the original four columns as regressors. Think of the component scores (values) as a re-expression of the original four columns that keeps all the same information; they are just a rotation of the demeaned original columns. When we fit this model, the component scores will get their own coefficients, but, because those columns contain the same information as the original columns, the resulting fitted values will be the same as for the original model (within the precision of the calculations).

comp_reg = LinearRegression().fit(comp_vals, y)
comp_fitted = comp_reg.predict(comp_vals)
# The columns and parameters are different, but the fitted values
# are the same.
np.allclose(comp_fitted, fitted)
True
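
In fact the component scores retain all of the original information. As a sketch (using the mean_ attribute of the fitted PCA object), we can rebuild the original columns exactly from the scores and the components:

# Project the scores back through the components to recover the
# demeaned data, then add the column means back in.
reconstructed = comp_vals @ pca.components_ + pca.mean_
# This should evaluate to True.
np.allclose(reconstructed, X)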

Now that we’ve seen the scree-plot, we might conclude that only the first component is important, and that we can discard the others. We then include just the first component values in a new model.

comp1_vals = comp_vals[:, :1]  # A 2D array with one column
comp1_vals
array([[ 4.885102],
       [ 2.314803],
       [ 5.211402],
       [ 1.091745],
       [ 1.509396],
       [ 8.670343],
       [-0.853819],
       [-0.789195],
       [-2.274588],
       [-1.31331 ],
       [-1.053835],
       [-1.112012],
       [-3.420161],
       [-0.140709],
       [-1.937029],
       [ 0.893833],
       [-2.666776],
       [-2.68459 ],
       [-2.915953],
       [-3.414648]])
comp1_reg = LinearRegression().fit(comp1_vals, y)
comp1_reg.coef_
array([6.833448])

The model fit, as measured by \(R^2\) for example, is not quite as good, but we have recovered most of the \(R^2\) using this single regressor instead of the original four.

comp1_fitted = comp1_reg.predict(comp1_vals)
r2_score(y, comp1_fitted)
0.42171885321184366
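
To get a feel for this single-regressor model, we might plot goal difference against the first component score, along with the fitted line (a sketch, not part of the original analysis):

# Goal difference as a function of the first component score.
plt.scatter(comp1_vals, y, label='teams')
fine_x = np.linspace(comp1_vals.min(), comp1_vals.max(), 100)
plt.plot(fine_x, comp1_reg.predict(fine_x[:, None]), ':', label='fitted line')
plt.xlabel('First component score')
plt.ylabel('Goal difference')
plt.legend()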

We could improve the fit a little by adding the second component:

comp12_reg = LinearRegression().fit(comp_vals[:, :2], y)
r2_score(y, comp12_reg.predict(comp_vals[:, :2]))
0.525326169266535

And a little more with the third:

comp123_reg = LinearRegression().fit(comp_vals[:, :3], y)
r2_score(y, comp123_reg.predict(comp_vals[:, :3]))
0.6484639642971903

Until we get the full \(R^2\) with all four components (which gives the same underlying model as using the original four regressors):

comp_reg = LinearRegression().fit(comp_vals, y)
r2_score(y, comp_reg.predict(comp_vals))
0.6593261703105762
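
As a summary of the steps above, here is a sketch of a loop that refits the model using the first k component scores and prints the corresponding \(R^2\) values; the numbers should match those shown above:

# R^2 for models using the first k component scores, k = 1 through 4.
for k in range(1, 5):
    sub_vals = comp_vals[:, :k]
    reg = LinearRegression().fit(sub_vals, y)
    print(k, r2_score(y, reg.predict(sub_vals)))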