Expressing the linear model with matrices

Expressing the linear model with matrices#

This page is about the notation for linear models.

The linear model can be expressed in terms of matrices. This makes it:

Easier to write equations for the model
Gives a mathematical expression that is easy to solve with matrix functions.

For more detail on the General Linear Model, see The general linear model and fMRI: Does love last forever?.

# Import numerical, data and plotting libraries
import numpy as np
import numpy.linalg as npl
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Only show 6 decimals when printing
np.set_printoptions(precision=6)

<Token var=<ContextVar name='format_options' default={'edgeitems': 3, 'threshold': 1000, 'floatmode': 'maxprec', 'precision': 8, 'suppress': False, 'linewidth': 75, 'nanstr': 'nan', 'infstr': 'inf', 'sign': '-', 'formatter': None, 'legacy': 9223372036854775807, 'override_repr': None} at 0x7f58e0bcf1a0> at 0x7f58a6724280>

This page starts with the model for simple regression. We then show how the simple regression gets expressed in a design matrix. Once we have that, it’s easy to extend simple regression to multiple regression. By adding some specially formed regressors, we can also express group membership.

Simple regression#

We return to the example in linear regression notation, predicting “prestige” scores for professions, from various other measures.

df = pd.read_csv('data/Duncan_Occupational_Prestige_Reduced.csv')
df

	name	type	income	education	prestige
0	lawyer	prof	76	98	89
1	banker	prof	78	82	92
2	machine.operator	bc	21	20	24
3	soda.clerk	bc	12	30	6
4	bartender	bc	16	28	7
5	waiter	bc	8	32	10
6	policeman	bc	34	47	41
7	teacher	prof	48	91	73
8	professor	prof	64	93	93
9	shoe.shiner	bc	9	17	3
10	machinist	bc	36	32	57
11	minister	prof	21	84	87
12	pilot	prof	72	76	83
13	electrician	bc	47	39	53
14	store.manager	prof	42	44	45

Let’s start off trying to predict prestige with income:

prestige = df['prestige']
income = df['income']

plt.plot(income, prestige, '+')
plt.xlabel('Income')
plt.ylabel('Prestige');

_images/1f57592dea00a093d8851676739e2eaaa652e632c337490a8606e40a8f2104ff.png

Let’s immediately rename the values we are predicting to y, and the values we are predicting with to x:

y = prestige
x = income

Set n to the number of observations (observational units, rows):

n = len(prestige)
n

For this page we will start with a preliminary guess for a straight line that links the income scores to the prestige scores. It looks like the prestige scores go up by about 50 as the income scores rise from 10 to 50. The matching intercept looks like it could be around 5.

b = 50 / 40
c = 5

We use this line to predict the prestige scores from the income scores.

predicted = b * income + c
predicted

   100.00
   102.50
    31.25
    20.00
    25.00
    15.00
    47.50
    65.00
    85.00
    16.25
   50.00
   31.25
   95.00
   63.75
   57.50
Name: income, dtype: float64

# Plot the data and the predictions
plt.plot(x, y, '+', label='Actual data')
plt.plot(x, predicted, 'ro', label='Predictions')
plt.xlabel('Income')
plt.ylabel('Prestige score')
plt.title('Income vs Prestige with guessed line')
plt.legend()

<matplotlib.legend.Legend at 0x7f58e0b96250>

_images/b61461d1b4e3eb73fdf1e5318ef3ea273b866b5536eeaf4eb1e3e777ec72a656.png

As you saw in regression notation, we can write out this model more formally and more generally in mathematical symbols.

\(\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}\)

We write our y (prestige) data \(\yvec\) — a vector with 15 values, one for each profession.

\(y_1\) is the value for the first profession and \(y_i\) is the value for profession \(i\) where \(i \in 1 .. 15\):

\[\begin{split}\displaystyle \vec{y} = \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}89\\92\\24\\6\\7\\10\\41\\73\\93\\3\\57\\87\\83\\53\\45\end{matrix}\right]\end{split}\]

x (the income score) is a predictor. Lets call the income scores \(\xvec\). \(\xvec\) is another vector with 15 values. \(x_1\) is the value for the first profession and \(x_i\) is the value for profession \(i\) where \(i \in 1 .. 15\).

\[\begin{split}\displaystyle \vec{x} = \left[\begin{matrix}{x}_{1}\\{x}_{2}\\{x}_{3}\\{x}_{4}\\{x}_{5}\\{x}_{6}\\{x}_{7}\\{x}_{8}\\{x}_{9}\\{x}_{10}\\{x}_{11}\\{x}_{12}\\{x}_{13}\\{x}_{14}\\{x}_{15}\end{matrix}\right] = \left[\begin{matrix}76\\78\\21\\12\\16\\8\\34\\48\\64\\9\\36\\21\\72\\47\\42\end{matrix}\right]\end{split}\]

In regression notation, we wrote our straight line model as:

\(y_i = bx_i + c + e_i\)

Simple regression in matrix form#

It turns out it will be useful to rephrase the simple regression model in matrix form. Let’s make the data and predictor and errors into vectors.

We already the \(y\) values in our \(\yvec\) vector above, and the \(x\) values in the \(\xvec\) vector.

\(\evec\) is the vector of as-yet-unknown errors \(e_1, e_2, ... e_{15}\). The values of the errors depend on the predicted values, and therefore, on our slope \(b\) and intercept \(c\).

We calculate the errors as:

e = y - predicted

We write the errors as an error vector:

\[\begin{split}\displaystyle \vec{\varepsilon} = \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

Now we can rephrase our model as:

\[ \yvec = b \xvec + c + \evec \]

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = b \left[\begin{matrix}{x}_{1}\\{x}_{2}\\{x}_{3}\\{x}_{4}\\{x}_{5}\\{x}_{6}\\{x}_{7}\\{x}_{8}\\{x}_{9}\\{x}_{10}\\{x}_{11}\\{x}_{12}\\{x}_{13}\\{x}_{14}\\{x}_{15}\end{matrix}\right] + c + \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

Confirm this is true when we calculate on our particular values:

# Confirm that y is close as dammit to b * x + c + e
assert np.allclose(y, b * x + c + e)

Bear with with us for a little trick. We define \(\vec{o}\) as a vector of ones, like this:

\[\begin{split}\displaystyle \vec{o} = \left[\begin{matrix}1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\end{matrix}\right]\end{split}\]

In code, in our case:

o = np.ones(n)

Now we can rewrite the formula as:

\[ \yvec = b\xvec + c\vec{o} + \evec \]

because \(o_i = 1\) and so \(co_i = c\).

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = b \left[\begin{matrix}{x}_{1}\\{x}_{2}\\{x}_{3}\\{x}_{4}\\{x}_{5}\\{x}_{6}\\{x}_{7}\\{x}_{8}\\{x}_{9}\\{x}_{10}\\{x}_{11}\\{x}_{12}\\{x}_{13}\\{x}_{14}\\{x}_{15}\end{matrix}\right] + c \left[\begin{matrix}1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\\1\end{matrix}\right] + \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

This evaluates to the result we intend:

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}c + b {x}_{1} + {e}_{1}\\c + b {x}_{2} + {e}_{2}\\c + b {x}_{3} + {e}_{3}\\c + b {x}_{4} + {e}_{4}\\c + b {x}_{5} + {e}_{5}\\c + b {x}_{6} + {e}_{6}\\c + b {x}_{7} + {e}_{7}\\c + b {x}_{8} + {e}_{8}\\c + b {x}_{9} + {e}_{9}\\c + b {x}_{10} + {e}_{10}\\c + b {x}_{11} + {e}_{11}\\c + b {x}_{12} + {e}_{12}\\c + b {x}_{13} + {e}_{13}\\c + b {x}_{14} + {e}_{14}\\c + b {x}_{15} + {e}_{15}\end{matrix}\right]\end{split}\]

# Confirm that y is close as dammit to b * x + c * o + e
assert np.allclose(y, b * x + c * o + e)

\(\newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}\)

We can now rephrase the calculation in terms of matrix multiplication.

Call \(\Xmat\) the matrix of two columns, where the first column is \(\xvec\) and the second column is the column of ones (\(\vec{o}\) above).

In code, for our case:

X = np.stack([x, o], axis=1)
X

array([[76.,  1.],
       [78.,  1.],
       [21.,  1.],
       [12.,  1.],
       [16.,  1.],
       [ 8.,  1.],
       [34.,  1.],
       [48.,  1.],
       [64.,  1.],
       [ 9.,  1.],
       [36.,  1.],
       [21.,  1.],
       [72.,  1.],
       [47.,  1.],
       [42.,  1.]])

Call \(\bvec\) the column vector:

\[\begin{split} \left[ \begin{array}{\bvec} b \\ c \\ \end{array} \right] \end{split}\]

In code:

B = np.array([b, c])
B

array([1.25, 5.  ])

This gives us exactly the same calculation as \(\yvec = b \xvec + c + \evec\) above, but in terms of matrix multiplication:

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}{x}_{1} & 1\\{x}_{2} & 1\\{x}_{3} & 1\\{x}_{4} & 1\\{x}_{5} & 1\\{x}_{6} & 1\\{x}_{7} & 1\\{x}_{8} & 1\\{x}_{9} & 1\\{x}_{10} & 1\\{x}_{11} & 1\\{x}_{12} & 1\\{x}_{13} & 1\\{x}_{14} & 1\\{x}_{15} & 1\end{matrix}\right] \left[\begin{matrix}b\\c\end{matrix}\right] + \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

In symbols:

\[ \yvec = \Xmat \bvec + \evec \]

Because of the way that matrix multiplication works, this again gives us the intended result:

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}c + b {x}_{1} + {e}_{1}\\c + b {x}_{2} + {e}_{2}\\c + b {x}_{3} + {e}_{3}\\c + b {x}_{4} + {e}_{4}\\c + b {x}_{5} + {e}_{5}\\c + b {x}_{6} + {e}_{6}\\c + b {x}_{7} + {e}_{7}\\c + b {x}_{8} + {e}_{8}\\c + b {x}_{9} + {e}_{9}\\c + b {x}_{10} + {e}_{10}\\c + b {x}_{11} + {e}_{11}\\c + b {x}_{12} + {e}_{12}\\c + b {x}_{13} + {e}_{13}\\c + b {x}_{14} + {e}_{14}\\c + b {x}_{15} + {e}_{15}\end{matrix}\right]\end{split}\]

We write matrix multiplication in Numpy as @:

# Confirm that y is close as dammit to X @ B + e
assert np.allclose(y, X @ B + e)

We still haven’t found our best fitting line. But before we go further, it might be clear that we can easily add a new predictor here.

Multiple regression#

We might also believe that prestige increases with the education level associated with that profession.

Let us add education level as another predictor.

education = df['education']

Now rename the income predictor vector from \(\xvec\) to \(\xvec_1\).

# income (our previous x) is the first column we will use to predict.
x_1 = income

Of course \(\xvec_1\) has 15 values \(x_{1, 1}..x_{1, 15}\). Call the education predictor vector \(\xvec_2\).

# education is the second column we use to predict
x_2 = education

Call the slope for income \(b_1\) (slope for \(\xvec_1\)). Call the slope for education \(b_2\) (slope for \(\xvec_2\)). Our model is:

\[ y_i = b_1 x_{1, i} + b_2 x_{2, i} + c + e_i \]

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = b_{1} \left[\begin{matrix}{x_{1,}}_{1}\\{x_{1,}}_{2}\\{x_{1,}}_{3}\\{x_{1,}}_{4}\\{x_{1,}}_{5}\\{x_{1,}}_{6}\\{x_{1,}}_{7}\\{x_{1,}}_{8}\\{x_{1,}}_{9}\\{x_{1,}}_{10}\\{x_{1,}}_{11}\\{x_{1,}}_{12}\\{x_{1,}}_{13}\\{x_{1,}}_{14}\\{x_{1,}}_{15}\end{matrix}\right] + b_{2} \left[\begin{matrix}{x_{2,}}_{1}\\{x_{2,}}_{2}\\{x_{2,}}_{3}\\{x_{2,}}_{4}\\{x_{2,}}_{5}\\{x_{2,}}_{6}\\{x_{2,}}_{7}\\{x_{2,}}_{8}\\{x_{2,}}_{9}\\{x_{2,}}_{10}\\{x_{2,}}_{11}\\{x_{2,}}_{12}\\{x_{2,}}_{13}\\{x_{2,}}_{14}\\{x_{2,}}_{15}\end{matrix}\right] + c + \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

In this model \(\Xmat\) has three columns (\(\xvec_1\), \(\xvec_2\) and ones), and the \(\bvec\) vector has three values \(b_1, b_2, c\). This gives the same matrix formulation, with our new \(\Xmat\) and \(\bvec\): \(\yvec = \Xmat \bvec + \evec\).

This is a linear model because our model says that the data \(y_i\) comes from the sum of some components (\(b_1 x_{1, i}, b_2 x_{2, i}, c, e_i\)).

We can keep doing this by adding more and more regressors. In general, a linear model with \(p\) predictors looks like this:

\[ y_i = b_1 x_{1, i} + b_2 x_{2, i} + ... b_p x_{p, i} + e_i \]

In the case of the models above, the final predictor \(\xvec_p\) would be a column of ones, to express the intercept in the model.

Any model of the form above can still be phrased in the matrix form:

\[ \yvec = \Xmat \bvec + \evec \]

where:

\[\begin{split}\displaystyle \boldsymbol{X} = \left[\begin{matrix}{x_{1,}}_{1} & {x_{2,}}_{1} & 1\\{x_{1,}}_{2} & {x_{2,}}_{2} & 1\\{x_{1,}}_{3} & {x_{2,}}_{3} & 1\\{x_{1,}}_{4} & {x_{2,}}_{4} & 1\\{x_{1,}}_{5} & {x_{2,}}_{5} & 1\\{x_{1,}}_{6} & {x_{2,}}_{6} & 1\\{x_{1,}}_{7} & {x_{2,}}_{7} & 1\\{x_{1,}}_{8} & {x_{2,}}_{8} & 1\\{x_{1,}}_{9} & {x_{2,}}_{9} & 1\\{x_{1,}}_{10} & {x_{2,}}_{10} & 1\\{x_{1,}}_{11} & {x_{2,}}_{11} & 1\\{x_{1,}}_{12} & {x_{2,}}_{12} & 1\\{x_{1,}}_{13} & {x_{2,}}_{13} & 1\\{x_{1,}}_{14} & {x_{2,}}_{14} & 1\\{x_{1,}}_{15} & {x_{2,}}_{15} & 1\end{matrix}\right]\end{split}\]

In code in our case:

# Design X in values
X_3_cols = np.stack([x_1, x_2, o], axis=1)
X_3_cols

array([[76., 98.,  1.],
       [78., 82.,  1.],
       [21., 20.,  1.],
       [12., 30.,  1.],
       [16., 28.,  1.],
       [ 8., 32.,  1.],
       [34., 47.,  1.],
       [48., 91.,  1.],
       [64., 93.,  1.],
       [ 9., 17.,  1.],
       [36., 32.,  1.],
       [21., 84.,  1.],
       [72., 76.,  1.],
       [47., 39.,  1.],
       [42., 44.,  1.]])

In symbols:

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}{x_{1,}}_{1} & {x_{2,}}_{1} & 1\\{x_{1,}}_{2} & {x_{2,}}_{2} & 1\\{x_{1,}}_{3} & {x_{2,}}_{3} & 1\\{x_{1,}}_{4} & {x_{2,}}_{4} & 1\\{x_{1,}}_{5} & {x_{2,}}_{5} & 1\\{x_{1,}}_{6} & {x_{2,}}_{6} & 1\\{x_{1,}}_{7} & {x_{2,}}_{7} & 1\\{x_{1,}}_{8} & {x_{2,}}_{8} & 1\\{x_{1,}}_{9} & {x_{2,}}_{9} & 1\\{x_{1,}}_{10} & {x_{2,}}_{10} & 1\\{x_{1,}}_{11} & {x_{2,}}_{11} & 1\\{x_{1,}}_{12} & {x_{2,}}_{12} & 1\\{x_{1,}}_{13} & {x_{2,}}_{13} & 1\\{x_{1,}}_{14} & {x_{2,}}_{14} & 1\\{x_{1,}}_{15} & {x_{2,}}_{15} & 1\end{matrix}\right] \left[\begin{matrix}b_{1}\\b_{2}\\c\end{matrix}\right] + \left[\begin{matrix}{e}_{1}\\{e}_{2}\\{e}_{3}\\{e}_{4}\\{e}_{5}\\{e}_{6}\\{e}_{7}\\{e}_{8}\\{e}_{9}\\{e}_{10}\\{e}_{11}\\{e}_{12}\\{e}_{13}\\{e}_{14}\\{e}_{15}\end{matrix}\right]\end{split}\]

As we were hoping, this evaluates to:

\[\begin{split}\displaystyle \left[\begin{matrix}{y}_{1}\\{y}_{2}\\{y}_{3}\\{y}_{4}\\{y}_{5}\\{y}_{6}\\{y}_{7}\\{y}_{8}\\{y}_{9}\\{y}_{10}\\{y}_{11}\\{y}_{12}\\{y}_{13}\\{y}_{14}\\{y}_{15}\end{matrix}\right] = \left[\begin{matrix}c + b_{1} {x_{1,}}_{1} + b_{2} {x_{2,}}_{1} + {e}_{1}\\c + b_{1} {x_{1,}}_{2} + b_{2} {x_{2,}}_{2} + {e}_{2}\\c + b_{1} {x_{1,}}_{3} + b_{2} {x_{2,}}_{3} + {e}_{3}\\c + b_{1} {x_{1,}}_{4} + b_{2} {x_{2,}}_{4} + {e}_{4}\\c + b_{1} {x_{1,}}_{5} + b_{2} {x_{2,}}_{5} + {e}_{5}\\c + b_{1} {x_{1,}}_{6} + b_{2} {x_{2,}}_{6} + {e}_{6}\\c + b_{1} {x_{1,}}_{7} + b_{2} {x_{2,}}_{7} + {e}_{7}\\c + b_{1} {x_{1,}}_{8} + b_{2} {x_{2,}}_{8} + {e}_{8}\\c + b_{1} {x_{1,}}_{9} + b_{2} {x_{2,}}_{9} + {e}_{9}\\c + b_{1} {x_{1,}}_{10} + b_{2} {x_{2,}}_{10} + {e}_{10}\\c + b_{1} {x_{1,}}_{11} + b_{2} {x_{2,}}_{11} + {e}_{11}\\c + b_{1} {x_{1,}}_{12} + b_{2} {x_{2,}}_{12} + {e}_{12}\\c + b_{1} {x_{1,}}_{13} + b_{2} {x_{2,}}_{13} + {e}_{13}\\c + b_{1} {x_{1,}}_{14} + b_{2} {x_{2,}}_{14} + {e}_{14}\\c + b_{1} {x_{1,}}_{15} + b_{2} {x_{2,}}_{15} + {e}_{15}\end{matrix}\right]\end{split}\]

Population, sample, estimate#

\(\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}\) Our professions and their prestige scores are a sample from the population of all professions’ prestige scores. The parameters \(\bvec\) are the parameters that fit the design \(\Xmat\) to the population scores. We only have a sample from this population, so we cannot get the true population \(\bvec\) vector, we can only estimate \(\bvec\) from our sample. We will write this sample estimate as \(\bhat\) to distinguish it from the true population parameters \(\bvec\).

Solving the model with matrix algebra#

The reason to formulate our problem with matrices is so we can use some basic matrix algebra to estimate the “best” line.

Let’s assume that we want an estimate for the line parameters (intercept and slope) that gives the smallest “distance” between the estimated values (predicted from the line), and the actual values (the data).

We’ll define ‘distance’ as the squared difference of the predicted value from the actual value. These are the squared error terms \(e_1^2, e_2^2 ... e_{n}^2\), where \(n\) is the number of observations; 15 in our case.

As a reminder: our model is:

\[ \yvec = \Xmat \bvec + \evec \]

Where \(\yvec\) is the data vector \(y_1, y_2, ... y_n\), \(\Xmat\) is the design matrix of shape \(n, p\), \(\bvec\) is the parameter vector, \(b_1, b_2 ... b_p\), and \(\evec\) is the error vector giving errors for each observation \(\epsilon_1, \epsilon_2 ... \epsilon_n\).

Each column of \(\Xmat\) is a regressor vector, so \(\Xmat\) can be thought of as the column concatenation of \(p\) vectors \(\xvec_1, \xvec_2 ... \xvec_p\), where \(\xvec_1\) is the first regressor vector, and so on.

In our case, we want an estimate \(\bhat\) for the vector \(\bvec\) such that the errors \(\evec = \yvec - \Xmat \bhat\) have the smallest sum of squares \(\sum_{i=1}^n{e_i^2}\). \(\sum_{i=1}^n{e_i^2}\) is called the residual sum of squares.

When we have our \(\bhat\) estimate, then the prediction of the data from the estimate is given by \(\Xmat \bhat\).

We call this the predicted or estimated data, and write it as \(\yhat\). The errors are then given by \(\yvec - \yhat\).

We might expect that, when we have found the right \(\bhat\) then the errors will have nothing in them that can still be explained by the design matrix \(\Xmat\). This is the same as saying that, when we have best prediction of the data (\(\yhat = \Xmat \bhat\)), the design matrix \(\Xmat\) should be orthogonal to the remaining error (\(\yvec - \yhat\)).

For those of you who have learned about vectors in mathematics, you may remember that we say that two vectors are orthogonal if they have a dot product of zero.

If the design is orthogonal to the errors, then \(\Xmat^T \evec\) should be a vector of zeros.

If that is the case then we can multiply \(\yvec = \Xmat \bhat + \evec\) through by \(\Xmat^T\):

\[ \Xmat^T \yvec = \Xmat^T \Xmat \bhat + \Xmat^T \evec \]

The last term now disappears because it is zero and:

\[ \Xmat^T \yvec = \Xmat^T \Xmat \bhat \]

If \(\Xmat^T \Xmat\) is invertible (has a matrix inverse \((\Xmat^T \Xmat)^{-1}\)) then there is a unique solution:

\[ \bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \]

It turns out that, if \(\Xmat^T \Xmat\) is not invertible, there are an infinite number of solutions, and we have to choose one solution, taking into account that the parameters \(\bhat\) will depend on which solution we chose. The pseudoinverse operator gives us one particular solution. If \(\bf{A}^+\) is the pseudoinverse of matrix \(\bf{A}\) then the general solution for \(\bhat\), even when \(\Xmat^T \Xmat\) is not invertible, is:

\[ \bhat = \Xmat^+ \yvec \]

Using this matrix algebra, what line do we estimate for prestige and income?

X = np.column_stack((income, np.ones(n)))
X

array([[76.,  1.],
       [78.,  1.],
       [21.,  1.],
       [12.,  1.],
       [16.,  1.],
       [ 8.,  1.],
       [34.,  1.],
       [48.,  1.],
       [64.,  1.],
       [ 9.,  1.],
       [36.,  1.],
       [21.,  1.],
       [72.,  1.],
       [47.,  1.],
       [42.,  1.]])

# Use the pseudoinverse to get estimated B
B = npl.pinv(X) @ prestige
B

array([1.181791, 4.855621])

# Plot the data
plt.plot(income, prestige, '+')

[<matplotlib.lines.Line2D at 0x7f58a2a0c310>]

_images/be51597098698965470eec6cfd34ab4a92e61400bb338d4c655d6554137dcfa9.png

best_slope = B[0]
best_intercept = B[1]
best_predictions = best_intercept + best_slope * income

# Plot the new prediction
plt.plot(income, prestige, '+', label='Actual data')
plt.plot(income, best_predictions, 'ro', label='Best predictions')
plt.xlabel('Income')
plt.ylabel('Prestige score')
plt.title('Income vs Prestige with best line')
plt.legend();

_images/978ceea808a86c316f2ad62bfd531702b23959fff42081ce9cdcd697e511ffb7.png

Our claim was that this estimate for slope and intercept minimize the sum (and therefore mean) of squared error:

fitted = X @ B
errors = prestige - fitted
print(np.sum(errors ** 2))

4590.079332315554

Is this sum of squared errors smaller than our earlier guess of an intercept of 10 and a slope of 0.9?

fitted = X @ [0.9, 10]
errors = prestige - fitted
print(np.sum(errors ** 2))

5778.959999999999

Displaying the design matrix as an image#

We can show the design as an image, by scaling the values with columns.

We scale within columns because we care more about seeing variation within the regressor than between regressors. For example, if we have a regressor varying between 0 and 1, and another between 0 and 1000, without scaling, the column with the larger numbers will swamp the variation in the column with the smaller numbers.

def scale_design_mtx(X):
    """utility to scale the design matrix for display

    This scales the columns to their own range so we can see the variations
    across the column for all the columns, regardless of the scaling of the
    column.
    """
    mi, ma = np.min(X, axis=0), np.max(X, axis=0)
    # Vector that is True for columns where values are not
    # all almost equal to each other
    col_neq = (ma - mi) > 1.e-8
    Xs = np.ones_like(X)
    # Leave columns with same value throughout with 1s
    # Scale other columns to min, max in column
    mi = mi[col_neq]
    ma = ma[col_neq]
    Xs[:,col_neq] = (X[:,col_neq] - mi)/(ma - mi)
    return Xs

Then we can display the scaled design with a title and some default image display parameters, to see our design at a glance.

plt.imshow(scale_design_mtx(X), cmap='gray')
plt.title('Simple regression model');

_images/778a3ea05ff4ecddae66e1d176352840cbe2fcc0e141c89508af26c753d463ab.png

Expressing the linear model with matrices

Contents

Expressing the linear model with matrices#

Simple regression#

Simple regression in matrix form#

Multiple regression#

Population, sample, estimate#

Solving the model with matrix algebra#

Displaying the design matrix as an image#