Fitting models with different cost functions#

We also get on to cross-validation.

import numpy as np
np.set_printoptions(suppress=True)
from scipy.optimize import minimize
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import statsmodels.formula.api as smf
import sklearn.linear_model as sklm
import sklearn.metrics as skmetrics
df = pd.read_csv('data/rate_my_course.csv')
df
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 3.756147 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803
... ... ... ... ... ... ...
70 Anatomy 117 3.932991 3.974701 3.954188 2.863504
71 Earth Science 110 3.636182 3.671364 3.655091 3.106727
72 Linguistics 110 3.749000 3.834545 3.798182 3.309636
73 Mechanical Engineering 104 3.441923 3.531154 3.489327 2.799135
74 Medicine 102 3.927255 3.934216 3.929118 3.109118

75 rows × 6 columns

Fetch some columns of interest:

# This will be our y (the variable we predict).
helpfulness = df['Helpfulness']
# One or both of these will be our X (the predictors).
clarity_easiness = df[['Clarity', 'Easiness']]

Fit the model with Statsmodels.

sm_model = smf.ols('Helpfulness ~ Clarity + Easiness', data=df)
sm_fit = sm_model.fit()
sm_fit.summary()
OLS Regression Results
Dep. Variable: Helpfulness R-squared: 0.913
Model: OLS Adj. R-squared: 0.911
Method: Least Squares F-statistic: 378.0
Date: Tue, 03 Sep 2024 Prob (F-statistic): 6.53e-39
Time: 10:05:26 Log-Likelihood: 118.24
No. Observations: 75 AIC: -230.5
Df Residuals: 72 BIC: -223.5
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.6342 0.116 5.483 0.000 0.404 0.865
Clarity 0.8477 0.047 18.160 0.000 0.755 0.941
Easiness -0.0063 0.034 -0.183 0.855 -0.075 0.062
Omnibus: 3.832 Durbin-Watson: 1.847
Prob(Omnibus): 0.147 Jarque-Bera (JB): 3.429
Skew: 0.524 Prob(JB): 0.180
Kurtosis: 3.032 Cond. No. 103.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fit the same model with Scikit-learn.

sk_model = sklm.LinearRegression()
sk_fit = sk_model.fit(clarity_easiness, helpfulness)
sk_fit
LinearRegression()

The coefficients (the slopes for the regressors):

sk_fit.coef_
array([ 0.84765719, -0.00628188])

The intercept:

sk_fit.intercept_
np.float64(0.6342163349072085)

Compare the parameters to Statsmodels:

sm_fit.params
Intercept    0.634216
Clarity      0.847657
Easiness    -0.006282
dtype: float64

The fitted values and Scikit-learn#

The values predicted by the (Sklearn) model:

y_hat = sk_fit.predict(clarity_easiness)
y_hat
array([3.79827334, 3.57107499, 3.67581734, 3.92731675, 3.82665181,
       3.48871883, 3.89117061, 3.70003921, 3.8013664 , 3.48333495,
       3.69359333, 3.78395243, 3.47336086, 3.57017191, 3.81165212,
       3.59953202, 3.40917203, 3.74439846, 3.75529281, 3.87079006,
       3.71650243, 3.91081494, 3.84021344, 3.46140189, 4.05110024,
       3.69790129, 3.68125641, 3.73563421, 3.66975401, 3.76540619,
       3.77108961, 3.89775444, 3.81443707, 3.45401091, 3.71363585,
       3.84224058, 4.06748055, 3.80755826, 3.53711396, 3.7465973 ,
       3.68031767, 3.68119022, 3.88083408, 3.8323017 , 3.60821429,
       3.82505099, 3.69906336, 3.97423909, 3.96735067, 4.11466129,
       3.80967881, 3.60364526, 3.63462294, 3.68518199, 3.74639111,
       3.47842488, 3.76251001, 3.60481205, 3.93161439, 3.94669493,
       3.54215346, 3.96030499, 3.74421691, 3.81391267, 4.13556455,
       3.88432985, 3.80636234, 3.99835993, 3.84638216, 3.69029801,
       3.95005664, 3.69693593, 3.79129242, 3.53420337, 3.9436511 ])

Compare the fitted (\(\hat{y}\)) values to those from Statsmodels:

sm_fit.predict() - y_hat
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

Here’s a general check that the predictions are all the same, to within computational precision:

assert np.allclose(sm_fit.predict(), y_hat)

We assemble Sklearn’s coefficients and intercept into a single list, with the intercept last.

params = list(sk_fit.coef_) + [sk_fit.intercept_]
params
[np.float64(0.8476571912478371),
 np.float64(-0.006281876019590555),
 np.float64(0.6342163349072085)]

If we just want all but the last parameter (all the coefficients, but not the intercept):

# All parameters but the last (all but the intercept parameter).
params[:-1]
[np.float64(0.8476571912478371), np.float64(-0.006281876019590555)]

We could also get the parameters from Statsmodels, but we’d have to rearrange them, because Statsmodels puts the intercept first rather than last:

sm_fit.params.iloc[[1, 2, 0]]
Clarity      0.847657
Easiness    -0.006282
Intercept    0.634216
dtype: float64

Write a function to compute the fitted values given the parameters:

def calc_fitted(params, X):
    """ Calculate fitted values from design X and parameters

    Parameters
    ----------
    params : vector (1D array)
        Vector of parameters, intercept is last parameter.
    X : array
        2D array with regressor columns.

    Returns
    -------
    y_hat : vector
        Vector of fitted values
    """
    X = np.array(X)
    n, p = X.shape
    y_hat = np.zeros(n)
    for col_no, param in enumerate(params[:-1]):
        y_hat = y_hat + param * X[:, col_no]  # Add contribution from this regressor.
    y_hat = y_hat + params[-1]  # Add contribution from intercept.
    return y_hat

Show that we get the same fitted values from our function as we got from the predict method of Sklearn:

our_y_hat = calc_fitted(params, clarity_easiness)
assert np.allclose(our_y_hat, y_hat)

Calculate the error vector and then calculate the sum of squared error:

e = helpfulness - our_y_hat
np.sum(e ** 2)
np.float64(0.18759984412745978)

Make a function to calculate sum of squared error:

def sos(params, X, y):
    """ Sum of squared error for `params` given model `X` and data `y`.
    """
    y_hat = calc_fitted(params, X)
    e = y - y_hat  # residuals
    return np.sum(e ** 2)

Check that we get the same answer from the function as we got from calculating above:

sos(params, clarity_easiness, helpfulness)
np.float64(0.18759984412745978)

Use minimize to find parameters minimizing the sum of squared error:

min_res = minimize(sos, [0, 0, 0], args=(clarity_easiness, helpfulness))
min_res
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.18759984414069664
        x: [ 8.477e-01 -6.281e-03  6.342e-01]
      nit: 8
      jac: [ 3.278e-07  2.831e-07  1.155e-07]
 hess_inv: [[ 4.169e-01 -2.313e-01 -8.092e-01]
            [-2.313e-01  2.249e-01  1.414e-01]
            [-8.092e-01  1.414e-01  2.557e+00]]
     nfev: 40
     njev: 10

Yes, the parameters (coefficients and intercept) are the same, to reasonable precision, as those we got from Statsmodels and Sklearn:

min_res.x
array([ 0.84765427, -0.00628103,  0.63422445])

A short diversion — writing the mean in symbols#

For notational convenience, give our y (“Helpfulness”) vector the variable name y:

y = helpfulness

This is the usual mean of y (the “Helpfulness” scores):

np.mean(y)
np.float64(3.758165438295424)

The calculation is:

n = len(y)
np.sum(y) / n
np.float64(3.758165438295424)

Of course this is the same as:

1 / n * np.sum(y)
np.float64(3.758165438295424)

In mathematical notation, we write this as:

\[\begin{split} \bar{y} = \frac{1}{n} (y_1 + y_2 + ... + y_n) \\ \end{split}\]

where \(\bar{y}\) is the mean of the values in the vector \(\vec{y}\).

We could also write the operation of adding up the \(y\) values with the \(\sum\) notation:

\[ \sum_i y_i = (y_1 + y_2 + ... + y_n) \]

Using the \(\sum\) notation, we write the mean as:

\[ \bar{y} = \frac{1}{n} \sum_i y_i \]

This notation is very similar to the equivalent code:

1 / n * np.sum(y)
np.float64(3.758165438295424)

The long and the short of \(R^2\)#

\(R^2\) can also be called the coefficient of determination.

Sklearn has an r2_score metric.

skmetrics.r2_score(helpfulness, y_hat)
0.9130445684300285

The Statsmodels fit object has an rsquared attribute with the \(R^2\) value:

sm_fit.rsquared
np.float64(0.9130445684300286)

The formula for \(R^2\) is:

\[ R^2 = 1 - {SS_{\rm resid}\over SS_{\rm total}} \]

If \(\hat{y}_i\) is the fitted value for observation \(i\), then the residual error \(e_i\) for observation \(i\) is \(y_i - \hat{y}_i\), and \(SS_{\text{resid}}\) is the sum of squares of the residuals:

\[ SS_\text{resid}=\sum_i (y_i - \hat{y}_i)^2=\sum_i e_i^2\ \]
ss_resid = np.sum((y - y_hat) ** 2)
ss_resid
np.float64(0.18759984412745978)

As you saw above, we write the mean of the observed values \(y\) as \(\bar{y}\).

The denominator for the standard \(R^2\) statistic is the “total sum of squares” \(SS_\text{total}\), written as:

\[ SS_\text{total}=\sum_i (y_i - \bar{y})^2 \]
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_total
np.float64(2.157425254988258)

We can calculate \(R^2\) by hand to show this gives the same answer as Sklearn and Statsmodels.

# Calculate R2
1 - ss_resid / ss_total
np.float64(0.9130445684300285)

The by-hand calculation gives the same answer as Sklearn or Statsmodels.

sm_fit.rsquared
np.float64(0.9130445684300286)

\(R^2\) for another, reduced model#

You can think of the standard \(R^2\) above as a measure of the improvement in fit for an expanded model (with the regressors included) compared to a reduced model (that only uses the mean).
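
We can check this directly, reusing the values we calculated above: the mean-only (reduced) model predicts the mean of y for every observation, so its residual sum of squares is exactly ss_total, and the standard \(R^2\) is the expanded-versus-reduced comparison with that mean-only model.

# The mean-only model predicts np.mean(y) for every observation,
# so its residual sum of squares is the same as ss_total.
ss_mean_only = np.sum((y - np.mean(y)) ** 2)
assert np.isclose(ss_mean_only, ss_total)
# The standard R^2 compares the expanded model to this mean-only model.
assert np.isclose(1 - ss_resid / ss_mean_only, sm_fit.rsquared)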

You can use the same idea to compare an expanded model to another reduced model.

For example, we can fit a reduced model that is just “Clarity” without “Easiness”.

clarity_df = df[['Clarity']]
reduced_fit = sklm.LinearRegression().fit(clarity_df, helpfulness)
reduced_fit.coef_, reduced_fit.intercept_
(array([0.84119015]), np.float64(0.6381903166650691))

Calculate \(R^2\) for the full model, as compared to this reduced model:

reduced_resid = helpfulness - reduced_fit.predict(clarity_df)
ss_reduced_resid = (reduced_resid ** 2).sum()
ss_reduced_resid
np.float64(0.18768752586074872)
1 - ss_resid / ss_reduced_resid
np.float64(0.0004671686777627526)

On weights, and a weighted mean#

You remember, from the discussion of the mean, above, that we can write the mean calculation as:

\[ \bar{y} = \frac{1}{n} \sum_i y_i \]

In code:

1 / n * np.sum(y)
np.float64(3.758165438295424)

Mathematically, because \(p * (q + r) = p * q + p * r\) (the distributive property of multiplication), we can also do the multiplication by \(\frac{1}{n}\) inside the brackets, like this:

\[ \bar{y} = \frac{1}{n} y_1 + \frac{1}{n} y_2 + ... + \frac{1}{n} y_n \]

With the \(\sum\) notation, that would be:

\[ \bar{y} = \sum_i \frac{1}{n} y_i \]

In code:

np.sum(1 / n * y)
np.float64(3.7581654382954235)

Think of this, the standard calculation of the mean, as giving each value in \(y\) the same weight of \(\frac{1}{n}\):

weights = np.ones(n) / n
weights
array([0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
       0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333])
np.sum(y * weights)
np.float64(3.7581654382954235)

We can instead calculate weights that weight each “Helpfulness” value by the number of professors. We might do this on the basis that a larger number of professors may give a more reliable value, and that more reliable values should contribute relatively more to the mean.

n_professors = df['Number of Professors']
n_professors
0     23343
1     22394
2     11774
3     11179
4     11145
      ...  
70      117
71      110
72      110
73      104
74      102
Name: Number of Professors, Length: 75, dtype: int64
total_n_professors = np.sum(n_professors)
total_n_professors
np.int64(183894)

Calculate the proportion of professors represented by each discipline:

prop_professors_by_subject = n_professors / total_n_professors
prop_professors_by_subject
0     0.126937
1     0.121777
2     0.064026
3     0.060790
4     0.060606
        ...   
70    0.000636
71    0.000598
72    0.000598
73    0.000566
74    0.000555
Name: Number of Professors, Length: 75, dtype: float64

Because we have divided each value by the sum of the values to give the proportions, the proportions must add up to 1:

np.sum(prop_professors_by_subject)
np.float64(1.0)

Calculate weighted mean, where the weights are the proportions of professors in the matching discipline:

# Weighted mean of helpfulness (y), weighted by number of professors.
np.sum(y * prop_professors_by_subject)
np.float64(3.7234004372083844)

Numpy’s version of same:

np.average(y, weights=prop_professors_by_subject)
np.float64(3.7234004372083844)

In fact, np.average always divides the weighted sum by the sum of the weights, so we can get the same calculation by passing the raw numbers of professors as the weights:

np.average(y, weights=n_professors)
np.float64(3.723400437208384)
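
To see what np.average is doing here, we can reproduce the calculation by hand; np.average divides the weighted sum by the sum of the weights:

# np.average(y, weights=w) calculates np.sum(y * w) / np.sum(w).
np.sum(y * n_professors) / np.sum(n_professors)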

Fitting with weights#

Weighted regression differs from standard regression only in weighting each squared residual \(e^2_i = (y_i - \hat{y}_i)^2\) by some weight for that observation, which we will call \(w_i\).

As we saw above, in standard regression (AKA “ordinary least squares”, or OLS, regression), the cost function is the sum of squares of the residuals (\(SS_\text{resid}\)):

\[\begin{split} SS_\text{resid}=\sum_i (y_i - \hat{y}_i)^2 \\ =\sum_i e_i^2 \end{split}\]

In weighted regression we have a weight \(w_i\) for each observation, and the cost function is:

\[\begin{split} SS_\text{weighted_resid}=\sum_i w_i (y_i - \hat{y}_i)^2 \\ =\sum_i w_i e_i^2 \end{split}\]

Also see Wikipedia on weighted regression.

First we show weighted regression in action, and then show how this cost function works in code, with minimize.

Here we use Statsmodels to do weighted regression.

sm_weighted_model = smf.wls('Helpfulness ~ Clarity + Easiness',
                            weights=prop_professors_by_subject,
                            data=df)
sm_weighted_fit = sm_weighted_model.fit()
sm_weighted_fit.summary()
WLS Regression Results
Dep. Variable: Helpfulness R-squared: 0.898
Model: WLS Adj. R-squared: 0.895
Method: Least Squares F-statistic: 316.8
Date: Tue, 03 Sep 2024 Prob (F-statistic): 2.06e-36
Time: 10:05:27 Log-Likelihood: 90.990
No. Observations: 75 AIC: -176.0
Df Residuals: 72 BIC: -169.0
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.0847 0.105 10.293 0.000 0.875 1.295
Clarity 0.7139 0.041 17.596 0.000 0.633 0.795
Easiness 0.0080 0.031 0.257 0.798 -0.054 0.070
Omnibus: 13.040 Durbin-Watson: 1.729
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.801
Skew: 0.619 Prob(JB): 1.84e-05
Kurtosis: 5.333 Cond. No. 108.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To save some typing, and for notational convenience in code, call our two-regressor design X:

X = clarity_easiness
X
Clarity Easiness
0 3.756147 3.162754
1 3.487379 3.063322
2 3.608331 2.710459
3 3.909520 3.316210
4 3.788818 3.053803
... ... ...
70 3.932991 2.863504
71 3.636182 3.106727
72 3.749000 3.309636
73 3.441923 2.799135
74 3.927255 3.109118

75 rows × 2 columns

You can also do weighted regression in Sklearn, by passing weights to the LinearRegression fit method:

sk_weighted = sklm.LinearRegression().fit(
    X,  # clarity_easiness
    y,  # Helpfulness
    sample_weight=prop_professors_by_subject)
sk_weighted
LinearRegression()
sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))

The minimize cost function for weighted regression:

def sos_weighted(params, X, y, weights):
    """ Weighted least squares cost function
    """
    y_hat = calc_fitted(params, X)
    e = y - y_hat  # Vector of residuals (errors)
    e2 = e ** 2  # The vector of squared errors.
    return np.sum(e2 * weights)
weighted_res = minimize(sos_weighted, [0, 0, 0], args=(X, y, prop_professors_by_subject))
weighted_res
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.0018148520779416857
        x: [ 7.139e-01  7.992e-03  1.085e+00]
      nit: 11
      jac: [-7.451e-09  2.285e-09  1.451e-08]
 hess_inv: [[ 3.246e+01 -1.768e+01 -6.389e+01]
            [-1.768e+01  1.918e+01  5.111e+00]
            [-6.389e+01  5.111e+00  2.186e+02]]
     nfev: 60
     njev: 15
weighted_res.x
array([0.71386388, 0.00799227, 1.08466463])

Notice these are the same, to within very close limits, as the parameters we got from Statsmodels and Sklearn:

sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))

We can get minimize even closer by setting the tol value for minimize to be a very small number. tol specifies how small the difference has to be when making small changes in the parameters, before minimize accepts it is close enough to the right answer and stops. Put another way, it specifies the precision of the estimates.

In fact, to get really close, we also have to specify an (in this case) more accurate method for the optimization, rather than using the default. We will use the powell method to do the search.

weighted_res_precise = minimize(sos_weighted,
                                [0, 0, 0],
                                args=(X, y, prop_professors_by_subject),
                                method='powell',
                                tol=1e-10)  # 0.0000000001
weighted_res_precise
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 0.0018148520775737415
       x: [ 7.139e-01  7.991e-03  1.085e+00]
     nit: 7
   direc: [[-2.683e-02  2.738e-02  1.312e-02]
           [-2.747e-01 -1.944e-02  1.066e+00]
           [ 2.241e-13 -1.351e-13  7.467e-12]]
    nfev: 551
weighted_res_precise.x
array([0.7138681 , 0.00799134, 1.0846521 ])

Notice that we got very close to the Sklearn result.

sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))

Penalized regression#

Penalized regression is where you simultaneously minimize some cost related to the model (mis-)fit, and some cost related to the parameters of your model.
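
In code, the general shape of such a cost function is the misfit cost plus some constant times a penalty on the coefficients. Here is a minimal sketch, reusing our sos function from above; penalized_cost, penalty_func and alpha are names made up for illustration:

def penalized_cost(params, X, y, alpha, penalty_func):
    """ Sum of squared error plus scaled penalty on the coefficients.
    """
    # Penalize all parameters except the (last) intercept parameter.
    return sos(params, X, y) + alpha * penalty_func(params[:-1])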

Ridge regression#

For example, in ridge regression, we try to minimize the sum of the squared residuals plus a constant (alpha) times the sum of squares of the parameters (excluding the intercept).
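
In the notation of the LASSO section below, the cost function that Sklearn’s Ridge minimizes is (as given in the Scikit-learn Ridge documentation):

\[ ||y - Xw||^2_2 + alpha * ||w||^2_2 \]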

sk_ridge = sklm.Ridge(alpha=1).fit(X, y)
sk_ridge.coef_, sk_ridge.intercept_
(array([0.50092516, 0.15572434]), np.float64(1.4041393822652166))

Fit with the minimize cost function:

def sos_ridge(params, X, y, alpha):
    ss_resid = sos(params, X, y)  # Using our sos function.
    return ss_resid + alpha * np.sum(params[:-1] ** 2)

Fit ridge regression with the minimize cost function:

res_ridge = minimize(sos_ridge, [0, 0, 0], args=(X, y, 1.0),
                     method='powell', tol=1e-10)
res_ridge.x
array([0.50092517, 0.15572433, 1.40413936])

LASSO#

See the Scikit-learn LASSO page.

As noted there, the cost function is:

\[ \frac{1}{ 2 * n } * ||y - Xw||^2_2 + alpha * ||w||_1 \]

\(w\) refers to the vector of model parameters.

This part of the equation:

\[ ||y - Xw||^2_2 \]

is the sum of squares of the residuals, because the residuals are \(y - Xw\) (where \(w\) are the parameters of the model, and \(Xw\) are therefore the fitted values), and the \(||y - Xw||^2_2\) refers to the squared L2 vector norm, which is the same as the sum of squares.

\[ ||w||_1 \]

is the L1 vector norm of the parameters \(w\), which is the sum of the absolute values of the parameters.
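
We can check these norm identities numerically with np.linalg.norm, using the residual vector e and the Sklearn coefficients from the ordinary least-squares fit above:

# The squared L2 norm of the residuals is the sum of squared residuals.
assert np.isclose(np.linalg.norm(e) ** 2, np.sum(e ** 2))
# The L1 norm of the coefficients is the sum of their absolute values.
assert np.isclose(np.linalg.norm(sk_fit.coef_, 1), np.sum(np.abs(sk_fit.coef_)))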

Let’s do that calculation, with a low alpha (otherwise both slopes get forced down to zero):

# We need LassoLars for increased accuracy.
sk_lasso = sklm.LassoLars(alpha=0.01).fit(clarity_easiness, helpfulness)

sk_lasso.coef_, sk_lasso.intercept_
(array([0.56798712, 0.00366785]), np.float64(1.6398158836228247))

Here is the equivalent minimize cost function:

def sos_lasso(params, X, y, alpha):
    ss_resid = sos(params, X, y)
    n = len(y)
    penalty = np.sum(np.abs(params[:-1]))
    return 1 / (2 * n) * ss_resid + alpha * penalty
res_lasso = minimize(sos_lasso, [0, 0, 0], args=(X, y, 0.01),
                     method='powell', tol=1e-10)
res_lasso
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 0.008315817048158883
       x: [ 5.680e-01  3.668e-03  1.640e+00]
     nit: 5
   direc: [[ 0.000e+00  0.000e+00  1.000e+00]
           [-4.391e-01  0.000e+00  1.629e+00]
           [-3.755e-03  3.647e-03  2.276e-03]]
    nfev: 449
res_lasso.x
array([0.56798713, 0.00366785, 1.63981586])

Cross-validation#

Should I add the “Easiness” regressor? In other words, is a model that has both the “Clarity” and “Easiness” regressor better than a model that just has the “Clarity” regressor?

We could start by comparing the remaining (residual) sum of squares for the two models.

# Make a single-column dataframe with Clarity
clarity_df = clarity_easiness[['Clarity']]
clarity_df
Clarity
0 3.756147
1 3.487379
2 3.608331
3 3.909520
4 3.788818
... ...
70 3.932991
71 3.636182
72 3.749000
73 3.441923
74 3.927255

75 rows × 1 columns

First we fit the single-regressor model, with “Clarity” alone, and calculate the sum of squared residuals:

single_fit = sklm.LinearRegression().fit(
    clarity_df, helpfulness)
single_y_hat = single_fit.predict(clarity_df)
single_ss = np.sum((helpfulness - single_y_hat) ** 2)
single_ss
np.float64(0.18768752586074872)

Then we do the same for the two-regressor model, with “Clarity” and “Easiness”:

both_fit = sklm.LinearRegression().fit(
    clarity_easiness, helpfulness)
both_y_hat = both_fit.predict(clarity_easiness)
both_ss = np.sum((helpfulness - both_y_hat) ** 2)
both_ss
np.float64(0.18759984412745978)

There’s a small difference:

single_ss - both_ss
np.float64(8.768173328893569e-05)

Notice that the residual sum of squares \(SS_\text{resid}\) is very slightly lower for the model with “Easiness”. But in fact, we would reduce the error by adding almost any regressor to the model: the error can only ever go down, or stay the same, when we add another regressor.

In fact — let’s try it. We’ll add a regressor of random numbers to the “Clarity” regressor, and see how it fares:

rng = np.random.default_rng()

with_random = clarity_df.copy()
with_random['random'] = rng.normal(size=n)
with_random
Clarity random
0 3.756147 1.781814
1 3.487379 -3.056451
2 3.608331 -0.202730
3 3.909520 0.060513
4 3.788818 0.150505
... ... ...
70 3.932991 -0.284499
71 3.636182 -1.409122
72 3.749000 0.214077
73 3.441923 -0.888514
74 3.927255 -0.204216

75 rows × 2 columns

random_fit = sklm.LinearRegression().fit(
    with_random, helpfulness)
random_y_hat = random_fit.predict(with_random)
random_ss = np.sum((helpfulness - random_y_hat) ** 2)
random_ss
np.float64(0.18754866348079263)
single_ss - random_ss
np.float64(0.00013886237995608397)

Notice that the random regressor does about as well as (and often better than) the “Easiness” regressor. That’s not a good sign for the usefulness of “Easiness”.

We would like a better test — one that tells us whether our model is better able to predict a new value that the model hasn’t seen before.

We can do this with cross-validation. Leave-one-out is a simple form of cross-validation. The procedure is as follows: we take each row in the dataset in turn, drop that row from the dataset, and fit the model (estimate the parameters) on the dataset with that row dropped. We then use that model to predict the value for the row that we dropped, because this is a value the model did not use in its fit. We do this for all rows, each time fitting a model and predicting the value for the dropped row. Finally we ask: how well do the predicted values do in predicting the actual values?

This is how that would work for the first row:

  • Drop the first row and keep it somewhere.

  • Run the model on the remaining rows.

  • Use model to predict target value for first row.

  • Store prediction.

In code that might look like this:

row0_label = 0  # Row label of the first row.
# Row to drop, as a data frame:
dropped_row = df.loc[row0_label:row0_label]
dropped_row
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 3.756147 3.821866 3.791364 3.162754

We’ll be using these columns for the design and target:

x_cols = ['Clarity', 'Easiness']
y_col = 'Helpfulness'

Fit model with remaining rows, and predict target variable for first row:

# Dataframe without dropped row
remaining_df = df.drop(index=row0_label)
# Fit on everything but the dropped row.
fit = sklm.LinearRegression().fit(
    remaining_df[x_cols],
    remaining_df[y_col])
# Use fit to predict the dropped row.
fitted_val = fit.predict(dropped_row[x_cols])
fitted_val
array([3.79787918])

Then we keep going to generate a fitted value for every row.

Here’s a function to do the leave-one-out fit for a given row label:

def drop_and_predict(df, x_cols, y_col, row_label):
    """ Drop value identified by `row_label`, fit with rest and predict

    Parameters
    ----------
    df : DataFrame
    x_cols : sequence
        Sequence of column labels defining regressors.
    y_col : str
        Column label for target variable.
    row_label : object
        Row label of row to drop

    Returns
    -------
    fitted : scalar
        Fitted value for column `y_col` in row labeled `row_label`,
        using fit from all rows except row labeled `row_label`.
    """
    dropped_row = df.loc[row_label:row_label]
    # Dataframe without dropped row
    remaining_df = df.drop(index=row_label)
    # Fit on everything but the dropped row.
    fit = sklm.LinearRegression().fit(
        remaining_df[x_cols],
        remaining_df[y_col])
    # Use fit to predict target in the dropped row, and return.
    return fit.predict(dropped_row[x_cols])[0]

Use drop_and_predict to build a model for all rows but the first, as above, and then predict the “Helpfulness” value of the first row.

actual_value = df.loc[0, 'Helpfulness']
predicted_value = drop_and_predict(df,
                                   ['Clarity', 'Easiness'],
                                   'Helpfulness',
                                   0)
predicted_value
np.float64(3.7978791792022397)

Fit the model, with both “Clarity” and “Easiness”, and drop / predict each “Helpfulness” value.

predicted_with_easiness = df['Helpfulness'].copy()
for label in df.index:
    fitted = drop_and_predict(df,
                              ['Clarity', 'Easiness'],
                              'Helpfulness',
                              label)
    predicted_with_easiness.loc[label] = fitted
predicted_with_easiness
0     3.797879
1     3.568443
2     3.673608
3     3.928563
4     3.829398
        ...   
70    3.945049
71    3.697332
72    3.790545
73    3.534346
74    3.944375
Name: Helpfulness, Length: 75, dtype: float64

Get the sum of squared residuals for these predictions:

error_both = df['Helpfulness'] - predicted_with_easiness
# Sum of squared prediction errors for larger model.
np.sum(error_both ** 2)
np.float64(0.20079738301650638)

Do the same for the model with “Clarity” only.

predicted_without_easiness = df['Helpfulness'].copy()
for label in df.index:
    fitted = drop_and_predict(df, ['Clarity'], 'Helpfulness', label)
    predicted_without_easiness.loc[label] = fitted
error_just_one = df['Helpfulness'] - predicted_without_easiness
# Sum of squared prediction errors for smaller model.
np.sum(error_just_one ** 2)
np.float64(0.1966941619436559)

The model without “Easiness” does a slightly better job of predicting, with leave-one-out cross-validation.
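
As an aside, Scikit-learn can run this leave-one-out procedure for us. Here is a minimal sketch using LeaveOneOut and cross_val_predict from sklearn.model_selection (not used above); it should give the same predictions as our loop with drop_and_predict:

from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Leave-one-out predictions for the model with both regressors.
loo_predicted = cross_val_predict(sklm.LinearRegression(),
                                  df[['Clarity', 'Easiness']],
                                  df['Helpfulness'],
                                  cv=LeaveOneOut())
# Sum of squared prediction errors, as calculated by hand above.
np.sum((df['Helpfulness'] - loo_predicted) ** 2)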