Fitting models with different cost functions#
We also get on to cross-validation.
import numpy as np
np.set_printoptions(suppress=True)
from scipy.optimize import minimize
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import statsmodels.formula.api as smf
import sklearn.linear_model as sklm
import sklearn.metrics as skmetrics
df = pd.read_csv('data/rate_my_course.csv')
df
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
... | ... | ... | ... | ... | ... | ... |
70 | Anatomy | 117 | 3.932991 | 3.974701 | 3.954188 | 2.863504 |
71 | Earth Science | 110 | 3.636182 | 3.671364 | 3.655091 | 3.106727 |
72 | Linguistics | 110 | 3.749000 | 3.834545 | 3.798182 | 3.309636 |
73 | Mechanical Engineering | 104 | 3.441923 | 3.531154 | 3.489327 | 2.799135 |
74 | Medicine | 102 | 3.927255 | 3.934216 | 3.929118 | 3.109118 |
75 rows × 6 columns
Fetch some columns of interest:
# This will be our y (the variable we predict).
helpfulness = df['Helpfulness']
# One or both of these will be our X (the predictors).
clarity_easiness = df[['Clarity', 'Easiness']]
Fit the model with Statsmodels.
sm_model = smf.ols('Helpfulness ~ Clarity + Easiness', data=df)
sm_fit = sm_model.fit()
sm_fit.summary()
Dep. Variable: | Helpfulness | R-squared: | 0.913 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.911 |
Method: | Least Squares | F-statistic: | 378.0 |
Date: | Tue, 03 Sep 2024 | Prob (F-statistic): | 6.53e-39 |
Time: | 10:05:26 | Log-Likelihood: | 118.24 |
No. Observations: | 75 | AIC: | -230.5 |
Df Residuals: | 72 | BIC: | -223.5 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 0.6342 | 0.116 | 5.483 | 0.000 | 0.404 | 0.865 |
Clarity | 0.8477 | 0.047 | 18.160 | 0.000 | 0.755 | 0.941 |
Easiness | -0.0063 | 0.034 | -0.183 | 0.855 | -0.075 | 0.062 |
Omnibus: | 3.832 | Durbin-Watson: | 1.847 |
---|---|---|---|
Prob(Omnibus): | 0.147 | Jarque-Bera (JB): | 3.429 |
Skew: | 0.524 | Prob(JB): | 0.180 |
Kurtosis: | 3.032 | Cond. No. | 103. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Fit the same model with Scikit-learn.
sk_model = sklm.LinearRegression()
sk_fit = sk_model.fit(clarity_easiness, helpfulness)
sk_fit
LinearRegression()
The coefficients (the slopes for the regressors):
sk_fit.coef_
array([ 0.84765719, -0.00628188])
The intercept:
sk_fit.intercept_
np.float64(0.6342163349072085)
Compare the parameters to Statsmodels:
sm_fit.params
Intercept 0.634216
Clarity 0.847657
Easiness -0.006282
dtype: float64
The fitted values and Scikit-learn#
The values predicted by the (Sklearn) model:
y_hat = sk_fit.predict(clarity_easiness)
y_hat
array([3.79827334, 3.57107499, 3.67581734, 3.92731675, 3.82665181,
3.48871883, 3.89117061, 3.70003921, 3.8013664 , 3.48333495,
3.69359333, 3.78395243, 3.47336086, 3.57017191, 3.81165212,
3.59953202, 3.40917203, 3.74439846, 3.75529281, 3.87079006,
3.71650243, 3.91081494, 3.84021344, 3.46140189, 4.05110024,
3.69790129, 3.68125641, 3.73563421, 3.66975401, 3.76540619,
3.77108961, 3.89775444, 3.81443707, 3.45401091, 3.71363585,
3.84224058, 4.06748055, 3.80755826, 3.53711396, 3.7465973 ,
3.68031767, 3.68119022, 3.88083408, 3.8323017 , 3.60821429,
3.82505099, 3.69906336, 3.97423909, 3.96735067, 4.11466129,
3.80967881, 3.60364526, 3.63462294, 3.68518199, 3.74639111,
3.47842488, 3.76251001, 3.60481205, 3.93161439, 3.94669493,
3.54215346, 3.96030499, 3.74421691, 3.81391267, 4.13556455,
3.88432985, 3.80636234, 3.99835993, 3.84638216, 3.69029801,
3.95005664, 3.69693593, 3.79129242, 3.53420337, 3.9436511 ])
Compare the fitted (\(\hat{y}\)) values to those from Statsmodels:
sm_fit.predict() - y_hat
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0.])
Here’s a general check that the predictions are all the same (or close enough, to within computational precision):
assert np.allclose(sm_fit.predict(), y_hat)
We assemble Sklearn’s coefficients and intercept into a single list, with the intercept last.
params = list(sk_fit.coef_) + [sk_fit.intercept_]
params
[np.float64(0.8476571912478371),
np.float64(-0.006281876019590555),
np.float64(0.6342163349072085)]
If we just want all but the last parameter (all the coefficients, but not the intercept):
# All parameters but the last (all but the intercept parameter).
params[:-1]
[np.float64(0.8476571912478371), np.float64(-0.006281876019590555)]
We could also get the parameters from Statsmodels, but we’d have to rearrange them, because Statsmodels puts the intercept first rather than last:
sm_fit.params.iloc[[1, 2, 0]]
Clarity 0.847657
Easiness -0.006282
Intercept 0.634216
dtype: float64
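As a quick check (an aside, not in the original notebook), we can confirm that the rearranged Statsmodels parameters match the list we built from Sklearn:

# Rearranged Statsmodels parameters match our Sklearn-derived list.
assert np.allclose(sm_fit.params.iloc[[1, 2, 0]], params)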
Write a function to compute the fitted values given the parameters:
def calc_fitted(params, X):
""" Calculate fitted values from design X and parameters
Parameters
----------
params : vector (1D array)
Vector of parameters, intercept is last parameter.
X : array
2D array with regressor columns.
Returns
-------
y_hat : vector
Vector of fitted values
"""
X = np.array(X)
n, p = X.shape
y_hat = np.zeros(n)
for col_no, param in enumerate(params[:-1]):
y_hat = y_hat + param * X[:, col_no] # Add contribution from this regressor.
y_hat = y_hat + params[-1] # Add contribution from intercept.
return y_hat
Show that we get the same fitted values from our function as we got from the `predict` method of Sklearn:
our_y_hat = calc_fitted(params, clarity_easiness)
assert np.allclose(our_y_hat, y_hat)
Calculate the error vector and then calculate the sum of squared error:
e = helpfulness - our_y_hat
np.sum(e ** 2)
np.float64(0.18759984412745978)
Make a function to calculate sum of squared error:
def sos(params, X, y):
""" Sum of squared error for `params` given model `X` and data `y`.
"""
y_hat = calc_fitted(params, X)
e = y - y_hat # residuals
return np.sum(e ** 2)
Check that we get the same answer from the function as we got from calculating above:
sos(params, clarity_easiness, helpfulness)
np.float64(0.18759984412745978)
Use `minimize` to find parameters minimizing the sum of squared error:
min_res = minimize(sos, [0, 0, 0], args=(clarity_easiness, helpfulness))
min_res
message: Optimization terminated successfully.
success: True
status: 0
fun: 0.18759984414069664
x: [ 8.477e-01 -6.281e-03 6.342e-01]
nit: 8
jac: [ 3.278e-07 2.831e-07 1.155e-07]
hess_inv: [[ 4.169e-01 -2.313e-01 -8.092e-01]
[-2.313e-01 2.249e-01 1.414e-01]
[-8.092e-01 1.414e-01 2.557e+00]]
nfev: 40
njev: 10
Yes, the parameters (coefficients and intercept) are effectively the same as those we got from Statsmodels and Sklearn:
min_res.x
array([ 0.84765427, -0.00628103, 0.63422445])
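As a check (an aside, not in the original), we can confirm the `minimize` estimates match the earlier parameters, allowing a small tolerance for the precision of the optimization:

# minimize's estimates match Sklearn's, to within optimization precision.
assert np.allclose(min_res.x, params, atol=1e-5)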
A short diversion — writing the mean in symbols#
For notational convenience, give our \(y\) (“Helpfulness”) vector the variable name `y`:
y = helpfulness
This is the usual mean of `y` (the “Helpfulness” scores):
np.mean(y)
np.float64(3.758165438295424)
The calculation is:
n = len(y)
np.sum(y) / n
np.float64(3.758165438295424)
Of course this is the same as:
1 / n * np.sum(y)
np.float64(3.758165438295424)
In mathematical notation, we write this as:

$$
\bar{y} = \frac{1}{n} (y_1 + y_2 + \cdots + y_n)
$$

where \(\bar{y}\) is the mean of the values in the vector \(\vec{y}\).

We could also write the operation of adding up the \(y\) values with the \(\sum\) notation:

$$
y_1 + y_2 + \cdots + y_n = \sum_{i=1}^{n} y_i
$$

Using the \(\sum\) notation, we write the mean as:

$$
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
$$
This notation is very similar to the equivalent code:
1 / n * np.sum(y)
np.float64(3.758165438295424)
The long and the short of \(R^2\)#
\(R^2\) can also be called the coefficient of determination.
Sklearn has an `r2_score` metric.
skmetrics.r2_score(helpfulness, y_hat)
0.9130445684300285
The Statsmodels `fit` object has an `rsquared` attribute with the \(R^2\) value:
sm_fit.rsquared
np.float64(0.9130445684300286)
The formula for \(R^2\) is:

$$
R^2 = 1 - \frac{SS_{\text{resid}}}{SS_{\text{tot}}}
$$

If \(\hat{y}_i\) is the fitted value for observation \(i\), then the residual error \(e_i\) for observation \(i\) is \(y_i - \hat{y}_i\), and \(SS_{\text{resid}}\) is the sum of squares of the residuals:

$$
SS_{\text{resid}} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
ss_resid = np.sum((y - y_hat) ** 2)
ss_resid
np.float64(0.18759984412745978)
As you saw above, we write the mean of the observed values \(y\) as \(\bar{y}\).
The denominator for the standard \(R^2\) statistic is the “total sum of squares” \(SS_\text{tot}\), written as:

$$
SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_total
np.float64(2.157425254988258)
We can calculate \(R^2\) by hand to show this gives the same answer as Sklearn and Statsmodels.
# Calculate R2
1 - ss_resid / ss_total
np.float64(0.9130445684300285)
The by-hand calculation gives the same answer as Sklearn or Statsmodels.
sm_fit.rsquared
np.float64(0.9130445684300286)
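To make the agreement explicit (an aside, not in the original page), we can assert that the by-hand value matches both:

# By-hand R^2 matches Sklearn's r2_score and Statsmodels' rsquared.
assert np.isclose(1 - ss_resid / ss_total, skmetrics.r2_score(helpfulness, y_hat))
assert np.isclose(1 - ss_resid / ss_total, sm_fit.rsquared)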
\(R^2\) for another, reduced model#
You can think of the standard \(R^2\) above as a measure of the improvement in fit for an expanded model (with the regressors included) compared to a reduced model (that only uses the mean).
You can use the same idea to compare an expanded model to another reduced model.
For example, we can fit a reduced model that is just “Clarity” without “Easiness”.
clarity_df = df[['Clarity']]
reduced_fit = sklm.LinearRegression().fit(clarity_df, helpfulness)
reduced_fit.coef_, reduced_fit.intercept_
(array([0.84119015]), np.float64(0.6381903166650691))
Calculate \(R^2\) for the full model, this time using the reduced model (rather than the mean) as the baseline for comparison:
reduced_resid = helpfulness - reduced_fit.predict(clarity_df)
ss_reduced_resid = (reduced_resid ** 2).sum()
ss_reduced_resid
np.float64(0.18768752586074872)
1 - ss_resid / ss_reduced_resid
np.float64(0.0004671686777627526)
On weights, and a weighted mean#
You remember, from the discussion of the mean, above, that we can write the mean calculation as:

$$
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
$$
In code:
1 / n * np.sum(y)
np.float64(3.758165438295424)
Mathematically, because \(p * (q + r) = p * q + p * r\) (the distributive property of multiplication), we can also do the multiplication by \(\frac{1}{n}\) inside the brackets, like this:

$$
\bar{y} = \frac{1}{n} y_1 + \frac{1}{n} y_2 + \cdots + \frac{1}{n} y_n
$$

With the \(\sum\) notation, that would be:

$$
\bar{y} = \sum_{i=1}^{n} \frac{1}{n} y_i
$$
In code:
np.sum(1 / n * y)
np.float64(3.7581654382954235)
Think of this - the standard calculation of the mean - as giving each value in \(y\) the same weight - of \(\frac{1}{n}\):
weights = np.ones(n) / n
weights
array([0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333,
0.01333333, 0.01333333, 0.01333333, 0.01333333, 0.01333333])
np.sum(y * weights)
np.float64(3.7581654382954235)
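As a quick check (an aside, not in the original), equal weights of \(\frac{1}{n}\) give back the ordinary mean:

# Equal weights of 1/n give the ordinary (unweighted) mean.
assert np.isclose(np.sum(y * weights), np.mean(y))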
We can also calculate weights to weight each “Helpfulness” value by the number of professors. We might do this on the basis that values from disciplines with more professors may be more reliable, and should therefore contribute relatively more to the mean.
n_professors = df['Number of Professors']
n_professors
0 23343
1 22394
2 11774
3 11179
4 11145
...
70 117
71 110
72 110
73 104
74 102
Name: Number of Professors, Length: 75, dtype: int64
total_n_professors = np.sum(n_professors)
total_n_professors
np.int64(183894)
Calculate the proportion of professors represented by each discipline:
prop_professors_by_subject = n_professors / total_n_professors
prop_professors_by_subject
0 0.126937
1 0.121777
2 0.064026
3 0.060790
4 0.060606
...
70 0.000636
71 0.000598
72 0.000598
73 0.000566
74 0.000555
Name: Number of Professors, Length: 75, dtype: float64
Because we have divided each value by the sum of the values to give the proportions, the proportions must add up to 1:
np.sum(prop_professors_by_subject)
np.float64(1.0)
Calculate weighted mean, where the weights are the proportions of professors in the matching discipline:
# Weighted mean of helpfulness (y), weighted by number of professors.
np.sum(y * prop_professors_by_subject)
np.float64(3.7234004372083844)
Numpy’s version of the same calculation:
np.average(y, weights=prop_professors_by_subject)
np.float64(3.7234004372083844)
In fact, Numpy divides by the sum of the weights for us, so we can get the same result by passing the raw numbers of professors as the weights:
np.average(y, weights=n_professors)
np.float64(3.723400437208384)
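Here is a sketch (an aside, not in the original; the name `by_hand_avg` is just for this sketch) of the calculation Numpy is doing:

# np.average(y, weights=w) computes sum(w * y) / sum(w).
by_hand_avg = np.sum(y * n_professors) / np.sum(n_professors)
assert np.isclose(by_hand_avg, np.average(y, weights=n_professors))
by_hand_avg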
Fitting with weights#
Weighted regression differs from standard regression only in weighting each squared residual \(e^2_i = (y_i - \hat{y}_i)^2\) by some weight for that observation — call this \(w_i\).
As we saw above, in standard regression — AKA “ordinary least squares” regression — the cost function is the sum of squares of the residuals (\(SS_{resid}\)):

$$
SS_{\text{resid}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

In weighted regression we have a weight \(w_i\) for each observation, and the cost function is:

$$
\sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2
$$
Also see Wikipedia on weighted regression.
First we show weighted regression in action, and then show how this cost function works in code, with `minimize`.
Here we use Statsmodels to do weighted regression.
sm_weighted_model = smf.wls('Helpfulness ~ Clarity + Easiness',
weights=prop_professors_by_subject,
data=df)
sm_weighted_fit = sm_weighted_model.fit()
sm_weighted_fit.summary()
Dep. Variable: | Helpfulness | R-squared: | 0.898 |
---|---|---|---|
Model: | WLS | Adj. R-squared: | 0.895 |
Method: | Least Squares | F-statistic: | 316.8 |
Date: | Tue, 03 Sep 2024 | Prob (F-statistic): | 2.06e-36 |
Time: | 10:05:27 | Log-Likelihood: | 90.990 |
No. Observations: | 75 | AIC: | -176.0 |
Df Residuals: | 72 | BIC: | -169.0 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 1.0847 | 0.105 | 10.293 | 0.000 | 0.875 | 1.295 |
Clarity | 0.7139 | 0.041 | 17.596 | 0.000 | 0.633 | 0.795 |
Easiness | 0.0080 | 0.031 | 0.257 | 0.798 | -0.054 | 0.070 |
Omnibus: | 13.040 | Durbin-Watson: | 1.729 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.801 |
Skew: | 0.619 | Prob(JB): | 1.84e-05 |
Kurtosis: | 5.333 | Cond. No. | 108. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To save some typing, and for notational convenience in code, call our two-regressor design `X`:
X = clarity_easiness
X
| | Clarity | Easiness |
---|---|---|
0 | 3.756147 | 3.162754 |
1 | 3.487379 | 3.063322 |
2 | 3.608331 | 2.710459 |
3 | 3.909520 | 3.316210 |
4 | 3.788818 | 3.053803 |
... | ... | ... |
70 | 3.932991 | 2.863504 |
71 | 3.636182 | 3.106727 |
72 | 3.749000 | 3.309636 |
73 | 3.441923 | 2.799135 |
74 | 3.927255 | 3.109118 |
75 rows × 2 columns
You can also do weighted regression in Sklearn, by passing weights to the `LinearRegression` `fit` method:
sk_weighted = sklm.LinearRegression().fit(
X, # clarity_easiness
y, # Helpfulness
sample_weight=prop_professors_by_subject)
sk_weighted
LinearRegression()
sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))
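As a quick check (an aside, not in the original page), the Sklearn weighted fit matches the Statsmodels WLS parameters above:

# Sklearn's weighted fit matches Statsmodels WLS (Intercept comes first in params).
assert np.allclose(sk_weighted.coef_, sm_weighted_fit.params.iloc[1:])
assert np.isclose(sk_weighted.intercept_, sm_weighted_fit.params.iloc[0])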
The `minimize` cost function for weighted regression:
def sos_weighted(params, X, y, weights):
""" Weighted least squares cost function
"""
y_hat = calc_fitted(params, X)
e = y - y_hat # Vector of residuals (errors)
e2 = e ** 2 # The vector of squared errors.
return np.sum(e2 * weights)
weighted_res = minimize(sos_weighted, [0, 0, 0], args=(X, y, prop_professors_by_subject))
weighted_res
message: Optimization terminated successfully.
success: True
status: 0
fun: 0.0018148520779416857
x: [ 7.139e-01 7.992e-03 1.085e+00]
nit: 11
jac: [-7.451e-09 2.285e-09 1.451e-08]
hess_inv: [[ 3.246e+01 -1.768e+01 -6.389e+01]
[-1.768e+01 1.918e+01 5.111e+00]
[-6.389e+01 5.111e+00 2.186e+02]]
nfev: 60
njev: 15
weighted_res.x
array([0.71386388, 0.00799227, 1.08466463])
Notice these are the same (to within very close limits) as the parameters we got from Statsmodels and Sklearn:
sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))
We can get `minimize` even closer by setting the `tol` value for `minimize` to be a very small number. `tol` specifies how small the difference has to be, when making small changes in the parameters, before `minimize` accepts it is close enough to the right answer and stops. Put another way, it specifies the precision of the estimates.

In fact, to get really close, we also have to specify an (in this case) more accurate method for the optimization, rather than using the default. We will use the `powell` method to do the search.
weighted_res_precise = minimize(sos_weighted,
[0, 0, 0],
args=(X, y, prop_professors_by_subject),
method='powell',
tol=1e-10) # 0.0000000001
weighted_res_precise
message: Optimization terminated successfully.
success: True
status: 0
fun: 0.0018148520775737415
x: [ 7.139e-01 7.991e-03 1.085e+00]
nit: 7
direc: [[-2.683e-02 2.738e-02 1.312e-02]
[-2.747e-01 -1.944e-02 1.066e+00]
[ 2.241e-13 -1.351e-13 7.467e-12]]
nfev: 551
weighted_res_precise.x
array([0.7138681 , 0.00799134, 1.0846521 ])
Notice that we got very close to the Sklearn result.
sk_weighted.coef_, sk_weighted.intercept_
(array([0.7138681 , 0.00799134]), np.float64(1.0846520953764385))
Penalized regression#
Penalized regression is where you simultaneously minimize some cost related to the model (mis-)fit, and some cost related to the parameters of your model.
Ridge regression#
For example, in ridge regression, we try and minimize the sum of squared residuals plus the sum of squares of the parameters (except for the intercept), scaled by a penalty weight (the `alpha` value below).
sk_ridge = sklm.Ridge(alpha=1).fit(X, y)
sk_ridge.coef_, sk_ridge.intercept_
(array([0.50092516, 0.15572434]), np.float64(1.4041393822652166))
Here is the matching cost function for `minimize`:
def sos_ridge(params, X, y, alpha):
ss_resid = sos(params, X, y) # Using our sos function.
return ss_resid + alpha * np.sum(params[:-1] ** 2)
Fit ridge regression by minimizing this cost function:
res_ridge = minimize(sos_ridge, [0, 0, 0], args=(X, y, 1.0),
method='powell', tol=1e-10)
res_ridge.x
array([0.50092517, 0.15572433, 1.40413936])
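As a check (an aside, not in the original), these estimates match the Sklearn `Ridge` fit to high precision:

# minimize's ridge estimates match Sklearn's Ridge coefficients and intercept.
assert np.allclose(res_ridge.x[:-1], sk_ridge.coef_)
assert np.isclose(res_ridge.x[-1], sk_ridge.intercept_)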
LASSO#
See the Scikit-learn LASSO page.
As noted there, the cost function is:

$$
\frac{1}{2 n_{\text{samples}}} ||y - Xw||^2_2 + \alpha ||w||_1
$$

\(w\) refers to the vector of model parameters.

This part of the equation:

$$
||y - Xw||^2_2
$$

is the sum of squares of the residuals, because the residuals are \(y - Xw\) (where \(w\) are the parameters of the model, and \(Xw\) are therefore the fitted values), and the \(||y - Xw||^2_2\) refers to the squared L2 vector norm, which is the same as the sum of squares.

This part of the equation:

$$
||w||_1
$$

is the L1 vector norm of the parameters \(w\), which is the sum of the absolute values of the parameters.
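As a quick illustration of the norm notation (a sketch, an aside from the original argument; the vector `v` is made up for this sketch), Numpy's `np.linalg.norm` confirms that the squared L2 norm is the sum of squares, and the L1 norm is the sum of absolute values:

v = np.array([3, -4, 12])  # An example vector, invented for this sketch.
# Squared L2 norm == sum of squares.
assert np.isclose(np.linalg.norm(v, 2) ** 2, np.sum(v ** 2))
# L1 norm == sum of absolute values.
assert np.isclose(np.linalg.norm(v, 1), np.sum(np.abs(v)))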
Let’s do that calculation, with a low `alpha` (otherwise both slopes get forced down to zero):
# We need LassoLars for increased accuracy.
sk_lasso = sklm.LassoLars(alpha=0.01).fit(clarity_easiness, helpfulness)
sk_lasso.coef_, sk_lasso.intercept_
(array([0.56798712, 0.00366785]), np.float64(1.6398158836228247))
Here is the equivalent `minimize` cost function:
def sos_lasso(params, X, y, alpha):
ss_resid = sos(params, X, y)
n = len(y)
penalty = np.sum(np.abs(params[:-1]))
return 1 / (2 * n) * ss_resid + alpha * penalty
res_lasso = minimize(sos_lasso, [0, 0, 0], args=(X, y, 0.01),
method='powell', tol=1e-10)
res_lasso
message: Optimization terminated successfully.
success: True
status: 0
fun: 0.008315817048158883
x: [ 5.680e-01 3.668e-03 1.640e+00]
nit: 5
direc: [[ 0.000e+00 0.000e+00 1.000e+00]
[-4.391e-01 0.000e+00 1.629e+00]
[-3.755e-03 3.647e-03 2.276e-03]]
nfev: 449
res_lasso.x
array([0.56798713, 0.00366785, 1.63981586])
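Again as a check (an aside, not in the original), these estimates are very close to the Sklearn `LassoLars` result:

# minimize's LASSO estimates match Sklearn's LassoLars fit.
assert np.allclose(res_lasso.x[:-1], sk_lasso.coef_)
assert np.isclose(res_lasso.x[-1], sk_lasso.intercept_)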
Cross-validation#
Should I add the “Easiness” regressor? In other words, is a model that has both the “Clarity” and “Easiness” regressors better than a model that just has the “Clarity” regressor?
We could start by comparing the remaining (residual) sum of squares for the two models.
# Make a single-column dataframe with Clarity
clarity_df = clarity_easiness[['Clarity']]
clarity_df
| | Clarity |
---|---|
0 | 3.756147 |
1 | 3.487379 |
2 | 3.608331 |
3 | 3.909520 |
4 | 3.788818 |
... | ... |
70 | 3.932991 |
71 | 3.636182 |
72 | 3.749000 |
73 | 3.441923 |
74 | 3.927255 |
75 rows × 1 columns
First we fit a linear model with the single-regressor “Clarity” model, and calculate the sum of squared residuals:
single_fit = sklm.LinearRegression().fit(
clarity_df, helpfulness)
single_y_hat = single_fit.predict(clarity_df)
single_ss = np.sum((helpfulness - single_y_hat) ** 2)
single_ss
np.float64(0.18768752586074872)
Then we do the same for the two-regressor model, with “Clarity” and “Easiness”:
both_fit = sklm.LinearRegression().fit(
clarity_easiness, helpfulness)
both_y_hat = both_fit.predict(clarity_easiness)
both_ss = np.sum((helpfulness - both_y_hat) ** 2)
both_ss
np.float64(0.18759984412745978)
There’s a small difference:
single_ss - both_ss
np.float64(8.768173328893569e-05)
Notice the sum of squared residuals \(SS_{\text{resid}}\) is very slightly lower for the model with “Easiness”. But in fact, we could show that we would reduce the error by adding almost any regressor to the model. It turns out that the error can only ever go down, or stay the same, when we add another regressor.
In fact — let’s try it. We’ll add a regressor of random numbers to the “Clarity” regressor, and see how it fares:
rng = np.random.default_rng()
with_random = clarity_df.copy()
with_random['random'] = rng.normal(size=n)
with_random
| | Clarity | random |
---|---|---|
0 | 3.756147 | 1.781814 |
1 | 3.487379 | -3.056451 |
2 | 3.608331 | -0.202730 |
3 | 3.909520 | 0.060513 |
4 | 3.788818 | 0.150505 |
... | ... | ... |
70 | 3.932991 | -0.284499 |
71 | 3.636182 | -1.409122 |
72 | 3.749000 | 0.214077 |
73 | 3.441923 | -0.888514 |
74 | 3.927255 | -0.204216 |
75 rows × 2 columns
random_fit = sklm.LinearRegression().fit(
with_random, helpfulness)
random_y_hat = random_fit.predict(with_random)
random_ss = np.sum((helpfulness - random_y_hat) ** 2)
random_ss
np.float64(0.18754866348079263)
single_ss - random_ss
np.float64(0.00013886237995608397)
Notice the random regressor does about as well as (and often better than) the “Easiness” regressor. That’s not a good sign for the usefulness of “Easiness”.
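To illustrate (a sketch, not part of the original analysis; the `trial_*` names are just for this sketch), we can repeat the random-regressor experiment a few times — the residual sum of squares never rises above the value for the “Clarity”-only model:

# Repeat the random-regressor fit; the reduction in SS_resid is never negative.
for i in range(5):
    trial_df = clarity_df.copy()
    trial_df['random'] = rng.normal(size=n)
    trial_fit = sklm.LinearRegression().fit(trial_df, helpfulness)
    trial_ss = np.sum((helpfulness - trial_fit.predict(trial_df)) ** 2)
    print(single_ss - trial_ss)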
We would like a better test — one that tells us whether our model is better able to predict a new value that the model hasn’t seen before.
We can do this with cross-validation. Leave-one-out is a simple form of cross-validation. The procedure is that we take each row in the dataset in turn. We drop that row from the dataset, and fit the model (estimate the parameters) on the dataset with that row dropped. Then we use that model to predict the value for the row that we dropped — because this is a value we did not use to build the model. We do this for all rows, each time making a model and predicting the value for the row we dropped. Then we ask — how well do the predicted values do in predicting the actual values?
This is how that would work for the first row:

- Drop the first row and keep it somewhere.
- Run the model on the remaining rows.
- Use the model to predict the target value for the first row.
- Store the prediction.

In code that might look like this:
row0_label = 0 # Row label of the first row.
# Row to drop, as a data frame:
dropped_row = df.loc[row0_label:row0_label]
dropped_row
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
We’ll be using these columns for the design and target:
x_cols = ['Clarity', 'Easiness']
y_col = 'Helpfulness'
Fit model with remaining rows, and predict target variable for first row:
# Dataframe without dropped row
remaining_df = df.drop(index=row0_label)
# Fit on everything but the dropped row.
fit = sklm.LinearRegression().fit(
remaining_df[x_cols],
remaining_df[y_col])
# Use fit to predict the dropped row.
fitted_val = fit.predict(dropped_row[x_cols])
fitted_val
array([3.79787918])
Then we keep going to generate a fitted value for every row.
Here’s a function to do the leave-one-out fit for a given row label:
def drop_and_predict(df, x_cols, y_col, row_label):
""" Drop value identified by `row_label`, fit with rest and predict
Parameters
----------
df : DataFrame
x_cols : sequence
Sequence of column labels defining regressors.
y_col : str
Column label for target variable.
row_label : object
Row label of row to drop
Returns
-------
fitted : scalar
Fitted value for column `y_col` in row labeled `row_label`,
using fit from all rows except row labeled `row_label`.
"""
dropped_row = df.loc[row_label:row_label]
# Dataframe without dropped row
remaining_df = df.drop(index=row_label)
# Fit on everything but the dropped row.
fit = sklm.LinearRegression().fit(
remaining_df[x_cols],
remaining_df[y_col])
# Use fit to predict target in the dropped row, and return.
return fit.predict(dropped_row[x_cols])[0]
Use `drop_and_predict` to build a model for all rows but the first, as above, and then predict the “Helpfulness” value of the first row.
actual_value = df.loc[0, 'Helpfulness']
predicted_value = drop_and_predict(df,
['Clarity', 'Easiness'],
'Helpfulness',
0)
predicted_value
np.float64(3.7978791792022397)
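As a check (an aside, not in the original), the function gives the same answer as the step-by-step calculation above:

# drop_and_predict reproduces the step-by-step prediction for the first row.
assert np.isclose(predicted_value, fitted_val[0])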
Fit the model with both “Clarity” and “Easiness”, dropping and predicting each “Helpfulness” value in turn.
predicted_with_easiness = df['Helpfulness'].copy()
for label in df.index:
fitted = drop_and_predict(df,
['Clarity', 'Easiness'],
'Helpfulness',
label)
predicted_with_easiness.loc[label] = fitted
predicted_with_easiness
0 3.797879
1 3.568443
2 3.673608
3 3.928563
4 3.829398
...
70 3.945049
71 3.697332
72 3.790545
73 3.534346
74 3.944375
Name: Helpfulness, Length: 75, dtype: float64
Get the sum of squared residuals for these predictions:
error_both = df['Helpfulness'] - predicted_with_easiness
# Sum of squared prediction errors for larger model.
np.sum(error_both ** 2)
np.float64(0.20079738301650638)
Do the same for the model with “Clarity” only.
predicted_without_easiness = df['Helpfulness'].copy()
for label in df.index:
fitted = drop_and_predict(df, ['Clarity'], 'Helpfulness', label)
predicted_without_easiness.loc[label] = fitted
error_just_one = df['Helpfulness'] - predicted_without_easiness
# Sum of squared prediction errors for smaller model.
np.sum(error_just_one ** 2)
np.float64(0.1966941619436559)
The model without “Easiness” does a slightly better job of predicting, with leave-one-out cross-validation.
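As an aside (not part of the original page; the name `cv_predicted` is just for this sketch), Scikit-learn can run the same leave-one-out procedure for us, with `LeaveOneOut` and `cross_val_predict`. Here is a sketch; it should reproduce the predictions from our loop above for the two-regressor model:

from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Leave-one-out predictions for the two-regressor model.
cv_predicted = cross_val_predict(sklm.LinearRegression(),
                                 df[x_cols],
                                 df[y_col],
                                 cv=LeaveOneOut())
# Sum of squared prediction errors, as calculated by hand above.
np.sum((df[y_col] - cv_predicted) ** 2)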