Sum of squares, root mean square#

The mean and slopes page used the Sum of Squared Error (SSE) as the measure of how well a particular slope fits the data.

Here we will think about another, derived measure, called the Root Mean Square Error.

First, let us go back to the original problem.

Hemoglobin and Packed Cell Volume, again#

# Run this cell.
import numpy as np
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# Print to 4 decimal places, show tiny values as 0
np.set_printoptions(precision=4, suppress=True)
import pandas as pd
pd.set_option('mode.copy_on_write', True)

We go back to the data on chronic kidney disease.

Download the data to your computer via this link: ckd_clean.csv.

We were interested in finding the “best” slope for a line relating the blood hemoglobin measure for each patient to their corresponding Packed Cell Volume (PCV) blood measure:

# Run this cell
ckd = pd.read_csv('ckd_clean.csv')
hgb = np.array(ckd['Hemoglobin'])
pcv = np.array(ckd['Packed Cell Volume'])
# Plot HGB on the x axis, PCV on the y axis
plt.plot(hgb, pcv, 'o')
plt.xlabel('Hemoglobin concentration')
plt.ylabel('Packed cell volume')
(Figure: scatter plot of Packed cell volume against Hemoglobin concentration.)

We are assuming our line goes through hgb == 0 and pcv == 0, on the basis that the red blood cells measured by PCV contain all the hemoglobin, so if PCV is 0 then hemoglobin must be 0 and vice versa. Our predicted values for PCV for each patient are given by some slope multiplied by the corresponding hemoglobin value for that patient.

For example, say the slope is 2.25, then the predicted PCV values are:

s = 2.25
predicted_pcv = s * hgb
predicted_pcv[:10]
array([25.2  , 21.375, 24.3  , 12.6  , 17.325, 22.05 , 28.125, 22.5  ,
       23.625, 22.05 ])

The errors for these predictions are:

errors = pcv - predicted_pcv
errors[:10]
array([6.8  , 7.625, 7.7  , 3.4  , 6.675, 9.95 , 8.875, 6.5  , 9.375,
       5.95 ])

Sum of Squared Error#

We decided to use the Sum of Squared Error (SSE) as a measure of the quality of fit.

The sum of squared error is nothing but:

sse = np.sum(errors ** 2)
sse
23274.978750000002

Here’s a function to return the SSE. We call this the cost function because it calculates a cost in terms of error for a given slope.

def calc_sse(slope):
    predicted_pcv = hgb * slope  # 'hgb' comes from the top level
    errors = pcv - predicted_pcv # 'pcv' comes from the top level
    return np.sum(errors ** 2)
# Recalculate SSE for slope of interest.
calc_sse(2.25)
23274.978750000002

Then we used minimize to find the slope that minimizes the SSE:

from scipy.optimize import minimize
# 2.25 below is the slope value to start the search.
res_sse = minimize(calc_sse, 2.25)
res_sse
  message: Desired error not necessarily achieved due to precision loss.
  success: False
   status: 2
      fun: 3619.6157878183285
        x: [ 3.047e+00]
      nit: 2
      jac: [ 1.221e-04]
 hess_inv: [[ 1.618e-05]]
     nfev: 70
     njev: 29

Notice the warning: there was some inaccuracy in the calculation. We may try slightly different cost functions to avoid that.
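The result object that minimize returns records this outcome in its success and message attributes, so we can also check for this kind of problem in code. Here is a small sketch using the res_sse result from above:

# Check whether the optimization reported success, and the solver's message.
print('success:', res_sse.success)
print('message:', res_sse.message)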

Root mean square error#

One problem with the SSE is that it can get very large, especially when there are a large number of points. Very large values can cause calculation problems, and they also make the SSE values harder to think about.

One way of allowing for the number of points is to use the Mean Square Error (MSE) instead of the SSE. Here is the MSE cost function.

def calc_mse(slope):
    predicted_pcv = hgb * slope  # 'hgb' comes from the top level
    errors = pcv - predicted_pcv # 'pcv' comes from the top level
    return np.mean(errors ** 2)  # Notice mean instead of sum

We try minimize again, to find the best slope.

# 2.25 below is the slope value to start the search.
res_mse = minimize(calc_mse, 2.25)
res_mse
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 22.908960682394497
        x: [ 3.047e+00]
      nit: 2
      jac: [ 2.384e-07]
 hess_inv: [[ 2.556e-03]]
     nfev: 6
     njev: 3

Notice there is no warning this time in the optimization message.

Notice that running minimize on the MSE found the same slope as it did for the SSE. The resulting cost function fun value is the original SSE value divided by the number of points. (It’s not exactly the same because of the inaccuracy of the SSE estimate).

res_sse.fun / len(hgb)
22.908960682394483
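We can also compare the slope estimates directly, using the x attribute of each result. They agree closely, though perhaps not exactly, because of the precision problem in the SSE fit:

# Slope estimates from the SSE and MSE minimizations.
print('Slope minimizing SSE:', res_sse.x[0])
print('Slope minimizing MSE:', res_mse.x[0])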

It’s also common to use the square root of the MSE, the Root Mean Square Error (RMSE). This has the advantage that the RMSE is easier to compare to the individual errors, because we don’t normally think easily in terms of squared error. This is the RMSE cost function, another tiny variation.

def calc_rmse(slope):
    predicted_pcv = hgb * slope  # 'hgb' comes from the top level
    errors = pcv - predicted_pcv # 'pcv' comes from the top level
    return np.sqrt(np.mean(errors ** 2))  # Notice square root of mean.
# 2.25 below is the slope value to start the search.
res_rmse = minimize(calc_rmse, 2.25)
res_rmse
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 4.78633060730185
        x: [ 3.047e+00]
      nit: 6
      jac: [ 0.000e+00]
 hess_inv: [[ 2.447e-02]]
     nfev: 14
     njev: 7

The cost function fun value is just the square root of the MSE fun value:

np.sqrt(res_mse.fun)
4.78633060730185

Notice again that minimize on RMSE found exactly the same slope as for MSE and SSE. Why?

Reflect a little, and then read on.

Monotonicity#

SSE, MSE, and RMSE give the same parameters from minimize because they are monotonic with respect to each other. Put another way, the SSE and the RMSE have a monotonic relationship. Put yet another way, the RMSE is a monotonic function of the SSE.

Remember, the Root Mean Square Error (RMSE) is just the square root of the Mean Square Error (MSE), and the MSE is the Sum of Squared Error (SSE) divided by the number of values in the x array (which is the same as the number of values in the y array).
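As a quick numerical check of that relationship, we can rebuild the RMSE from the SSE at our example slope of 2.25, using the cost functions defined above; the two values should match:

# RMSE reconstructed from the SSE (divide by n, then take the square root).
print(np.sqrt(calc_sse(2.25) / len(hgb)))
# RMSE calculated directly by the RMSE cost function.
print(calc_rmse(2.25))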

For example, imagine we have the following set of SSE values:

n_slopes = 10000
some_slopes = np.linspace(2, 4, n_slopes)
SSEs = np.zeros(n_slopes)
for i in np.arange(n_slopes):
    SSEs[i] = calc_sse(some_slopes[i])

plt.scatter(some_slopes, SSEs)
plt.xlabel('Slope')
plt.ylabel('SSE')
plt.title('SSE as a function of slope');
(Figure: SSE as a function of slope.)

We can calculate the matching MSE values by dividing by n (the number of hgb or pcv values).

n = len(hgb)
MSEs = SSEs / n  # Corresponding Mean Square Error

When we plot the MSE values (y-axis) as a function of the SSE values (x-axis), we see a straight-line relationship, as we would expect, with a slope of 1 / n.

plt.scatter(SSEs, MSEs)
plt.xlabel('SSE values')
plt.ylabel('Corresponding MSE values')
plt.title('MSE as a function of SSE');
(Figure: MSE as a function of SSE.)
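The slope of that line should be exactly 1 / n. Here is a quick check that the ratio of each MSE value to its SSE value is indeed 1 / n:

# Every MSE value is the corresponding SSE value divided by n.
np.allclose(MSEs / SSEs, 1 / n)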

This straight line relationship is one example of a strictly increasing monotonic relationship.

Consider any two SSE values, call these S1 and S2. There are corresponding MSE values M1 and M2 (in fact given by S1 / n and S2 / n). The relationship between SSE and MSE is strictly increasing monotonic because, for any values S1 and S2:

  • when S1 > S2 then it is also true that M1 > M2.

  • when S1 < S2 then M1 < M2.

  • when S1 == S2, M1 == M2.

You can see that relationship in the plot above, because the MSE always goes up as the SSE goes up.
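To make that concrete, here is a small check using two of the SSE values calculated above. The indexes 0 and 5000 are arbitrary example choices:

# Two SSE values and their corresponding MSE values.
S1, S2 = SSEs[0], SSEs[5000]
M1, M2 = MSEs[0], MSEs[5000]
# The ordering of the MSE values matches the ordering of the SSE values.
(S1 > S2) == (M1 > M2)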

The same is true of the RMSE values as a function of the SSE values:

RMSEs = np.sqrt(MSEs)  # Corresponding Root Mean Square Error
plt.scatter(SSEs, RMSEs)
plt.xlabel('SSE values')
plt.ylabel('Corresponding RMSE values')
plt.title('RMSE as a function of SSE');
(Figure: RMSE as a function of SSE.)

Notice, again, that the RMSE always goes up as SSE goes up.
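Another way to see the consequence of monotonicity is to ask which of our candidate slopes gives the smallest value of each cost measure. The answer is the same slope in each case:

# The slope giving the smallest value is the same for all three cost measures.
print('Slope minimizing SSE: ', some_slopes[np.argmin(SSEs)])
print('Slope minimizing MSE: ', some_slopes[np.argmin(MSEs)])
print('Slope minimizing RMSE:', some_slopes[np.argmin(RMSEs)])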

Now consider minimize. It is searching over parameter values to minimize the value of some cost function, say the SSE. But you can probably see from the argument above that if we find the parameters that minimize the SSE, those must also be the parameters that minimize the MSE and the RMSE, because once we have found the smallest SSE, that must also correspond to the smallest MSE and the smallest RMSE.

Therefore, we can minimize on the RMSE, and we are guaranteed to get the same parameters, within calculation precision, as we would if we had minimized on the SSE or the MSE.

In the rest of these textbook pages, we will nearly always use RMSE, because it’s a common metric, and one that is relatively easy to interpret as a typical error for a single point.
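For example, to read the RMSE as a typical error for a single point, we can compare the minimized RMSE to the average absolute error at the best-fit slope. The two values are different summaries, but they should be of a similar size:

# Errors at the best-fit slope from the RMSE minimization.
best_slope = res_rmse.x[0]
best_errors = pcv - best_slope * hgb
# RMSE and mean absolute error are both "typical error" summaries.
print('RMSE at best slope: ', res_rmse.fun)
print('Mean absolute error:', np.mean(np.abs(best_errors)))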