Finding lines

In The Mean and Slopes, we were looking for the best slope to predict one vector of values from another vector of values.

Specifically, we wanted our slope to predict the Packed Cell Volume (PCV) values from the Hemoglobin (HGB) values.

By analogy with The Mean as Predictor, we decided to choose our line to minimize the average prediction error, or equivalently, the sum of squared prediction errors.

We found a solution by trying many slopes, and choosing the slope that gave us the smallest error.

For our question, we were happy to assume that the line passed through 0, 0 — meaning, that when the Hemoglobin is 0, the Packed Cell Volume value is 0. Put another way, we assumed that our line had an intercept value of 0. The intercept is the y value at which the line crosses the y axis, or, put another way, the y value when the x value is 0.

What if we are in a situation where we want to find a line that has a non-zero intercept, as well as a slope?

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# Print to 4 decimal places, show tiny values as 0
np.set_printoptions(precision=4, suppress=True)

We return to the students ratings dataset.

This is a dataset, in Excel form, where each row is the average of students' ratings from RateMyProfessors.com across a single subject. Thus, the first row refers to the average of all professors teaching English, the second row refers to all professors teaching Mathematics, and so on.

Download the data file via this link: rate_my_course.csv.

Next we load the data.

# Read the data file
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 3.756147 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803

We are interested in the relationship of the “Overall Quality” measure to the “Easiness” measure.

# Get the columns as arrays for simplicity and speed.
easiness = np.array(ratings['Easiness'])
quality = np.array(ratings['Overall Quality'])

Do students rate easier courses as being of better quality?

plt.plot(easiness, quality, 'o')
plt.xlabel('Easiness')
plt.ylabel('Overall Quality')
Text(0, 0.5, 'Overall Quality')

There might be a straight-line relationship here, but it doesn’t look as if it would go through 0, 0:

# The same plot as above, but showing the x, y origin at 0, 0
plt.plot(easiness, quality, 'o')
plt.xlabel('Easiness')
plt.ylabel('Overall Quality')
# Make sure 0, 0 is on the plot.
plt.axis([0, 3.9, 0, 4.2])
(0.0, 3.9, 0.0, 4.2)

In The Mean and Slopes, we assumed that the intercept was zero, so we only had to try different slopes to get our best line.

Here we have a different problem, because we want to find a line that has an intercept that is not zero, so we need to find the best slope and the best intercept at the same time. How do we search for a slope as well as an intercept?

But wait - why do we have to search for the slope and the intercept at the same time? Can’t we just find the best slope, and then the best intercept?

In fact we can’t do that, because the best slope will change for every intercept.

To see why that is, we need to try a few different lines. To do that, we need to remind ourselves about defining lines, and then testing them.

Remember, we can describe a line with an intercept and a slope. Call the intercept \(c\) and the slope \(s\). A line predicts the \(y\) values from the \(x\) values, using the slope \(s\) and the intercept \(c\):

\[ y = c + x * s \]
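For example, here is a quick sketch of the formula with some toy numbers (the values are arbitrary, just for illustration):

# Toy example of the line formula (arbitrary numbers).
c = 10  # Intercept.
s = 2   # Slope.
x = np.array([1, 2, 3])
y = c + x * s
y  # Gives array([12, 14, 16])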

Let’s start with a guess for the line, just from eyeballing the plot. We guess that:

  • The intercept is around 2.25

  • The slope is around 0.47

The predicted \(y\) values from this line are (from the formula above):

predicted = 2.25 + easiness * 0.47

where easiness contains our actual \(x\) values.

The prediction error at each point comes from the actual \(y\) values minus the predicted \(y\) values.

errors = quality - predicted

where quality contains our actual \(y\) values.

We can look at the predictions for this line (in red), and the actual values (in blue) and then the errors (the lengths of the dotted lines joining the red predictions and the corresponding blue actual values).

# Don't worry about this code.
# It plots the line, and the errors.
x_values = easiness  # The thing we're predicting from, on the x axis
y_values = quality  # The thing we're predicting, on the y axis.
plt.plot(x_values, y_values, 'o')
plt.plot(x_values, predicted, 'o', color='red')
# Draw a line between predicted and actual
for i in np.arange(len(x_values)):
    x = x_values[i]
    y_0 = predicted[i]
    y_1 = y_values[i]
    plt.plot([x, x], [y_0, y_1], ':', color='black', linewidth=1)

The Root Mean Squared Error (RMSE) is:

# Root mean squared error given c and s
rmse_c_s = np.sqrt(np.mean(errors ** 2))
rmse_c_s
0.12027697029219235
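To spell out the steps in that calculation, here is the same RMSE built up one step at a time, re-using the errors array from above:

# The RMSE calculation, one step at a time.
squared_errors = errors ** 2  # Square each error.
mean_squared = np.mean(squared_errors)  # Mean of the squared errors.
np.sqrt(mean_squared)  # Square root of the mean gives the RMSE.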

Actually, those bits of code are so useful, let’s make them into a function:

def plot_with_errors(x_values, y_values, c, s):
    """ Plot a line through data with errors

    Parameters
    ----------
    x_values : array
        Values we are predicting from, for the x-axis of the plot.
    y_values : array
        Values we are predicting, for the y-axis of the plot.
    c : number
        Intercept for predicting line.
    s : number
        Slope for predicting line.

    Returns
    -------
    rmse : number
        The square root of the mean squared error, for the given `x_values`,
        `y_values` and line.
    """
    # Predict the y values from the line.
    predicted = c + s * x_values
    # Errors are the real values minus the predicted values.
    errors = y_values - predicted
    # Plot real values in blue, predicted values in red.
    plt.plot(x_values, y_values, 'o', color='blue')
    plt.plot(x_values, predicted, 'o', color='red')
    # Draw a line between predicted and actual
    for i in np.arange(len(x_values)):
        x = x_values[i]
        y_0 = predicted[i]
        y_1 = y_values[i]
        plt.plot([x, x], [y_0, y_1], ':', color='black', linewidth=1)
    return np.sqrt(np.mean(errors ** 2))

Notice the string at the top of the function, giving details about what the function does, its arguments, and the values it returns. This is called the docstring. It can remind you and other people what the function does, and how to use it. Try making a new cell and typing plot_with_errors?. You'll see this string is the help that Python will fetch for this function.
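If you are not working in a notebook, the built-in help function fetches the same docstring:

# Another way to see the docstring, in any Python environment.
help(plot_with_errors)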

Now the same thing with the function:

plot_with_errors(easiness, quality, 2.25, 0.47)
0.12027697029219235

If we try a different intercept, we’ll get a different line, and a different error. Let’s try an intercept of 2.1:

plot_with_errors(easiness, quality, 2.1, 0.47)
0.18316435491606842

Or, we could go back to the same intercept, but try a different slope, and we’d get a different error. Let’s try a slope of 0.5:

plot_with_errors(easiness, quality, 2.25, 0.5)
0.16024697296024795

Now we use the long slow method to find the best slope for our initial intercept of 2.25. You may recognize the following from the mean and slopes notebook.

First we make a function, a bit like the function above, that gives us the error for any given intercept (c) and slope (s) like this:

def calc_rmse_for_c_s(c, s):
    predicted_quality = c + easiness * s
    errors = quality - predicted_quality
    return np.sqrt(np.mean(errors ** 2))

We have already calculated the error for the original guess at slope and intercept, but let’s do it again for practice:

# RMSE for our initial guessed line.
predicted = 2.25 + easiness * 0.47
error = quality - predicted
np.sqrt(np.mean(error ** 2))
0.12027697029219235

Check that our function gives the same value for the same intercept and slope:

calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235

OK, now we use this function to try lots of different slopes, for this intercept, and see which slope gives us the lowest error. See the mean and slopes notebook for the first time we did this.

some_slopes = np.arange(-2, 2, 0.001)
slope_df = pd.DataFrame()
slope_df['slope_to_try'] = some_slopes
slope_df
slope_to_try
0 -2.000
1 -1.999
2 -1.998
3 -1.997
4 -1.996
... ...
3995 1.995
3996 1.996
3997 1.997
3998 1.998
3999 1.999

4000 rows × 1 columns

As in the mean and slopes notebook, we make an array to hold the RMSE results for each slope.

# Number of slopes.
n_slopes = len(some_slopes)
# Array to store the RMSE for each slope.
rmses = np.zeros(n_slopes)

Then (as before) we calculate and store the RMSE value for each slope, with this given intercept.

# Try all these slopes for an intercept of 2.25
for i in np.arange(n_slopes):  # For each candidate slope.
    # Get the corresponding slope.
    this_slope = some_slopes[i]
    # Calculate the error measure for this slope and intercept 2.25.
    this_error = calc_rmse_for_c_s(2.25, this_slope)
    # Put the error into the results array at the corresponding position.
    rmses[i] = this_error

# Put all the RMSE scores into the DataFrame as a column for display.
slope_df['RMSE'] = rmses
slope_df
slope_to_try RMSE
0 -2.000 7.885813
1 -1.999 7.882617
2 -1.998 7.879421
3 -1.997 7.876225
4 -1.996 7.873028
... ... ...
3995 1.995 4.886684
3996 1.996 4.889880
3997 1.997 4.893075
3998 1.998 4.896271
3999 1.999 4.899466

4000 rows × 2 columns
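As an aside, a list comprehension can build the same rmses array more compactly. This is just an equivalent sketch; rmses_again is a name we chose here:

# The same RMSE values, via a list comprehension.
rmses_again = np.array([calc_rmse_for_c_s(2.25, s) for s in some_slopes])
# Check this matches the loop results above.
assert np.allclose(rmses, rmses_again)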

We plot the errors we got for each slope, and find the slope giving the smallest error:

plt.plot(slope_df['slope_to_try'], slope_df['RMSE'])
plt.xlabel('Candidate slopes')
plt.ylabel('Root mean squared error')
Text(0, 0.5, 'Root mean squared error')

Find the row corresponding to the smallest RMSE:

# Row label corresponding to minimum value.
row_with_min = slope_df['RMSE'].idxmin()
slope_df.loc[row_with_min]
slope_to_try    0.46700
RMSE            0.11982
Name: 2467, dtype: float64
# Smallest RMSE for intercept 2.25
slope_df.loc[row_with_min, 'RMSE']
0.119820093530566
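We can fetch the matching best slope with the same row label. This one-line sketch gives the 0.467 value shown in the row display above:

# Slope giving the smallest RMSE for intercept 2.25.
slope_df.loc[row_with_min, 'slope_to_try']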

OK — that’s the best slope for an intercept of 2.25. How about our other suggestion, of an intercept of 2.1? Let’s try that:

# Try all candidate slopes for an intercept of 2.1.
# We will reuse "rmses" and "slope_df" for simplicity.
for i in np.arange(n_slopes):  # For each candidate slope.
    # Get the corresponding slope.
    this_slope = some_slopes[i]
    # Calculate the error measure for this slope and intercept 2.1.
    this_error = calc_rmse_for_c_s(2.1, this_slope)
    # Put the error into the results array at the corresponding position.
    rmses[i] = this_error

# Put all the RMSE scores into the DataFrame as a column for display.
slope_df['RMSE'] = rmses
slope_df
slope_to_try RMSE
0 -2.000 8.035281
1 -1.999 8.032085
2 -1.998 8.028889
3 -1.997 8.025693
4 -1.996 8.022497
... ... ...
3995 1.995 4.737225
3996 1.996 4.740420
3997 1.997 4.743616
3998 1.998 4.746811
3999 1.999 4.750007

4000 rows × 2 columns

# Recalculate row holding minimum RMSE
row_with_min_for_2p1 = slope_df['RMSE'].idxmin()
slope_df.loc[row_with_min_for_2p1]
slope_to_try    0.513000
RMSE            0.119317
Name: 2513, dtype: float64

Oh dear - the best slope has changed. And, in general, for any intercept, you may be able to see that the best slope will be different, as the slope tries to adjust for the stuff that the intercept does not explain.
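To see this shift directly, here is a quick sketch that brute-forces the best slope for a few intercepts (the example intercept values are arbitrary), using the calc_rmse_for_c_s function from above:

# Find the best candidate slope for a few different intercepts.
for inter in [2.0, 2.1, 2.25]:  # Arbitrary example intercepts.
    errs = np.array([calc_rmse_for_c_s(inter, s) for s in some_slopes])
    print('Intercept', inter, 'has best slope', some_slopes[np.argmin(errs)])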

This means we can’t just find the intercept, and find the best slope for that intercept, at least not in our case - we have to find the best pair of intercept and slope. This is the pair, of all possible pairs, that gives the lowest error.

Our task then, is to find the pair of values — c and s — such that we get the smallest possible value for the root mean squared error above.

One way of doing this, is to try every possible plausible pair of intercept and slope, calculate the error for this pair, and then find the pair that gave the lowest error.

We are now searching over many pairs of slopes and intercepts.

Here are some candidate intercepts to try:

# Some intercepts to try
some_inters = np.arange(1, 3.2, 0.01)
inter_df = pd.DataFrame()
inter_df['intercept_to_try'] = some_inters
inter_df
intercept_to_try
0 1.00
1 1.01
2 1.02
3 1.03
4 1.04
... ...
215 3.15
216 3.16
217 3.17
218 3.18
219 3.19

220 rows × 1 columns

We would like to calculate RMSE values for all combinations of slopes and intercepts. That is, we want to get RMSE values for all slopes for the first intercept value (1), then RMSE values for all slopes for the second intercept value (1.01), and so on.

We have:

print('Number of slopes to try:', n_slopes)
n_inters = len(some_inters)
print('Number of intercepts to try:', n_inters)
n_pairs = n_slopes * n_inters
print('Number of pairs to try:', n_pairs)
Number of slopes to try: 4000
Number of intercepts to try: 220
Number of pairs to try: 880000

To do this, we could make a new DataFrame with one row per intercept, slope pair. The first column, inter_to_try, will have the intercept values, and the second, slope_to_try, will have the slope values.

Let's repeat each intercept value n_slopes times, using np.repeat. As you remember, np.repeat repeats each element in the array a given number of times:

# Repeat each element in the array 3 times.
np.repeat([1, 2, 3], 3)
array([1, 1, 1, 2, 2, 2, 3, 3, 3])

To make the column of intercepts we repeat each element of the intercepts array n_slopes times:

pair_df = pd.DataFrame()
repeated_inters = np.repeat(some_inters, n_slopes)
# Show in DataFrame
pair_df['inter_to_try'] = repeated_inters
pair_df
inter_to_try
0 1.00
1 1.00
2 1.00
3 1.00
4 1.00
... ...
879995 3.19
879996 3.19
879997 3.19
879998 3.19
879999 3.19

880000 rows × 1 columns

To make the slope_to_try column, we need to repeat the whole some_slopes array n_inters times. Notice that np.repeat repeats each element a given number of times. To repeat the whole array, we can use np.tile, like this:

# Repeat the whole array 3 times.
np.tile([1, 2, 3], 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])

To repeat the whole some_slopes array n_inters times:

tiled_slopes = np.tile(some_slopes, n_inters)
# Show in DataFrame
pair_df['slope_to_try'] = tiled_slopes
pair_df
inter_to_try slope_to_try
0 1.00 -2.000
1 1.00 -1.999
2 1.00 -1.998
3 1.00 -1.997
4 1.00 -1.996
... ... ...
879995 3.19 1.995
879996 3.19 1.996
879997 3.19 1.997
879998 3.19 1.998
879999 3.19 1.999

880000 rows × 2 columns

The rows of this DataFrame (and the arrays that went into them) define all the possible intercept, slope pairs.
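As an aside, NumPy's np.meshgrid can build the same grid of pairs in one call. This is just an equivalent alternative sketch, not what we use below:

# An equivalent way to build all the pairs, using np.meshgrid.
inter_grid, slope_grid = np.meshgrid(some_inters, some_slopes, indexing='ij')
# Flattening gives one element per (intercept, slope) pair, in the same order.
assert np.allclose(inter_grid.ravel(), repeated_inters)
assert np.allclose(slope_grid.ravel(), tiled_slopes)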

We next make another array to contain the RMSE values for each pair.

# An array to store the RMSE values for each intercept, slope pair.
pair_rmses = np.zeros(n_pairs)

We can use these arrays to go through each slope, intercept pair in turn, calculate the RMSE, and store it for later use.

# Go through each pair to calculate the corresponding RMSE.
# (Or, equivalently, go through the values for each row in the `pair_df`
# DataFrame).
for i in np.arange(n_pairs):
    # Get the intercept for this pair (intercept at this position).
    this_intercept = repeated_inters[i]
    # Get the slope for this pair (slope at this position).
    this_slope = tiled_slopes[i]
    # Calculate the error measure.
    this_error = calc_rmse_for_c_s(this_intercept, this_slope)
    # Put the error into the RMSE results array at this position.
    pair_rmses[i] = this_error

# Add the RMSE column to the original DataFrame for display.
pair_df['RMSE'] = pair_rmses
pair_df
inter_to_try slope_to_try RMSE
0 1.00 -2.000 9.131916
1 1.00 -1.999 9.128720
2 1.00 -1.998 9.125524
3 1.00 -1.997 9.122328
4 1.00 -1.996 9.119132
... ... ... ...
879995 3.19 1.995 5.823930
879996 3.19 1.996 5.827125
879997 3.19 1.997 5.830321
879998 3.19 1.998 5.833516
879999 3.19 1.999 5.836712

880000 rows × 3 columns

For each of the n_inters possible intercepts, we have tried all n_slopes possible slopes, and we have found the RMSE for each intercept, slope pair. We can now find the pair that gives the lowest RMSE.

# The lowest error that we found for any slope, intercept pair
min_row_label = pair_df['RMSE'].idxmin()
best_row = pair_df.loc[min_row_label]
best_row
inter_to_try    2.120000
slope_to_try    0.507000
RMSE            0.119304
Name: 450507, dtype: float64
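Equivalently, we could have skipped the DataFrame and searched the raw arrays with np.argmin. A sketch that finds the same pair:

# Position of the smallest RMSE in the raw results array.
min_pos = np.argmin(pair_rmses)
# The corresponding intercept and slope.
print(repeated_inters[min_pos], tiled_slopes[min_pos])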

Notice that the error (RMSE) for this pair is lower than the error we found for our guessed c and s:

calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235
# The intercept corresponding to the lowest error
best_intercept = best_row['inter_to_try']
best_slope = best_row['slope_to_try']
print('Best intercept, slope pair:', best_intercept, best_slope)
Best intercept, slope pair: 2.120000000000001 0.5069999999997239
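The tiny floating-point fuzz in those printed values comes from building the candidate arrays with np.arange. If we want, np.round can tidy them for display (display only; the stored values are unchanged):

# Round for display; this does not change the stored values.
print('Best intercept, slope pair:',
      np.round(best_intercept, 3), np.round(best_slope, 3))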

Plot the data, predictions and errors for the line that minimizes the sum of squared error:

plot_with_errors(easiness, quality, best_intercept, best_slope)
0.11930386031906048

Now you know about optimization, you will not be surprised to discover that Scipy minimize can also do the search for the intercept and slope pair for us. We send minimize the function we are trying to minimize, and a starting guess for the intercept and slope.

minimize is a little fussy about the functions it will use. It insists that all the parameters need to be passed in as a single argument. In our case, we need to pass both parameters (the intercept and slope) as one value, containing two elements, like this:

def calc_rmse_for_minimize(c_s):
    # c_s has two elements, the intercept c and the slope s.
    c = c_s[0]
    s = c_s[1]
    predicted_quality = c + easiness * s
    errors = quality - predicted_quality
    return np.sqrt(np.mean(errors ** 2))

This is the form of the function that minimize can use. See using minimize for more detail.

We first confirm this gives us the same answer we got before from our function with two arguments:

# The original function
calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235
# The function in the form that minimize wants
# The two parameters go into a list, that we can pass as a single argument.
calc_rmse_for_minimize([2.25, 0.47])
0.12027697029219235

As usual with minimize we need to give a starting guess for the intercept and slope. We will start with our initial guess of [2.25, 0.47], but any reasonable guess will do.

from scipy.optimize import minimize
minimize(calc_rmse_for_minimize, [2.25, 0.47])
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.11930057876222537
        x: [ 2.115e+00  5.089e-01]
      nit: 4
      jac: [-2.442e-06  1.345e-06]
 hess_inv: [[ 1.773e+01 -5.527e+00]
            [-5.527e+00  1.735e+00]]
     nfev: 24
     njev: 8

Notice that minimize doesn’t get exactly the same result as we found with the long slow way above. This is because the long slow way only tested intercepts that were step-sizes of 0.01 apart (and slopes that were 0.001 apart). minimize can use much smaller step-sizes, and so finds a more accurate answer.
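If we only want the fitted intercept and slope, they are in the x attribute of the result that minimize returns. A small sketch, where res is just a name we chose:

# Store the result, then unpack the best-fit parameters.
res = minimize(calc_rmse_for_minimize, [2.25, 0.47])
best_c, best_s = res.x
print('minimize intercept:', best_c)
print('minimize slope:', best_s)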

We won’t spend any time justifying this, but this is also the answer we get from traditional fitting of the least-squares line, as implemented, for example, in the Scipy linregress function:

from scipy.stats import linregress
linregress(easiness, quality)
LinregressResult(slope=0.5088641203940006, intercept=2.114798571643314, rvalue=0.7459725836595872, pvalue=1.5998303978986516e-14, stderr=0.053171209477256925, intercept_stderr=0.1699631063847937)

Notice the values for slope and intercept in the output above.
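As a final check, we can feed the linregress intercept and slope back into our own error function. A sketch, where result is just a name we chose; the RMSE should match the minimize value above:

# RMSE for the traditional least-squares line.
result = linregress(easiness, quality)
calc_rmse_for_c_s(result.intercept, result.slope)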