Finding lines

In The Mean and Slopes, we were looking for the best slope to predict one vector of values from another vector of values.

Specifically, we wanted our slope to predict the Packed Cell Volume (PCV) values from the Hemoglobin (HGB) values.

By analogy with The Mean as Predictor, we decided to choose our line to minimize the average prediction error, as measured by the sum of squared prediction errors.

We found a solution by trying many slopes, and choosing the slope that gave us the smallest error.

For our question, we were happy to assume that the line passed through 0, 0 — meaning, that when the Hemoglobin is 0, the Packed Cell Volume value is 0. Put another way, we assumed that our line had an intercept value of 0. The intercept is the y value at which the line crosses the y axis, or, put another way, the y value when the x value is 0.

What if we are in a situation where we want to find a line that has a non-zero intercept, as well as a slope?

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# Print to 4 decimal places, show tiny values as 0
np.set_printoptions(precision=4, suppress=True)

We return to the student ratings dataset.

This is a dataset where each row is the average of students’ ratings from RateMyProfessors.com across a single subject. Thus, the first row refers to the average of all professors teaching English, the second row refers to all professors teaching Mathematics, and so on.

Download the data file: rate_my_course.csv.

Next we load the data.

# Read the data file
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
    Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
0      English                 23343  3.756147     3.821866         3.791364  3.162754
1  Mathematics                 22394  3.487379     3.641526         3.566867  3.063322
2      Biology                 11774  3.608331     3.701530         3.657641  2.710459
3   Psychology                 11179  3.909520     3.887536         3.900949  3.316210
4      History                 11145  3.788818     3.753642         3.773746  3.053803

We are interested in the relationship of the “Overall Quality” measure to the “Easiness” measure.

# Convert Easiness and Overall Quality measures to arrays.
easiness = np.array(ratings['Easiness'])
quality = np.array(ratings['Overall Quality'])

Do students rate easier courses as being of better quality?

plt.plot(easiness, quality, 'o')
plt.xlabel('Easiness')
plt.ylabel('Overall Quality')
Text(0, 0.5, 'Overall Quality')

There might be a straight-line relationship here, but it doesn’t look as if it would go through 0, 0:

# The same plot as above, but showing the x, y origin at 0, 0
plt.plot(easiness, quality, 'o')
plt.xlabel('Easiness')
plt.ylabel('Overall Quality')
# Make sure 0, 0 is on the plot.
plt.axis([0, 3.9, 0, 4.2])
(0.0, 3.9, 0.0, 4.2)

In The Mean and Slopes, we assumed that the intercept was zero, so we only had to try different slopes to get our best line.

Here we have a different problem, because we want to find a line that has an intercept that is not zero, so we need to find the best slope and the best intercept at the same time. How do we search for a slope as well as an intercept?

But wait - why do we have to search for the slope and the intercept at the same time? Can’t we just find the best slope, and then the best intercept?

In fact we can’t do that, because the best slope will change for every intercept.

To see why that is, we need to try a few different lines. To do that, we need to remind ourselves about defining lines, and then testing them.

Remember, we can describe a line with an intercept and a slope. Call the intercept \(c\) and the slope \(s\). A line predicts the \(y\) values from the \(x\) values, using the slope \(s\) and the intercept \(c\):

\[ y = c + x * s \]
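To check we have understood the formula, we can try it with some made-up numbers:

# A quick check of the line formula, with made-up example numbers.
c = 2  # A hypothetical intercept.
s = 0.5  # A hypothetical slope.
x = 3
# The line predicts y = 2 + 3 * 0.5 = 3.5 for this x value.
y = c + x * s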

Let’s start with a guess for the line, just from eyeballing the plot. We guess that:

  • The intercept is around 2.25

  • The slope is around 0.47

The predicted \(y\) values from this line are (from the formula above):

predicted = 2.25 + easiness * 0.47

where easiness contains our actual \(x\) values.

The prediction error at each point comes from the actual \(y\) values minus the predicted \(y\) values.

errors = quality - predicted

where quality contains our actual \(y\) values.

We can look at the predictions for this line (in red), and the actual values (in blue) and then the errors (the lengths of the dotted lines joining the red predictions and the corresponding blue actual values).

# Don't worry about this code, it's just to plot the line, and the errors.
x_values = easiness  # The thing we're predicting from, on the x axis
y_values = quality  # The thing we're predicting, on the y axis.
plt.plot(x_values, y_values, 'o')
plt.plot(x_values, predicted, 'o', color='red')
# Draw a line between predicted and actual
for i in np.arange(len(x_values)):
    x = x_values[i]
    y_0 = predicted[i]
    y_1 = y_values[i]
    plt.plot([x, x], [y_0, y_1], ':', color='black', linewidth=1)

The Root Mean Squared Error (RMSE) is:

# Root mean squared error given c and s
rmse_c_s = np.sqrt(np.mean(errors ** 2))
rmse_c_s
0.12027697029219235
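In case the one-liner above is hard to read, here is the same calculation again, one step at a time, using the errors array from above:

# The same RMSE calculation, one step at a time.
squared_errors = errors ** 2  # Square each prediction error.
mean_squared = np.mean(squared_errors)  # Take the mean of the squares.
rmse_again = np.sqrt(mean_squared)  # Square root of that mean.
rmse_again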

Actually, those bits of code are so useful, let’s make them into a function:

def plot_with_errors(x_values, y_values, c, s):
    """ Plot a line through data with errors

    Parameters
    ----------
    x_values : array
        Values we are predicting from, for the x-axis of the plot.
    y_values : array
        Values we are predicting, for the y-axis of the plot.
    c : number
        Intercept for predicting line.
    s : number
        Slope for predicting line.

    Returns
    -------
    rmse : number
        The square root of the mean squared error, for the given `x_values`,
        `y_values` and line.
    """
    # Predict the y values from the line.
    predicted = c + s * x_values
    # Errors are the real values minus the predicted values.
    errors = y_values - predicted
    # Plot real values in blue, predicted values in red.
    plt.plot(x_values, y_values, 'o', color='blue')
    plt.plot(x_values, predicted, 'o', color='red')
    # Draw a line between predicted and actual
    for i in np.arange(len(x_values)):
        x = x_values[i]
        y_0 = predicted[i]
        y_1 = y_values[i]
        plt.plot([x, x], [y_0, y_1], ':', color='black', linewidth=1)
    return np.sqrt(np.mean(errors ** 2))

Notice the string at the top of the function, giving details about what the function does, its arguments, and the values it returns. This is called the docstring. It can remind you and other people what the function does, and how to use it. Try making a new cell and typing plot_with_errors?. You’ll see that this string is the help that Python will fetch for this function.
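If you are working outside Jupyter, where the ? suffix is not available, the built-in help function fetches the same docstring:

# The standard Python way of showing a function's docstring.
help(plot_with_errors)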

Now the same thing with the function:

plot_with_errors(easiness, quality, 2.25, 0.47)
0.12027697029219235

If we try a different intercept, we’ll get a different line, and a different error. Let’s try an intercept of 2.1:

plot_with_errors(easiness, quality, 2.1, 0.47)
0.18316435491606842

Or, we could go back to the same intercept, but try a different slope, and we’d get a different error. Let’s try a slope of 0.5:

plot_with_errors(easiness, quality, 2.25, 0.5)
0.16024697296024795

Now we use the long slow method to find the best slope for our initial intercept of 2.25. You may recognize the following from the mean and slopes notebook.

First we make a function, a bit like the function above, that gives us the error for any given intercept (c) and slope (s) like this:

def calc_rmse_for_c_s(c, s):
    predicted_quality = c + easiness * s
    errors = quality - predicted_quality
    return np.sqrt(np.mean(errors ** 2))

We have already calculated the error for the original guess at slope and intercept, but let’s do it again for practice:

# RMSE for our initial guessed line.
predicted = 2.25 + easiness * 0.47
error = quality - predicted
np.sqrt(np.mean(error ** 2))
0.12027697029219235

Check that our function gives the same value for the same intercept and slope:

calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235

OK, now we use this function to try lots of different slopes, for this intercept, and see which slope gives us the lowest error. See the mean and slopes notebook for the first time we did this.

# Some slopes to try.
some_slopes = np.arange(-2, 2, 0.001)
n_slopes = len(some_slopes)
print('Number of slopes to try:', n_slopes)
# The first 10 slopes to try:
some_slopes[:10]
Number of slopes to try: 4000
array([-2.   , -1.999, -1.998, -1.997, -1.996, -1.995, -1.994, -1.993,
       -1.992, -1.991])
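As an aside: np.arange with a fractional step can pick up tiny floating point rounding errors; you will see traces of this in slope values below, such as 0.4669999999997283. If that bothers you, np.linspace builds an equivalent array of candidates from the number of values, rather than the step size:

# An equivalent array of 4000 candidate slopes, built with np.linspace.
# Here we give the first value, the last value, and the number of values.
alt_slopes = np.linspace(-2, 1.999, 4000)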
# Try all these slopes for an intercept of 2.25
# For each slope, calculate and record the root mean squared error
rmses = np.zeros(n_slopes)
for i in np.arange(n_slopes):
    slope = some_slopes[i]
    this_error = calc_rmse_for_c_s(2.25, slope)
    # Record the error measure in error array.
    rmses[i] = this_error
min_pos = np.argmin(rmses)
least_error = rmses[min_pos]
best_slope_for_2p25 = some_slopes[min_pos]

Now plot the errors we got for each slope, and find the slope giving the smallest error:

plt.plot(some_slopes, rmses)
plt.xlabel('Candidate slopes')
plt.ylabel('RMSE')
Text(0, 0.5, 'RMSE')
print('Best slope for intercept of', 2.25, 'is', best_slope_for_2p25)
print('Best slope for intercept', 2.25, 'gives error', least_error)
Best slope for intercept of 2.25 is 0.4669999999997283
Best slope for intercept 2.25 gives error 0.119820093530566

That code also looks useful, so let’s make some of that code into a function we can reuse:

def best_slope_for_intercept(intercept, some_slopes):
    """ Calculate best slope, lowest error for a given intercept

    Parameters
    ----------
    intercept : number
        Intercept.
    some_slopes : array
        Array of candidate slope values to try.

    Returns
    -------
    best_slope : float
        Slope from `some_slopes` that results in lowest error.
    """
    n_slopes = len(some_slopes)
    # Try all these slopes, calculating and recording the RMSE for each.
    rmses = np.zeros(n_slopes)
    for i in np.arange(n_slopes):
        slope = some_slopes[i]
        this_error = calc_rmse_for_c_s(intercept, slope)
        rmses[i] = this_error
    min_pos = np.argmin(rmses)
    best_slope = some_slopes[min_pos]
    return best_slope
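Notice that the function returns only the winning slope, and throws away the error it found; in the code below we recalculate that error with calc_rmse_for_c_s. If you would rather keep both values, a small variant (just a sketch) could return the slope together with its error:

def best_slope_error_for_intercept(intercept, some_slopes):
    """ Best slope, and its RMSE, for a given intercept (a variant sketch). """
    best_slope = best_slope_for_intercept(intercept, some_slopes)
    # Recalculate the error for the winning slope.
    best_error = calc_rmse_for_c_s(intercept, best_slope)
    return best_slope, best_error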

Now use the function to find the best slope:

# The best slope for intercept 2.25
best_for_2p25 = best_slope_for_intercept(2.25, some_slopes)
best_for_2p25
0.4669999999997283

OK — that’s the best slope for an intercept of 2.25. How about our other suggestion, of an intercept of 2.1? Let’s try that:

# The best slope for an intercept of 2.1.
best_for_2p1 = best_slope_for_intercept(2.1, some_slopes)
best_for_2p1
0.5129999999997232

Oh dear - the best slope has changed. And, in general, for any intercept, you may be able to see that the best slope will be different, as the slope tries to adjust for the stuff that the intercept does not explain.

This means we can’t just find the intercept, and find the best slope for that intercept, at least not in our case - we have to find the best pair of intercept and slope. This is the pair, of all possible pairs, that gives the lowest error.

Our task then, is to find the pair of values, \(c\) and \(s\), such that we get the smallest possible value for the root mean squared error above.

One way of doing this is to try every plausible pair of intercept and slope, calculate the error for each pair, and then find the pair that gave the lowest error.

We are now searching over many combinations of slopes and intercepts.
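The most direct version of that idea is a pair of nested loops, one over candidate intercepts and one over candidate slopes. Here is a sketch of that brute-force approach; below we do the same search in two stages instead, finding the best slope for each intercept first:

# Brute force sketch: try every (intercept, slope) pair directly.
candidate_intercepts = np.arange(1, 3.2, 0.01)
lowest_so_far = np.inf
for intercept in candidate_intercepts:
    for slope in some_slopes:
        this_error = calc_rmse_for_c_s(intercept, slope)
        if this_error < lowest_so_far:
            # Record the lowest error, and the pair that gave it.
            lowest_so_far = this_error
            best_pair = (intercept, slope)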

For example, say we were interested in trying the intercepts 2, 2.1, 2.2. Then we’d run the routine above for each intercept, to find the best slope for each:

best_2p0 = best_slope_for_intercept(2.0, some_slopes)
# Calculate error for this pair.
best_2p0_error = calc_rmse_for_c_s(2.0, best_2p0)
print('Best slope, error for 2.0 is ', best_2p0, best_2p0_error)
best_2p1 = best_slope_for_intercept(2.1, some_slopes)
best_2p1_error = calc_rmse_for_c_s(2.1, best_2p1)
print('Best slope, error for 2.1 is ', best_2p1, best_2p1_error)
best_2p2 = best_slope_for_intercept(2.2, some_slopes)
best_2p2_error = calc_rmse_for_c_s(2.2, best_2p2)
print('Best slope, error for 2.2 is ', best_2p2, best_2p2_error)
Best slope, error for 2.0 is  0.5449999999997197 0.11967782241242624
Best slope, error for 2.1 is  0.5129999999997232 0.11931655896393074
Best slope, error for 2.2 is  0.48199999999972665 0.1195095856510923

From this we conclude that, of the intercepts we have tried, 2.1 is the best, because we could get the lowest error score with that intercept. If this were all we had, we would choose an intercept of 2.1, and its matching best slope of 0.513.

To find out if this is really the best we can do, we can try many intercepts. For each intercept, we find the best slope, with the lowest error. Then we choose the intercept for which we can get the lowest error, and find the best slope for that intercept.

# Some intercepts to try
some_intercepts = np.arange(1, 3.2, 0.01)
n_intercepts = len(some_intercepts)
print('Number of intercepts to try:', n_intercepts)
# First 10 intercepts to try
print('First 10 intercepts', some_intercepts[:10])
Number of intercepts to try: 220
First 10 intercepts [1.   1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09]

For each of the 220 possible intercepts, we try all 4000 possible slopes, to find the slope giving the lowest error for that intercept; that is 220 * 4000 = 880,000 pairs in all. We store the best slope, and the best error, for each intercept, so we can choose the best intercept after we have finished.

# An array to collect the best slope found for each intercept.
best_slopes = np.zeros(n_intercepts)
# An array to collect the lowest error found for each intercept.
# This is the error associated with the matching slope above.
lowest_errors = np.zeros(n_intercepts)
# Cycle through each intercept, finding the best slope, and lowest error.
for i in np.arange(n_intercepts):
    # Intercept to try
    intercept = some_intercepts[i]
    # Find best slope
    best_slope = best_slope_for_intercept(intercept, some_slopes)
    # Calculate the error for this best_slope, intercept pair.
    lowest_error = calc_rmse_for_c_s(intercept, best_slope)
    # Store the best_slope and error
    best_slopes[i] = best_slope
    lowest_errors[i] = lowest_error
print('First 10 intercepts:\n', some_intercepts[:10])
print('Best slopes for first 10 intercepts:\n', best_slopes[:10])
print('Lowest errors for first 10 intercepts:\n', lowest_errors[:10])
First 10 intercepts:
 [1.   1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09]
Best slopes for first 10 intercepts:
 [0.856 0.853 0.85  0.847 0.844 0.841 0.838 0.835 0.831 0.828]
Lowest errors for first 10 intercepts:
 [0.1504 0.1499 0.1494 0.1489 0.1484 0.1479 0.1475 0.147  0.1465 0.146 ]
# Plot the lowest error for each intercept
plt.plot(some_intercepts, lowest_errors)
plt.xlabel('Intercepts')
plt.ylabel('Lowest error for intercept')
plt.title('Lowest error for each intercept')
Text(0.5, 1.0, 'Lowest error for each intercept')
# The lowest error we found for any intercept:
print('Least error', np.min(lowest_errors))
Least error 0.11930386031906048

Notice that this error is lower than the error we found for our guessed c and s:

calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235

We can go back and get the corresponding intercept and slope.

# The intercept corresponding to the lowest error
min_pos = np.argmin(lowest_errors)
best_intercept = some_intercepts[min_pos]
best_intercept
2.120000000000001
# The slope giving the lowest error, for this intercept
best_slope = best_slopes[min_pos]
best_slope
0.5069999999997239

Plot the data, predictions and errors for the line that minimizes the root mean squared error:

plot_with_errors(easiness, quality, best_intercept, best_slope)
0.11930386031906048

Now you know about optimization, you will not be surprised to discover that Scipy minimize can also do the search for the intercept and slope pair for us. We send minimize the function we are trying to minimize, and a starting guess for the intercept and slope.

minimize is a little fussy about the functions it will use. It insists that all the parameters be passed in as a single argument. In our case, we need to pass both parameters (the intercept and slope) as one value, containing two elements, like this:

def calc_rmse_for_minimize(c_s):
    # c_s has two elements, the intercept c and the slope s.
    c = c_s[0]
    s = c_s[1]
    predicted_quality = c + easiness * s
    errors = quality - predicted_quality
    return np.sqrt(np.mean(errors ** 2))
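By the way, Python's sequence unpacking gives a slightly more compact way of writing the first two lines of that function. This version behaves identically:

def calc_rmse_for_minimize_v2(c_s):
    # Unpack the intercept and slope in one line.
    c, s = c_s
    predicted_quality = c + easiness * s
    errors = quality - predicted_quality
    return np.sqrt(np.mean(errors ** 2))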

This is the form of the function that minimize can use. See using minimize for more detail.

We first confirm this gives us the same answer we got before from our function with two arguments:

# The original function
calc_rmse_for_c_s(2.25, 0.47)
0.12027697029219235
# The function in the form that minimize wants
# The two parameters go into a list, that we can pass as a single argument.
calc_rmse_for_minimize([2.25, 0.47])
0.12027697029219235

As usual with minimize we need to give a starting guess for the intercept and slope. We will start with our initial guess of [2.25, 0.47], but any reasonable guess will do.

from scipy.optimize import minimize
minimize(calc_rmse_for_minimize, [2.25, 0.47])
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.11930057876222537
        x: [ 2.115e+00  5.089e-01]
      nit: 4
      jac: [-2.442e-06  1.345e-06]
 hess_inv: [[ 1.773e+01 -5.527e+00]
            [-5.527e+00  1.735e+00]]
     nfev: 24
     njev: 8
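The interesting values in this output are x, the (intercept, slope) pair that minimize found, and fun, the corresponding RMSE. We can store the result that minimize returns, and fetch these by attribute access:

# Store the result of minimize, and fetch the attributes of interest.
res = minimize(calc_rmse_for_minimize, [2.25, 0.47])
best_c, best_s = res.x  # The intercept, slope pair minimize found.
lowest_rmse = res.fun  # The error for that pair.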

Notice that minimize doesn’t get exactly the same result as we found with the long slow way above. This is because the long slow way only tested intercepts that were step-sizes of 0.01 apart (and slopes that were 0.001 apart). minimize can use much smaller step-sizes, and so finds a more accurate answer.

We won’t spend any time justifying this, but this is also the answer we get from traditional fitting of the least-squares line (minimizing the RMSE picks out the same line as minimizing the sum of squared errors, because taking the square root does not change which pair gives the smallest value), as implemented, for example, in the Scipy linregress function:

from scipy.stats import linregress
linregress(easiness, quality)
LinregressResult(slope=0.5088641203940006, intercept=2.114798571643314, rvalue=0.7459725836595872, pvalue=1.5998303978986516e-14, stderr=0.053171209477256925, intercept_stderr=0.1699631063847937)

Notice the values for slope and intercept in the output above.
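If we want just those two numbers, the result that linregress returns supports attribute access, as you can see from the field names in the output:

# Fetch just the slope and intercept from the linregress result.
result = linregress(easiness, quality)
print('linregress intercept:', result.intercept)
print('linregress slope:', result.slope)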