2D arrays, images, optimization

Contents

2D arrays, images, optimization#

Here we use a 2D array to look at the search for the best (least-squares) line for a simple regression.

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt

We return again to the income and prestige data from Duncan [Dun61].

df = pd.read_csv('data/Duncan_Occupational_Prestige.csv')
# Select the first 15 rows (occupations).
top_15 = df.head(15)
top_15
name type income education prestige
0 accountant prof 62 86 82
1 pilot prof 72 76 83
2 architect prof 75 92 90
3 author prof 55 90 76
4 chemist prof 64 86 90
5 minister prof 21 84 87
6 professor prof 64 93 93
7 dentist prof 80 100 90
8 reporter wc 67 87 52
9 engineer prof 72 86 88
10 undertaker prof 42 74 57
11 lawyer prof 76 98 89
12 physician prof 76 97 97
13 welfare.worker prof 41 84 59
14 teacher prof 48 91 73

We will be doing a regression predicting 'prestige' as a function of 'income'.

x = top_15['income']
y = top_15['prestige']

Plot these observations.

plt.scatter(x, y)
plt.xlabel('Income')
plt.ylabel('Prestige')
Text(0, 0.5, 'Prestige')
_images/e930383a24ff649b64ecf4c5df547ceb161c98461649688f4b1596daca4cc06a.png

Start with a guess for the slope and intercept of a line to predict the y ('prestige') values from the x ('income') values.

b = 0.7  # Guessed slope.
c = 30  # Guessed intercept.

To find the best least-squares slope, we first need a function to calculate the Root Mean Squared Error (RMSE) for a given slope and intercept, and given x and y vectors.

We’ve chosen RMSE as the cost (or loss) function here, but we could equally have chosen the sum of squared errors (SSE) or the mean squared error (MSE); these measures go up and down together, so when we have found the slope and intercept minimizing the RMSE, that same slope and intercept will minimize SSE and MSE.

def rmse_for_b_and_c(b, c, x, y):
    """ Root mean squared error for slope `b` intercept `c`
    """
    y_hat = b * x + c
    e = y - y_hat
    e2 = e ** 2
    return np.sqrt(np.mean(e2))

We calculate the RMSE for our guessed slope and intercept:

rmse_for_b_and_c(b, c, x, y)
np.float64(15.398809477791893)

Just to make sure, we calculated the RMSE again, outside the function, to make sure we’re returning the correct result:

y_hat = b * x + c
e = y - y_hat
np.sqrt(np.mean(e ** 2))
np.float64(15.398809477791893)

We are now going to do a brute-force search for the best RMSE slope an intercept.

A brute-force search is where we search a wide range of slope and intercept pairs and select the pair that gives the lowest RMSE.

Alternative methods are using scipy.optimize.minimize to select sensible slopes and intercepts to try for us, or to use the mathematical short-cuts of routines like scipy.statistics.linregress. These last will work for our case — looking for the slope, intercept minimizing the squared error (or RMSE in our case), but we can’t use them for other calculations of the error.

First we select some slopes to try, based on looking at the scatterplot above.

# Make 250 slopes by taking equally spaced values starting at 0
# and ending at 0.5.
candidate_slopes = np.linspace(0, 0.5, 250)
candidate_slopes[:10]
array([0.        , 0.00200803, 0.00401606, 0.0060241 , 0.00803213,
       0.01004016, 0.01204819, 0.01405622, 0.01606426, 0.01807229])
# Make 250 intercepts by taking equally spaced values starting at 40
# and ending at 70.
candidate_inters = np.linspace(40, 70, 250)
candidate_inters[:10]
array([40.        , 40.12048193, 40.24096386, 40.36144578, 40.48192771,
       40.60240964, 40.72289157, 40.84337349, 40.96385542, 41.08433735])

These are the number of candidate slopes and intercepts. In our case, the np.linspace calls above dictate these numbers will be 250.

n_slopes = len(candidate_slopes)
n_inters = len(candidate_inters)

Make a two-dimensional array to store the RMSE values corresponding to each slope / intercept pair.

rmses = np.zeros([n_slopes, n_inters])
rmses.shape
(250, 250)

We are next going to start using Python’s range to make sequences of integer values in our for loops.

Up until now, we have been using Numpys arange to make sequential arrays of numbers, and feeding these into our for loops:

np.arange(1, 11)
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
# The old way
for i in np.arange(1, 11):
    print(i)
1
2
3
4
5
6
7
8
9
10

Now we are going to switch to the more common Python idiom of using range in our for loops, instead of np.arange. range takes the same arguments as does np.arange — start, stop and step — but returns a range object.

my_r = range(1, 11)
my_r
range(1, 11)
type(my_r)
range
len(my_r)
10

my_r here is an object that refers to the sequence of numbers 1, 2, … 10 — but does not directly contain this sequence.

We can force the range object to create and return the corresponding numbers using e.g. list. list asks the range object for it’s sequence, and range returns them one by one, from which list compiles the list.

list(my_r)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We will use range in place of np.arange as the right-hand-side of for loops — providing the sequence from which we will fill the loop variable. In the next cell, the loop variable is v:

for v in range(5):
    print(v)
0
1
2
3
4

Back to the problem. We loop through each slope in turn. For each slope value we loop through each intercept value in turn. For the given slope and intercept, we calculate the RMSE value, and store it in corresponding location in the rmses array, where the row position comes from the position of the slope value, and the column position comes from the position of the intercept value.

for slope_no in range(n_slopes):
    b = candidate_slopes[slope_no]  # Get the corresponding slope.
    for inter_no in range(n_inters):
        c = candidate_inters[inter_no]  # Get the corresponding intercept.
        rmse = rmse_for_b_and_c(b, c, x, y)  # Calculate RMSE.
        rmses[slope_no, inter_no] = rmse  # Store in rmses array.

If we plt.imshow the rmses array, we see a dark valley corresponding to low values of RMSE for the corresponding slope, intercept pairs. Bright areas are for much higher RMSEs.

plt.imshow(rmses, cmap='gray')
plt.ylabel('Index for slopes')
plt.xlabel('Index for intercepts')
Text(0.5, 0, 'Index for intercepts')
_images/02615b2382e1df60d2c09925985b4f166ed100b1c677fec610043abbab8b0178.png

Next we want to find the slope and intercept corresponding to the minimum RMSE value we found in the rmses array.

To do this, we are going to need np.where. It can tell us the location of values in terms of index position.

Consider a one-dimensional array like this:

an_arr = np.array([1, 32, 99, 4, 2])
an_arr
array([ 1, 32, 99,  4,  2])

We can make a Boolean array with True where the values are equal to 99, and False otherwise:

an_arr == 99
array([False, False,  True, False, False])

np.where calculates where the True value(s) are, in terms of index position. In our case, the 99 value is at position (index) 3 in the array:

np.where(an_arr == 99)
(array([2]),)

We can apply the same logic to two-dimensions.

First we find the minimum RMSE value over the whole array:

min_rmse = np.min(rmses)
min_rmse
np.float64(12.24410485663457)

Next we calculate a 2D Boolean array, of the same shape as rmses, but having True in the element corresponding the minimum value, and False otherwise.

is_min = rmses == min_rmse
is_min[:10, :10]
array([[False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False]])

Again, plt.imshow will show us the single True value as a white dot. All the other values in this array are black (False).

plt.imshow(is_min, cmap='gray', interpolation=None)
<matplotlib.image.AxesImage at 0x7f5f5fcf2050>
_images/4cb1fdddab481a6e9feb028e69991c109e66614cfda89f460a59e6968c8ffc0b.png

Back to np.where. If we call this on the is_min 2D Boolean array, it returns the row and column indices of the single True value in the array. Notice the use of unpacking to unpack the two returned arrays (row coordinates and column coordinates).

min_row, min_col = np.where(is_min)
min_row, min_col
(array([185]), array([147]))

Now we know the row position of the minimum, we can work back to the corresponding slope value:

best_slope = candidate_slopes[min_row]
best_slope
array([0.37148594])

Similarly, for the corresponding intercept value:

best_inter = candidate_inters[min_col]
best_inter
array([57.71084337])

Let’s run the mathematical calculation to find the best slope and intercept.

import scipy.stats as sps
res = sps.linregress(x, y)
res
LinregressResult(slope=np.float64(0.37105943152454773), intercept=np.float64(57.7653746770026), rvalue=np.float64(0.4376748771939141), pvalue=np.float64(0.10276532267999378), stderr=np.float64(0.21141912223374978), intercept_stderr=np.float64(13.336173440465936))

Notice that the mathematical (Scipy) calculation differs slightly from our brute-force search values. This is because our brute-force search didn’t try every possible slope and intercept value, only the values we got from our np.linspace calls above.

As we’d expect though, the RMSE from the Scipy’s high-precision estimate is a tiny-bit lower than our low-precision np.linspace estimate:

rmse_for_b_and_c(best_slope, best_inter, x, y)
np.float64(12.24410485663457)
rmse_for_b_and_c(res.slope, res.intercept, x, y)
np.float64(12.244069738260368)

References#

Dun61

Otis Dudley Duncan. A socioeconomic index for all occupations. In Albert J Reiss, Otis Dudley Duncan, Paul K Hatt, and Douglass Cecil North, editors, Occupations and social status, pages 109–161. Free Press, New York, 1961.