Training and testing#

Hide code cell content
# HIDDEN
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import pandas as pd
pd.set_option('mode.copy_on_write', True)
Hide code cell content
# HIDDEN
def standard_units(x):
    return (x - np.mean(x))/np.std(x)
Hide code cell content
# HIDDEN
def distance(point1, point2):
    """The distance between two arrays of numbers."""
    return np.sqrt(np.sum((point1 - point2)**2))

def all_distances(training, point):
    """The distance between p (an array of numbers) and the numbers in row i of attribute_table."""
    attributes = training.drop(columns='Class')
    def distance_from_point(row):
        return distance(point, np.array(row))
    return attributes.apply(distance_from_point, axis=1)

def table_with_distances(training, point):
    """A copy of the training table with the distance from each row to array p."""
    out = training.copy()
    out['Distance'] = all_distances(training, point)
    return out

def closest(training, point, k):
    """A table containing the k closest rows in the training table to array p."""
    with_dists = table_with_distances(training, point)
    sorted_by_distance = with_dists.sort_values('Distance')
    topk = sorted_by_distance.iloc[np.arange(k)]
    return topk

def majority(topkclasses):
    """1 if the majority of the class values are 1s, 0 otherwise."""
    ones = np.count_nonzero(topkclasses == 1)
    zeros = np.count_nonzero(topkclasses == 0)
    if ones > zeros:
        return 1
    else:
        return 0

def classify(training, p, k):
    """Classify an example with attributes p using k-nearest neighbor classification with the given training table."""
    closestk = closest(training, p, k)
    return majority(closestk['Class'])
Hide code cell content
# HIDDEN
def classify_grid(training, test, k):
    n_test = len(test)
    c = np.zeros(n_test)
    for i in range(n_test):
        # Run the classifier on the ith patient in the test set
        c[i] = classify(training, np.array(test.iloc[i]), k)
    return c

Again - if you are running on your laptop, you should download the ckd.csv file to the same directory as this notebook.

Hide code cell content
# HIDDEN
ckd_full = pd.read_csv('ckd.csv')
ckd = pd.DataFrame()
ckd['Hemoglobin'] = standard_units(ckd_full['Hemoglobin'])
ckd['Glucose'] = standard_units(ckd_full['Blood Glucose Random'])
ckd['WBC'] = standard_units(ckd_full['White Blood Cell Count'])
ckd['Class'] = ckd_full['Class']
ckd['Color'] = 'darkblue'
ckd.loc[ckd['Class'] == 0, 'Color'] = 'gold'

How good is our nearest neighbor classifier? To answer this we’ll need to find out how frequently our classifications are correct. If a patient has chronic kidney disease, how likely is our classifier to pick that up?

If the patient is in our training set, we can find out immediately. We already know what class the patient is in. So we can just compare our prediction and the patient’s true class.

But the point of the classifier is to make predictions for new patients not in our training set. We don’t know what class these patients are in but we can make a prediction based on our classifier. How to find out whether the prediction is correct?

One way is to wait for further medical tests on the patient and then check whether or not our prediction agrees with the test results. With that approach, by the time we can say how likely our prediction is to be accurate, it is no longer useful for helping the patient.

Instead, we will try our classifier on some patients whose true classes are known. Then, we will compute the proportion of the time our classifier was correct. This proportion will serve as an estimate of the proportion of all new patients whose class our classifier will accurately predict. This is called testing.

Overly Optimistic “Testing”#

The training set offers a very tempting set of patients on whom to test out our classifier, because we know the class of each patient in the training set.

But let’s be careful … there will be pitfalls ahead if we take this path. An example will show us why.

Suppose we use a 1-nearest neighbor classifier to predict whether a patient has chronic kidney disease, based on glucose and white blood cell count.

ckd.plot.scatter('WBC', 'Glucose', c=ckd['Color']);
../_images/05ef53a008d742e98027d0901f838d42c142f5c28c231ac39133c56a7238e1b0.png

Earlier, we said that we expect to get some classifications wrong, because there’s some intermingling of blue and gold points in the lower-left.

But what about the points in the training set, that is, the points already on the scatter? Will we ever mis-classify them?

The answer is no. Remember that 1-nearest neighbor classification looks for the point in the training set that is nearest to the point being classified. Well, if the point being classified is already in the training set, then its nearest neighbor in the training set is itself! And therefore it will be classified as its own color, which will be correct because each point in the training set is already correctly colored.

In other words, if we use our training set to “test” our 1-nearest neighbor classifier, the classifier will pass the test 100% of the time.

Mission accomplished. What a great classifier!

No, not so much. A new point in the lower-left might easily be mis-classified, as we noted earlier. “100% accuracy” was a nice dream while it lasted.

The lesson of this example is not to use the training set to test a classifier that is based on it.

Generating a Test Set#

In earlier chapters, we saw that random sampling could be used to estimate the proportion of individuals in a population that met some criterion. Unfortunately, we have just seen that the training set is not like a random sample from the population of all patients, in one important respect: Our classifier guesses correctly for a higher proportion of individuals in the training set than it does for individuals in the population.

When we computed confidence intervals for numerical parameters, we wanted to have many new random samples from a population, but we only had access to a single sample. We solved that problem by taking bootstrap resamples from our sample.

We will use an analogous idea to test our classifier. We will create two samples out of the original training set, use one of the samples as our training set, and the other one for testing.

So we will have three groups of individuals:

  • a training set on which we can do any amount of exploration to build our classifier;

  • a separate testing set on which to try out our classifier and see what fraction of times it classifies correctly;

  • the underlying population of individuals for whom we don’t know the true classes; the hope is that our classifier will succeed about as well for these individuals as it did for our testing set.

How to generate the training and testing sets? You’ve guessed it – we’ll select at random.

There are 158 individuals in ckd. Let’s use a random half of them for training and the other half for testing. To do this, we’ll shuffle all the rows, take the first 79 as the training set, and the remaining 79 for testing.

n_rows = len(ckd)
shuffled_ckd = ckd.sample(n_rows, replace=False)
training = shuffled_ckd.iloc[:79]
testing = shuffled_ckd.iloc[79:]

Now let’s construct our classifier based on the points in the training sample:

training.plot.scatter('WBC', 'Glucose', c=training['Color'])
plt.xlim(-2, 6)
plt.ylim(-2, 6);
../_images/be4638ff156770277c0eecd82e7992c9dfba04b891d817c46c1ee4ef24bf690e.png

We get the following classification regions and decision boundary:

Hide code cell content
# HIDDEN
x_vals = np.arange(-2, 6.1, 0.25)
n_x = len(x_vals)
y_vals = np.arange(-2, 6.1, 0.25)
n_y = len(y_vals)

# Make the vector of x and corresponding y coordinates to cover
# the whole 2D grid.  For the expert way, see Numpy meshgrid.

# Repeat each value of the x coordinate n_y times
x_coords = np.repeat(x_vals, n_y)
# Repeat the whole y_vals vector n_x times
y_coords = np.tile(y_vals, n_x)

test_grid = pd.DataFrame()
test_grid['WBC'] = x_coords
test_grid['Glucose'] = y_coords
Hide code cell content
# HIDDEN
c = classify_grid(training[['WBC', 'Glucose', 'Class']],
                  test_grid,
                  1)
../_images/33a268603b53c4d0940ba58328847ff6b42fd1eedf4cc9529cf797b5e14f3f4f.png

Place the test data on this graph and you can see at once that while the classifier got almost all the points right, there are some mistakes. For example, some blue points of the test set fall in the gold region of the classifier.

../_images/248f428356be7ca2616f369bf9a2862721aece74b2f74baa2b62b2d3b624e9c0.png

Some errors notwithstanding, it looks like the classifier does fairly well on the test set. Assuming that the original sample was drawn randomly from the underlying population, the hope is that the classifier will perform with similar accuracy on the overall population, since the test set was chosen randomly from the original sample.

Note

This page has content from the Training_and_Testing notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.