Hide code cell content
# Don't change this cell; just run it.
import numpy as np
import pandas as pd
# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make the plots look more fancy.
plt.style.use('fivethirtyeight')

The Bootstrap#

Note

This page has content from the Bootstrap notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.

A data scientist is using the data in a random sample to estimate an unknown parameter. She uses the sample to calculate the value of a statistic that she will use as her estimate.

Once she has calculated the observed value of her statistic, she could just present it as her estimate and go on her merry way. But she’s a data scientist. She knows that her random sample is just one of numerous possible random samples, and thus her estimate is just one of numerous plausible estimates.

By how much could those estimates vary? To answer this, it appears as though she needs to draw another sample from the population, and compute a new estimate based on the new sample. But she doesn’t have the resources to go back to the population and draw another sample.

It looks as though the data scientist is stuck.

Fortunately, a brilliant idea called the bootstrap can help her out. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called resampling: the new samples are drawn at random from the original sample.

In this section, we will see how and why the bootstrap works. In the rest of the chapter, we will use the bootstrap for inference.

Employee Compensation in the City of San Francisco#

SF OpenData is a website where the City and County of San Francisco make some of their data publicly available. One of the data sets contains compensation data for employees of the City. These include medical professionals at City-run hospitals, police officers, fire fighters, transportation workers, elected officials, and all other employees of the City.

Compensation data for the calendar year 2015 are in the table sf2015.

sf2015 = pd.read_csv('san_francisco_2015.csv')
sf2015.head()
Year Type Year Organization Group Code Organization Group Department Code Department Union Code Union Job Family Code Job Family ... Employee Identifier Salaries Overtime Other Salaries Total Salary Retirement Health/Dental Other Benefits Total Benefits Total Compensation
0 Calendar 2015 2 Public Works, Transportation & Commerce WTR PUC Water Department 21.0 Prof & Tech Engineers - Miscellaneous, Local 21 2400 Lab, Pharmacy & Med Techs ... 21538 82146.04 0.00 0.00 82146.04 16942.21 12340.88 6337.73 35620.82 117766.86
1 Calendar 2015 2 Public Works, Transportation & Commerce DPW General Services Agency - Public Works 12.0 Carpet, Linoleum and Soft Tile Workers, Local 12 7300 Journeyman Trade ... 5459 32165.75 973.19 848.96 33987.90 0.00 4587.51 2634.42 7221.93 41209.83
2 Calendar 2015 4 Community Health DPH Public Health 790.0 SEIU - Miscellaneous, Local 1021 1600 Payroll, Billing & Accounting ... 41541 71311.00 5757.98 0.00 77068.98 14697.59 12424.50 6370.06 33492.15 110561.13
3 Calendar 2015 4 Community Health DPH Public Health 351.0 Municipal Executive Association - Miscellaneous 0900 Management ... 26718 28430.25 0.00 763.07 29193.32 0.00 4223.14 5208.51 9431.65 38624.97
4 Calendar 2015 2 Public Works, Transportation & Commerce MTA Municipal Transportation Agency 790.0 SEIU - Miscellaneous, Local 1021 8200 Protection & Apprehension ... 45810 7948.75 0.00 0.00 7948.75 0.00 2873.17 616.24 3489.41 11438.16

5 rows × 22 columns

There is one row for each of 42,979 employees. There are numerous columns containing information about City departmental affiliation and details of the different parts of the employee’s compensation package. Here is the row corresponding to the late Mayor Ed Lee

sf2015[sf2015['Job'] == 'Mayor']
Year Type Year Organization Group Code Organization Group Department Code Department Union Code Union Job Family Code Job Family ... Employee Identifier Salaries Overtime Other Salaries Total Salary Retirement Health/Dental Other Benefits Total Benefits Total Compensation
3335 Calendar 2015 6 General Administration & Finance MYR Mayor 556.0 Elected Officials 1100 Administrative & Mgmt (Unrep) ... 22433 288963.55 0.0 0.0 288963.55 58117.03 12424.5 20292.95 90834.48 379798.03

1 rows × 22 columns

We are going to study the final column, Total Compensation. That’s the employee’s salary plus the City’s contribution towards his/her retirement and benefit plans.

Financial packages in a calendar year can sometimes be hard to understand as they depend on the date of hire, whether the employee is changing jobs within the City, and so on. For example, the lowest values in the Total Compensation column look a little strange.

sf2015.sort_values('Total Compensation').head(10)
Year Type Year Organization Group Code Organization Group Department Code Department Union Code Union Job Family Code Job Family ... Employee Identifier Salaries Overtime Other Salaries Total Salary Retirement Health/Dental Other Benefits Total Benefits Total Compensation
27308 Calendar 2015 1 Public Protection FIR Fire Department 798.0 Firefighters - Miscellaneous, Local 798 H000 Fire Services ... 43833 0.0 0.0 0.00 0.00 0.0 0.00 -423.76 -423.76 -423.76
15746 Calendar 2015 4 Community Health DPH Public Health 790.0 SEIU - Miscellaneous, Local 1021 9900 Public Service Aide ... 27871 -292.4 0.0 0.00 -292.40 0.0 -95.58 -22.63 -118.21 -410.61
24576 Calendar 2015 1 Public Protection JUV Juvenile Probation 790.0 SEIU - Miscellaneous, Local 1021 8300 Correction & Detention ... 10517 0.0 0.0 0.00 0.00 0.0 0.00 -159.12 -159.12 -159.12
42982 Calendar 2015 6 General Administration & Finance CPC City Planning 21.0 Prof & Tech Engineers - Miscellaneous, Local 21 1000 Information Systems ... 18961 0.0 0.0 0.00 0.00 0.0 0.00 -26.53 -26.53 -26.53
23310 Calendar 2015 6 General Administration & Finance CPC City Planning 21.0 Prof & Tech Engineers - Miscellaneous, Local 21 5200 Professional Engineering ... 19387 0.0 0.0 0.00 0.00 0.0 0.00 -9.51 -9.51 -9.51
27855 Calendar 2015 2 Public Works, Transportation & Commerce PUC PUC Public Utilities Commission 21.0 Prof & Tech Engineers - Miscellaneous, Local 21 1000 Information Systems ... 28988 0.0 0.0 0.00 0.00 0.0 0.00 -3.10 -3.10 -3.10
17663 Calendar 2015 1 Public Protection JUV Juvenile Probation 39.0 Stationary Engineers, Local 39 7300 Journeyman Trade ... 19125 0.0 0.0 0.00 0.00 0.0 0.00 -0.01 -0.01 -0.01
34350 Calendar 2015 1 Public Protection ECD Department of Emergency Management 351.0 Municipal Executive Association - Miscellaneous 0900 Management ... 30025 0.0 0.0 0.00 0.00 0.0 0.00 0.00 0.00 0.00
40184 Calendar 2015 7 General City Responsibilities UNA General Fund Unallocated 790.0 SEIU - Miscellaneous, Local 1021 3200 Recreation ... 49784 0.0 0.0 0.00 0.00 0.0 0.00 1.27 1.27 1.27
16641 Calendar 2015 4 Community Health DPH Public Health 250.0 SEIU - Health Workers, Local 1021 2600 Dietary & Food ... 26768 0.0 0.0 2.21 2.21 0.0 0.00 0.17 0.17 2.38

10 rows × 22 columns

For clarity of comparison, we will focus our attention on those who had at least the equivalent of a half-time job for the whole year. At a minimum wage of about $10 per hour, and 20 hours per week for 52 weeks, that’s a salary of about $10,000.

sf2015 = sf2015[sf2015['Salaries'] > 10000]
len(sf2015)
36569

Population and Parameter#

Let this table of just over 36,500 rows be our population. Here is a histogram of the total compensations.

sf_bins = np.arange(0, 700000, 25000)
sf2015['Total Compensation'].plot.hist(bins=sf_bins)
plt.xticks(rotation = 35);
../_images/041d1539d5d0abf966068a89982a7f7b408c751b9534be5d22b4cacd9eaf2973.png

While most of the values are below $300,000, a few are quite a bit higher. For example, the total compensation of the Chief Investment Officer was almost $650,000. That is why the horizontal axis stretches to $700,000.

sf2015.sort_values('Total Compensation', ascending=False).head(2)
Year Type Year Organization Group Code Organization Group Department Code Department Union Code Union Job Family Code Job Family ... Employee Identifier Salaries Overtime Other Salaries Total Salary Retirement Health/Dental Other Benefits Total Benefits Total Compensation
19177 Calendar 2015 6 General Administration & Finance RET Retirement System 351.0 Municipal Executive Association - Miscellaneous 1100 Administrative & Mgmt (Unrep) ... 46881 507831.6 0.00 0.00 507831.60 105052.98 12424.5 23566.16 141043.64 648875.24
13194 Calendar 2015 6 General Administration & Finance ADM General Services Agency - City Admin 164.0 Physicians and Dentists - Miscellaneous 2500 Med Therapy & Auxiliary ... 1016 279311.1 3829.36 114433.58 397574.04 56211.64 12424.5 14299.10 82935.24 480509.28

2 rows × 22 columns

Now let the parameter be the median of the total compensations.

Since we have the luxury of having all of the data from the population, we can simply calculate the parameter:

compensation = sf2015['Total Compensation']
pop_median = np.median(compensation)
pop_median
110305.79

The median total compensation of all employees was just over $110,300.

From a practical perspective, there is no reason for us to draw a sample to estimate this parameter since we simply know its value. But in this section we are going to pretend we don’t know the value, and see how well we can estimate it based on a random sample.

In later sections, we will come down to earth and work in situations where the parameter is unknown. For now, we are all-knowing.

A Random Sample and an Estimate#

Let us draw a sample of 500 employees at random without replacement, and let the median total compensation of the sampled employees serve as our estimate of the parameter.

# Sample size 500 from the compensation Series, without replacement.
our_sample = compensation.sample(500, replace=False)
our_sample.plot.hist(bins=sf_bins)
plt.xticks(rotation = 35);
../_images/88d80d536f742359b9f6b50fa0d2c821f52c061b268b6d4f2c08939434695fc6.png
est_median = np.median(our_sample)
est_median
111821.69

The sample size is large. By the law of averages, the distribution of the sample resembles that of the population, and consequently the sample median is not very far from the population median (though of course it is not exactly the same).

So now we have one estimate of the parameter. But had the sample come out differently, the estimate would have had a different value. We would like to be able to quantify the amount by which the estimate could vary across samples. That measure of variability will help us measure how accurately we can estimate the parameter.

To see how different the estimate would be if the sample had come out differently, we could just draw another sample from the population, but that would be cheating. We are trying to mimic real life, in which we won’t have all the population data at hand.

Somehow, we have to get another random sample without sampling from the population.

The Bootstrap: Resampling from the Sample#

What we do have is a large random sample from the population. As we know, a large random sample is likely to resemble the population from which it is drawn. This observation allows data scientists to lift themselves up by their own bootstraps: the sampling procedure can be replicated by sampling from the sample.

Here are the steps of the bootstrap method for generating another random sample that resembles the population:

  • Treat the original sample as if it were the population.

  • Draw from the sample, at random with replacement, the same number of times as the original sample size.

It is important to resample the same number of times as the original sample size. The reason is that the variability of an estimate depends on the size of the sample. Since our original sample consisted of 500 employees, our sample median was based on 500 values. To see how different the sample could have been, we have to compare it to the median of other samples of size 500.

If we drew 500 times at random without replacement from our sample of size 500, we would just get the same sample back. By drawing with replacement, we create the possibility for the new samples to be different from the original, because some employees might be drawn more than once and others not at all.

Why is this a good idea? By the law of averages, the distribution of the original sample is likely to resemble the population, and the distributions of all the “resamples” are likely to resemble the original sample. So the distributions of all the resamples are likely to resemble the population as well.

A Resampled Median#

We will use the sample method of the Series to take the new sample with replacement.

We want a sample the same size as the original sample (500), where we are sampling with replacement.

Here is one new sample drawn from the original sample, with replacement, and the corresponding sample median.

resample_1 = our_sample.sample(500, replace=True)
resample_1
9376     143322.30
7911     132618.37
39468    212434.39
31659     64376.25
40612    201471.64
           ...    
41968     65011.35
29337    118400.70
37218    122570.79
35502     53327.36
41820    166393.44
Name: Total Compensation, Length: 500, dtype: float64
resample_1.plot.hist(bins=sf_bins)
plt.xticks(rotation = 35);
../_images/4343cac2bbdff5e27eb867b5b60318fec8b88f9cafc24b1622b15b078a4da3e5.png
resampled_median_1 = np.median(resample_1)
resampled_median_1
112491.255

By resampling, we have something like another estimate of the population median. By resampling again and again, we will get many such estimates, and hence an empirical distribution of the estimates. We can call this distribution a sampling distribution of the median - it is the distribution of the median values we get from taking many samples. Our samples are bootstrap samples, so this is the bootstrap sampling distribution of the median.

resample_2 = our_sample.sample(500, replace=True)
resampled_median_2 = np.median(resample_2)
resampled_median_2
110888.59

Bootstrap Empirical Distribution of the Sample Median#

Let us define a function bootstrap_median that takes our original sample, and the number of bootstrap samples we want to take, and returns an array of the corresponding resampled medians.

Each time we resample and find the median, we replicate the bootstrap process. So the number of bootstrap samples will be called the number of replications.

def bootstrap_median(original_sample, replications):
    """Return array of bootstrapped sample medians

    Parameters
    ----------
    original_sample: Series
        Series containing the original sample
    replications: number
        number of bootstrap samples

    Returns
    -------
    samp_meds : array
        Array of bootstrapped sample medians.
    """
    n = len(original_sample)
    medians = np.zeros(replications)
    for i in np.arange(replications):
        bootstrap_sample = original_sample.sample(n, replace=True)
        medians[i] = np.median(bootstrap_sample)
    return medians

We now replicate the bootstrap process 5,000 times. The array bstrap_medians contains the medians of all 5,000 bootstrap samples. Notice that the code takes longer to run than our previous code. It has a lot of resampling to do!

bstrap_medians = bootstrap_median(our_sample, 5000)

Here is the histogram of the 5000 medians. The red dot is the population parameter: it is the median of the entire population, which we happen to know but did not use in the bootstrap process.

resampled_medians = pd.DataFrame()
resampled_medians['Bootstrap Sample Median'] = bstrap_medians

resampled_medians.plot.hist()

plt.scatter(pop_median, 0, color='red', s=30);
plt.title('Bootstrap sampling distribution of median')
plt.xticks(rotation = 35);
../_images/5f775fa95b5386fad43dec3d5f54fa812a636f66686dd13db15bd249a1dcc842.png

It is important to remember that the red dot is fixed: it is $110,305.79, the population median. The empirical histogram is the result of random draws, and will be situated randomly relative to the red dot.

Remember also that the point of all these computations is to estimate the population median, which is the red dot. Our estimates are all the randomly generated sampled medians whose histogram you see above. We want those estimates to contain the parameter – it they don’t, then they are off.

Do the Estimates Capture the Parameter?#

How often does the empirical histogram of the resampled medians sit firmly over the red dot, and not just brush the dot with its tails? To answer this, we must define “sit firmly”. Let’s take that to mean “the middle 95% of the resampled medians contains the red dot”.

Here are the two ends of the “middle 95%” interval of resampled medians:

left = np.percentile(bstrap_medians, 2.5)
left
102846.14
right = np.percentile(bstrap_medians, 97.5)
right
117030.735

In fact we can get both left and right percentiles in one call, like this:

np.percentile(bstrap_medians, [2.5, 97.5])
array([102846.14 , 117030.735])

We can use unpacking in the usual way, to set our left and right variables:

left, right = np.percentile(bstrap_medians, [2.5, 97.5])
print('Left', left)
print('Right', right)
Left 102846.14
Right 117030.735

The population median of $110,305 is between these two numbers, left and right. We show the interval and the population median on the histogram below.

resampled_medians.hist()

plt.plot([left, right], [0, 0], color='yellow', lw=8, zorder=1)
plt.scatter(pop_median, 0, color='red', s=200, zorder=2)
plt.xticks(rotation = 35);
../_images/69791cd7736d8b7eee8d258f9604ca401618d40b6bf6a80d9a1d26e26a754edc.png

The “middle 95%” interval of estimates captured the parameter in our example. But was that a fluke?

To see how frequently the interval contains the parameter, we have to run the entire process over and over again. Specifically, we will repeat the following process 100 times:

  • Draw an original sample of size 500 from the population.

  • Carry out 5,000 replications of the bootstrap process and generate the “middle 95%” interval of resampled medians.

As usual we start by doing one trial, where we take one sample, and generate one left and one right interval. This is the code we have been using above.

# One trial, generating a left and right interval.
# Take a new sample.
this_sample = compensation.sample(500, replace=True)
# Get the bootstrap sampling distribution of the median.
these_medians = bootstrap_median(this_sample, 5000)
# Calculate the left, right ends.
left_end, right_end = np.percentile(these_medians, [2.5, 97.5])
print('Left', left_end)
print('Right', right_end)
Left 103615.15
Right 115365.95137499996

We will repeat this trial procedure 100 times, to give with 100 left and right intervals, and count how many of these intervals contain the population median.

Spoiler alert: The statistical theory of the bootstrap says that the number should be around 95. It may be in the low 90s or high 90s, but not much farther off 95 than that.

# THE BIG SIMULATION: This one can take several minutes.

# Set up to make 100 left and right intervals
n_intervals = 100
left_ends = np.zeros(n_intervals)
right_ends = np.zeros(n_intervals)

for i in np.arange(100):
    # One trial, generating a left and right interval.
    # Take a new sample.
    this_sample = compensation.sample(500, replace=True)
    # Get the bootstrap sampling distribution of the median.
    these_medians = bootstrap_median(this_sample, 5000)
    # Calculate the left, right ends.
    left_end, right_end = np.percentile(these_medians, [2.5, 97.5])
    # Store the results for this trial.
    left_ends[i] = left_end
    right_ends[i] = right_end

# Put interval ends into own data frame.
intervals = pd.DataFrame()
intervals['Left'] = left_ends
intervals['Right'] = right_ends

For each of the 100 replications, we get one (left, right) interval of estimates of the median.

intervals
Left Right
0 105718.920000 117254.26000
1 109548.790000 116720.91000
2 98879.600000 108725.96000
3 106637.950000 117715.17500
4 103770.830000 116670.68500
... ... ...
95 101206.635000 112782.99000
96 100469.410000 112519.62000
97 108956.835000 121942.23000
98 98006.126625 110011.19500
99 109616.755000 119568.31325

100 rows × 2 columns

The good intervals are those that contain the parameter we are trying to estimate. Typically the parameter is unknown, but in this section we happen to know what the parameter is.

pop_median
110305.79

How many of the 100 intervals contain the population median? That’s the number of intervals where the left end is below the population median and the right end is above.

inside = np.logical_and(left_ends < pop_median, right_ends > pop_median)
intervals[inside]
Left Right
0 105718.920 117254.26000
1 109548.790 116720.91000
3 106637.950 117715.17500
4 103770.830 116670.68500
6 103359.820 113926.80000
... ... ...
94 101395.155 113536.83000
95 101206.635 112782.99000
96 100469.410 112519.62000
97 108956.835 121942.23000
99 109616.755 119568.31325

94 rows × 2 columns

It takes a few minutes to construct all the intervals, but try it again if you have the patience. Most likely, about 95 of the 100 intervals will be good ones: they will contain the parameter.

It’s hard to show you all the intervals on the horizontal axis as they have large overlaps – after all, they are all trying to estimate the same parameter. The graphic below shows each interval on the same axes by stacking them vertically. The vertical axis is simply the number of the replication from which the interval was generated.

The red line is where the parameter is. Good intervals cover the parameter; there are about 95 of these, typically.

If an interval doesn’t cover the parameter, it’s a dud. The duds are the ones where you can see “daylight” around the red line. There are very few of them – about 5, typically – but they do happen.

Any method based on sampling has the possibility of being off. The beauty of methods based on random sampling is that we can quantify how often they are likely to be off.

Hide code cell content
plt.figure(figsize=(8,8))
for i in np.arange(len(intervals)):
    ends = intervals.iloc[i]
    plt.plot(ends, [i, i], color='gold')
plt.plot([pop_median, pop_median], [0, 100],
         color='red', lw=2)
plt.xlabel('Median (dollars)')
plt.ylabel('Replication')
plt.title('Population Median and Intervals of Estimates');
../_images/3171fe897a430021562c4dec97bafacd12481d7e25e13674f83263cd9bc2490f.png

To summarize what the simulation shows, suppose you are estimating the population median by the following process:

  • Draw a large random sample from the population.

  • Bootstrap your random sample and get an estimate from the new random sample.

  • Repeat the above step thousands of times, and get thousands of estimates.

  • Pick off the “middle 95%” interval of all the estimates.

That gives you one interval of estimates. Now if you repeat the entire process 100 times, ending up with 100 intervals, then about 95 of those 100 intervals will contain the population parameter.

In other words, this process of estimation captures the parameter about 95% of the time.

You can replace 95% by a different value, as long as it’s not 100. Suppose you replace 95% by 80% and keep the sample size fixed at 500. Then your intervals of estimates will be shorter than those we simulated here, because the “middle 80%” is a smaller range than the “middle 95%”. Only about 80% of your intervals will contain the parameter.

Note

This page has content from the Bootstrap notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.