Missing values


# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
pd.set_option('mode.copy_on_write', True)

If you are running on your laptop, you should download the gender_stats_min.csv file to the same directory as this notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data.head()
country_name country_code gdp_us_billion mat_mort_ratio population
0 Aruba ABW NaN NaN 0.103744
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
4 Andorra AND 3.197538 NaN 0.079547
# Get the GDP values as a Pandas Series
gdp = gender_data['gdp_us_billion']
gdp.head()
0           NaN
1     19.961015
2    111.936542
3     12.327586
4      3.197538
Name: gdp_us_billion, dtype: float64

Missing values and NaN

Looking at the values of gdp (and therefore, the values of the gdp_us_billion column of gender_data), we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, or missing data.
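As a minimal sketch of where NaN values can come from, the cell below reads a tiny CSV from a string rather than a file. The text in csv_text and the names small and value_missing are made up for illustration; they are not part of the gender statistics dataset.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has no entry for 'value'.
csv_text = "name,value\nA,1.5\nB,\n"
small = pd.read_csv(io.StringIO(csv_text))
# Pandas fills the empty field with NaN.
print(small)
# The isna method gives True where a value is missing.
value_missing = small['value'].isna()
print(value_missing)
```

The empty field in the second row comes back as NaN, in the same way as the missing GDP value for Aruba above.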

Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.

np.median(gdp)
nan

Depending on your Numpy version, you may also see a warning about an invalid value.

Numpy recognizes that one or more values are NaN and refuses to guess what to do when calculating the median, so it returns NaN.
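If you do want a median that ignores the missing values, one option is to ask for that explicitly. The sketch below uses a small made-up array (values is a hypothetical stand-in for gdp) to compare np.median with np.nanmedian and with the median method of a Pandas Series, which skips NaN values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for gdp; one value is missing.
values = np.array([1.0, np.nan, 3.0, 5.0])
# np.median gives nan because of the missing value.
plain_median = np.median(values)
# np.nanmedian drops the NaN values before calculating.
nan_median = np.nanmedian(values)
# The Pandas median method also skips NaN values by default.
series_median = pd.Series(values).median()
print(plain_median, nan_median, series_median)
```

Whether dropping the missing values is the right choice depends on why they are missing, so treat this as one possible approach rather than a default.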

The gender_data data frame has 216 rows. We can use the general Python len function to see how many elements there are in gdp.

len(gdp)
216

As expected, it has the same number of elements as there are rows in gender_data.

The count method of the series gives the number of values that are not missing; that is, the number of values that are not NaN.

gdp.count()
200
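The difference between len and count is the number of missing values. As a short sketch on made-up data (values is a hypothetical series, not the gdp series), the cell below checks that subtraction against a direct count of the NaN values using isna and sum.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
values = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
# Number of elements, missing or not.
n_all = len(values)
# Number of elements that are not NaN.
n_present = values.count()
# isna gives True for each NaN; sum counts the True values.
n_missing = values.isna().sum()
print(n_all, n_present, n_missing)
```

Applied to gdp, the same subtraction gives 216 - 200 = 16 missing values.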