Missing values
# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
pd.set_option('mode.copy_on_write', True)
If you are running on your laptop, you should download the gender_stats_min.csv file to the same directory as this notebook.
See the gender statistics description page for more detail on the dataset.
# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data.head()
| | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
|---|---|---|---|---|---|
| 0 | Aruba | ABW | NaN | NaN | 0.103744 |
| 1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
| 2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
| 3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
| 4 | Andorra | AND | 3.197538 | NaN | 0.079547 |
# Get the GDP values as a Pandas Series
gdp = gender_data['gdp_us_billion']
gdp.head()
0 NaN
1 19.961015
2 111.936542
3 12.327586
4 3.197538
Name: gdp_us_billion, dtype: float64
Missing values and NaN
Looking at the values of gdp (and therefore, the values of the gdp_us_billion column of gender_data), we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, or missing data.
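As a minimal sketch of how the NaN marker behaves, the cell below builds a small illustrative Series from the first five GDP values shown by head() above, rather than loading the full file. The names sample, nan_equals_itself and is_missing are made up for this example. It shows that NaN is not equal to itself, so you cannot find missing values with ==, and that the isna method of the Series marks them instead.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first five GDP values shown by gdp.head().
sample = pd.Series([np.nan, 19.961015, 111.936542, 12.327586, 3.197538])

# NaN is a floating point value that is not equal to anything,
# including itself, so == cannot be used to detect it.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# The isna method gives True for missing values and False otherwise.
is_missing = sample.isna()
print(is_missing)
```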
Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.
np.median(gdp)
nan
Notice the warning about an invalid value.
Numpy recognizes that one or more values are NaN and refuses to guess what to do when calculating the median.
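If you do want a median that ignores the missing values, one option is to ask for that explicitly. The sketch below again uses a small illustrative Series (the first five GDP values shown by head() above), and the variable names are made up for this example. It compares np.median with np.nanmedian, which skips NaN values, and with the median method of the Series, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first five GDP values shown by gdp.head().
sample = pd.Series([np.nan, 19.961015, 111.936542, 12.327586, 3.197538])

# Work on the underlying Numpy array for the Numpy functions.
values = sample.to_numpy()

# The plain median is NaN when any value is NaN.
plain_median = np.median(values)
print(plain_median)

# np.nanmedian drops the NaN values before calculating the median.
numpy_nan_median = np.nanmedian(values)
print(numpy_nan_median)

# The Pandas median method skips missing values by default.
pandas_median = sample.median()
print(pandas_median)
```

Whether dropping the missing values is the right choice depends on why they are missing, so treat this as one possible approach rather than the default answer.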
You saw from the shape above that gender_data has 216 rows. We can use the general Python len function to see how many elements there are in gdp.
len(gdp)
216
As expected, it has the same number of elements as there are rows in gender_data.
The count method of the series gives the number of values that are not missing, that is, not NaN.
gdp.count()
200
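The difference between len and count gives the number of missing values. The sketch below shows this on a small illustrative Series (the first five GDP values shown by head() above, with made-up variable names), and checks the result against a direct count using isna. Applied to the full gdp series, the same subtraction would be 216 - 200 = 16 missing values.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first five GDP values shown by gdp.head().
sample = pd.Series([np.nan, 19.961015, 111.936542, 12.327586, 3.197538])

# Total number of elements, missing or not.
n_total = len(sample)

# Number of elements that are not missing.
n_present = sample.count()

# The difference is the number of missing values.
n_missing = n_total - n_present
print(n_missing)

# The same number, by counting the True values from isna.
n_missing_direct = sample.isna().sum()
print(n_missing_direct)
```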