Missing values

# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
pd.set_option('mode.copy_on_write', True)

If you are running on your laptop, you should download the gender_stats_min.csv file to the same directory as this notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data.head()

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
0	Aruba	ABW	NaN	NaN	0.103744
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
4	Andorra	AND	3.197538	NaN	0.079547

# Get the GDP values as a Pandas Series
gdp = gender_data['gdp_us_billion']
gdp.head()

0           NaN
1     19.961015
2    111.936542
3     12.327586
4      3.197538
Name: gdp_us_billion, dtype: float64

Missing values and `NaN`

Looking at the values of gdp (and therefore the values of the gdp_us_billion column of gender_data), we see that some of the values are NaN, which means Not a Number. Pandas uses this marker to indicate values that are not available, that is, missing data.
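To make the NaN marker more concrete, here is a minimal sketch using a small made-up Series. The name `example` and its values are invented for illustration and do not come from the data file. It shows that NaN is a floating point value, that comparing with `==` cannot find it, and that the Series `isna` method can.

```python
import numpy as np
import pandas as pd

# A small, made-up Series with one missing value (illustration only).
example = pd.Series([np.nan, 19.96, 111.94])

# NaN is a special floating point value.
nan_type = type(np.nan)
print(nan_type)

# NaN is not equal to anything, including itself, so == cannot find it.
equal_to_nan = example == np.nan
print(equal_to_nan)

# The isna method returns True where the value is missing.
is_missing = example.isna()
print(is_missing)
```

Because `==` never matches NaN, `isna` (or its opposite, `notna`) is the usual way to locate missing values.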

Numpy does not like to calculate with NaN values. Here is Numpy trying to calculate the median of the gdp values.

np.median(gdp)

nan

The result is nan. Depending on your Numpy version, you may also see a warning about an invalid value.

Numpy recognizes that one or more values are NaN and refuses to guess what to do when calculating the median.
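As a hedged sketch of this behavior, the example below uses a small made-up Series (the name `example` and its values are assumptions for illustration, not the real GDP data). It contrasts `np.median`, which returns NaN when any value is NaN, with two options that skip missing values: Numpy's `np.nanmedian` and the Pandas `median` method, which drops NaN values by default.

```python
import numpy as np
import pandas as pd

# A small, made-up Series with one missing value (illustration only).
example = pd.Series([np.nan, 1.0, 2.0, 3.0])

# np.median does not skip NaN values, so the result is NaN.
plain_median = np.median(example)
print(plain_median)

# np.nanmedian ignores NaN values and uses the remaining three.
nan_aware_median = np.nanmedian(example)
print(nan_aware_median)

# The Pandas median method skips NaN values by default (skipna=True).
pandas_median = example.median()
print(pandas_median)
```

Which behavior is appropriate depends on the analysis. Skipping missing values silently can hide problems in the data, which is one reason Numpy makes you ask for it explicitly.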

gender_data has 216 rows (you can check this with gender_data.shape). We can use the general Python len function to see how many elements there are in gdp.

len(gdp)

As expected, it has the same number of elements as there are rows in gender_data.

The count method of the series gives the number of values that are not missing, that is, not NaN.

gdp.count()
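The difference between `len` and `count` gives the number of missing values. The sketch below shows this on a small made-up Series (`example` and its values are invented for illustration) and cross-checks the answer by summing the result of `isna`.

```python
import numpy as np
import pandas as pd

# A small, made-up Series with two missing values (illustration only).
example = pd.Series([np.nan, 19.96, 111.94, np.nan, 3.20])

# len counts every element, missing or not.
n_all = len(example)

# count gives the number of elements that are not NaN.
n_valid = example.count()

# The difference is the number of missing values.
n_missing = n_all - n_valid

# Cross-check: isna gives True for each missing value, and True sums as 1.
n_missing_check = example.isna().sum()

print(n_all, n_valid, n_missing, n_missing_check)
```

The same calculation should work on gdp, by replacing `example` with `gdp`.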
