Introduction to data frames

Pandas is a Python package that implements data frames, and functions that operate on data frames.

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

We will also use the usual Numpy array library:

# Load the Numpy array library, call it 'np'
import numpy as np

Loading a data frame from a file

We start by loading data from a Comma Separated Values file (CSV file). If you are running on your laptop, you should download the gender_stats_min.csv file to the same directory as this notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')

This is our usual assignment statement. The Left Hand Side (LHS) is gender_data, the variable name. The Right Hand Side (RHS) is an expression that returns a value.

What type of value does it return?

type(gender_data)
pandas.core.frame.DataFrame

This tells us that the gender_data value (or object) is of type DataFrame. The pandas at the start of the output also tells us that this type belongs to (is defined by) the Pandas library.
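If you do not have the CSV file to hand, you can try the same call on CSV text held in memory. This is only a sketch: the variable names csv_text and small_df are made up for the example, and the three rows are copied from the gender_stats_min.csv data. The empty fields in the first row stand for missing values.

```python
import io

import pandas as pd

# Three rows of the dataset, written out as CSV text.
# The empty fields in the Aruba row are missing values.
csv_text = """country_name,country_code,gdp_us_billion,mat_mort_ratio,population
Aruba,ABW,,,0.103744
Afghanistan,AFG,19.961015,444.00,32.715838
Angola,AGO,111.936542,501.25,26.937545
"""

# pd.read_csv accepts a file-like object as well as a filename.
small_df = pd.read_csv(io.StringIO(csv_text))

# The result is a DataFrame, as it was for the full file.
print(type(small_df))
print(small_df)
```

Pandas reads the empty fields as NaN, the marker for a missing value.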

A DataFrame is a thing that contains a table of data. The table has rows and columns.

Pandas integrates with the Notebook, so, if you display a data frame in the notebook, you get a nicely formatted table:

gender_data
country_name country_code gdp_us_billion mat_mort_ratio population
0 Aruba ABW NaN NaN 0.103744
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
4 Andorra AND 3.197538 NaN 0.079547
... ... ... ... ... ...
211 Kosovo XKX 6.804620 NaN 1.813820
212 Yemen, Rep. YEM 36.819337 399.75 26.246608
213 South Africa ZAF 345.209888 143.75 54.177209
214 Zambia ZMB 24.280990 233.75 15.633220
215 Zimbabwe ZWE 15.495514 398.00 15.420964

216 rows × 5 columns

This default display for the DataFrame shows you the first five rows, then a row of ..., and then the last five rows. The row of ... tells you that there are more rows, between the first five and the last five, that the display does not show.
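If you want to see only the start or the end of a data frame, you can ask for that with the head and tail methods. Here is a small sketch on a made-up 12-row data frame. The name example_df and its columns are invented for the example.

```python
import pandas as pd

# A made-up data frame with 12 rows, to have something to look at.
example_df = pd.DataFrame({'row_number': range(12),
                           'squared': [i ** 2 for i in range(12)]})

# With no argument, head gives the first five rows.
first_five = example_df.head()
# Pass a number to ask for that many rows; here the last three.
last_three = example_df.tail(3)

print(first_five)
print(last_three)
```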

What does the data mean?

In order to interpret these data, you need more information about what these column names refer to. This information is sometimes called the data dictionary. Here are the longer descriptions from the original data source (link above):

  • gdp_us_billion: GDP (in current US $ billions).

  • mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).

  • population: Population, total (millions).

Missing values

Notice the NaN at the top of the GDP column. This is a missing value. We will come back to missing values later.

For the moment, we will do something quick and dirty, which is to drop every row that has a missing value from the data frame. Be careful: this is rarely the right thing to do without a lot of investigation as to why the values are missing.

# Drop all missing values.  Be careful, this is rarely the right thing to do.
gender_data_no_na = gender_data.dropna()
gender_data_no_na
country_name country_code gdp_us_billion mat_mort_ratio population
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
5 United Arab Emirates ARE 375.027082 6.00 9.080299
6 Argentina ARG 550.980968 53.75 42.976675
... ... ... ... ... ...
210 Samoa WSM 0.799887 54.75 0.192225
212 Yemen, Rep. YEM 36.819337 399.75 26.246608
213 South Africa ZAF 345.209888 143.75 54.177209
214 Zambia ZMB 24.280990 233.75 15.633220
215 Zimbabwe ZWE 15.495514 398.00 15.420964

179 rows × 5 columns
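To see what dropna does in more detail, here is a sketch using the first five rows of gender_data, typed in by hand. The names first_rows, n_missing and first_rows_no_na are made up for the example. Before dropping anything, it can help to count how many values are missing in each column.

```python
import numpy as np
import pandas as pd

# The first five rows of gender_data, typed in by hand.
first_rows = pd.DataFrame({
    'country_name': ['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra'],
    'country_code': ['ABW', 'AFG', 'AGO', 'ALB', 'AND'],
    'gdp_us_billion': [np.nan, 19.961015, 111.936542, 12.327586, 3.197538],
    'mat_mort_ratio': [np.nan, 444.00, 501.25, 29.25, np.nan],
    'population': [0.103744, 32.715838, 26.937545, 2.888280, 0.079547],
})

# Count the missing values in each column.
n_missing = first_rows.isna().sum()
print(n_missing)

# dropna removes every row that has at least one missing value.
# It returns a new data frame; first_rows itself does not change.
first_rows_no_na = first_rows.dropna()
print(first_rows_no_na)
```

Aruba and Andorra each have at least one missing value, so dropna removes both rows. The remaining rows keep their original row labels (1, 2 and 3), which is why the row labels in the gender_data_no_na display skip some numbers.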

Attributes

Like other Python objects (values), the DataFrame has attributes.

An attribute is some named value attached to another value. You can think of it as a variable attached to a value. You can fetch the attached value using the <value>.<attribute_name> syntax. For example, one attribute of the data frame is its shape. The <value> in our case is gender_data_no_na, and the <attribute_name> is shape.

gender_data_no_na.shape
(179, 5)

Notice that the .shape attribute is a sequence of two values. The first is the number of rows and the second is the number of columns.
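Because .shape is a sequence of two values, you can unpack it into two variables. Here is a sketch on a tiny made-up data frame. The names tiny_df, n_rows and n_cols are invented for the example.

```python
import pandas as pd

# A tiny made-up data frame with three rows and two columns.
tiny_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Unpack the two values in .shape into two variables.
n_rows, n_cols = tiny_df.shape
print('Rows:', n_rows)
print('Columns:', n_cols)

# len applied to a data frame also gives the number of rows.
print('Length:', len(tiny_df))
```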

Columns

As you would expect for a table, the DataFrame has columns, and the columns have labels. You can see the column labels in the display above, but you can also get the column labels using the .columns attribute of the DataFrame.

gender_data_no_na.columns
Index(['country_name', 'country_code', 'gdp_us_billion', 'mat_mort_ratio',
       'population'],
      dtype='object')

Notice from the display above that Pandas wraps up the column names in their own value, of type Index.

type(gender_data_no_na.columns)
pandas.core.indexes.base.Index

The Index type is Pandas' way of storing a sequence of labels, in this case column labels.
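In many ways you can treat an Index like other Python sequences. The sketch below, on a tiny made-up data frame (tiny_df and cols are names invented for the example), fetches a label by position, counts the labels, and checks whether a label is present.

```python
import pandas as pd

# A tiny made-up data frame with two columns.
tiny_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

cols = tiny_df.columns

# Fetch the first label by position.
print(cols[0])
# Count the labels.
print(len(cols))
# Check whether a label is among the column labels.
print('b' in cols)
print('z' in cols)
```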

You can get the column labels (names) as a list of strings by applying list to the .columns attribute, like this:

# Get column names from .columns attribute.
list(gender_data_no_na.columns)
['country_name',
 'country_code',
 'gdp_us_billion',
 'mat_mort_ratio',
 'population']

In fact, there is a short-cut for doing that, which is to apply list to the DataFrame itself. In that case, Pandas takes you to be asking for the column names:

# Get column names from DataFrame directly.
list(gender_data_no_na)
['country_name',
 'country_code',
 'gdp_us_billion',
 'mat_mort_ratio',
 'population']
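The short-cut works because going through a DataFrame one element at a time, as list does, gives the column labels. The same thing happens in a for loop. Here is a sketch on a tiny made-up data frame. The names tiny_df and names_from_loop are invented for the example.

```python
import pandas as pd

# A tiny made-up data frame with two columns.
tiny_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Looping over the data frame gives the column labels, one at a time.
names_from_loop = []
for name in tiny_df:
    names_from_loop.append(name)
print(names_from_loop)

# This matches the list we get from the .columns attribute.
print(names_from_loop == list(tiny_df.columns))
```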