Introduction to data frames#

Pandas is a Python package that implements data frames, along with functions that operate on them.

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

We will also use the usual NumPy array library:

# Load the NumPy array library, call it 'np'
import numpy as np

Loading a data frame from a file#

We start by loading data from a comma-separated values (CSV) file. If you are running this notebook on your laptop, download the gender_stats_min.csv file to the same directory as the notebook.

See the gender statistics description page for more detail on the dataset.

# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')

This is our usual assignment statement. The Left Hand Side (LHS) is gender_data, the variable name. The Right Hand Side (RHS) is an expression that returns a value.

What type of value does it return?

type(gender_data)

pandas.core.frame.DataFrame

This tells us that the gender_data value (or object) is of type DataFrame. The pandas prefix also tells us that this type belongs to (is defined by) the Pandas library.
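If you do not have the CSV file to hand, the following is a minimal sketch of the same loading step. It assumes a tiny stand-in for the file: the variable names csv_text and small_data are made up for illustration, and the three data rows are copied from the gender statistics data. It relies on read_csv accepting a file-like object as well as a filename.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk: a header line, then
# three rows copied from the gender statistics data.
csv_text = """country_name,country_code,gdp_us_billion,mat_mort_ratio,population
Aruba,ABW,,,0.103744
Afghanistan,AFG,19.961015,444.00,32.715838
Angola,AGO,111.936542,501.25,26.937545
"""

# read_csv accepts a filename or a file-like object such as StringIO.
small_data = pd.read_csv(io.StringIO(csv_text))

# The result is a DataFrame; empty fields in the text become NaN.
print(type(small_data))
print(small_data)
```

The empty fields in the Aruba row show where missing values can come from when reading a file.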

A DataFrame contains a table of data. The table has rows and columns.

Pandas integrates with the Notebook, so if you display a data frame in the notebook, you get a nicely formatted table:

gender_data

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
0	Aruba	ABW	NaN	NaN	0.103744
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
4	Andorra	AND	3.197538	NaN	0.079547
...	...	...	...	...	...
211	Kosovo	XKX	6.804620	NaN	1.813820
212	Yemen, Rep.	YEM	36.819337	399.75	26.246608
213	South Africa	ZAF	345.209888	143.75	54.177209
214	Zambia	ZMB	24.280990	233.75	15.633220
215	Zimbabwe	ZWE	15.495514	398.00	15.420964

216 rows × 5 columns

This default display for the DataFrame shows you the first five rows, then a row of ... and then the last five rows. The row of ... tells you that there are more rows, between the first five and the last five, that the display does not show.
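You can also ask for the first or last few rows yourself. The sketch below assumes a made-up data frame called many_rows, built only for illustration. It uses the head and tail methods, which return five rows by default, and looks up the Pandas option that controls how many rows are shown before the display is shortened.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with 100 rows, for illustration only.
many_rows = pd.DataFrame({'row_number': np.arange(100)})

# head() and tail() return the first and last five rows by default.
first_five = many_rows.head()
last_five = many_rows.tail()
print(first_five)
print(last_five)

# Pandas shortens the display when a data frame has more rows than
# this setting.
print(pd.get_option('display.max_rows'))
```

Passing a number, as in many_rows.head(10), asks for that many rows instead of five.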

What does the data mean?#

In order to interpret these data, you need more information about what these column names refer to. This information is sometimes called the data dictionary. Here are the longer descriptions from the original data source (link above):

gdp_us_billion: GDP (in current US $ billions).
mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).
population: Population, total (millions).

Missing values#

Notice the NaN at the top of the gdp_us_billion column. NaN marks a missing value. We will come back to these in missing values.

For the moment, we will do something quick and dirty, which is to drop every row that has a missing value from the data frame. Be careful: this is rarely the right thing to do without a lot of investigation into why the values are missing.

# Drop all missing values.  Be careful, this is rarely the right thing to do.
gender_data_no_na = gender_data.dropna()
gender_data_no_na

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
5	United Arab Emirates	ARE	375.027082	6.00	9.080299
6	Argentina	ARG	550.980968	53.75	42.976675
...	...	...	...	...	...
210	Samoa	WSM	0.799887	54.75	0.192225
212	Yemen, Rep.	YEM	36.819337	399.75	26.246608
213	South Africa	ZAF	345.209888	143.75	54.177209
214	Zambia	ZMB	24.280990	233.75	15.633220
215	Zimbabwe	ZWE	15.495514	398.00	15.420964

179 rows × 5 columns
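To see exactly what dropna does, here is a minimal sketch on three rows copied from the display of gender_data. The names small_data, n_missing and complete_rows are made up for illustration. It first counts the missing values in each column, then drops every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Three rows copied from the gender_data display; np.nan marks a
# missing value.
small_data = pd.DataFrame({
    'country_name': ['Aruba', 'Afghanistan', 'Andorra'],
    'gdp_us_billion': [np.nan, 19.961015, 3.197538],
    'mat_mort_ratio': [np.nan, 444.00, np.nan],
})

# Count the missing values in each column before dropping anything.
n_missing = small_data.isna().sum()
print(n_missing)

# dropna() drops every row that has at least one missing value.
complete_rows = small_data.dropna()
print(complete_rows)
```

Only the Afghanistan row survives, and it keeps its original row label of 1. This is why the row labels in the gender_data_no_na display skip 0 and 4.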

Attributes#

Like other Python objects (values), the DataFrame has attributes.

An attribute is a named value attached to another value. You can think of it as a variable attached to a value. You can fetch the attached value using the <value>.<attribute_name> syntax. For example, one attribute of the data frame is its shape. The <value> in our case is gender_data_no_na, and the <attribute_name> is shape.

gender_data_no_na.shape

(179, 5)

Notice that the .shape attribute is a sequence of two values (a tuple). The first is the number of rows and the second is the number of columns.
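Because .shape is a sequence of two values, you can unpack it into two variables. The sketch below assumes a small data frame, small_data, made up for illustration from a few values in the table.

```python
import pandas as pd

# Hypothetical small data frame: 3 rows, 2 columns.
small_data = pd.DataFrame({
    'country_code': ['AFG', 'AGO', 'ALB'],
    'population': [32.715838, 26.937545, 2.888280],
})

# Fetching an attribute needs no parentheses. Unpack the two values.
n_rows, n_cols = small_data.shape
print(n_rows, n_cols)

# len() of a data frame gives the number of rows.
print(len(small_data))
```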

Columns#

As you would expect for a table, the DataFrame has columns, and the columns have labels. You can see the column labels in the display above, but you can also get the column labels using the .columns attribute of the DataFrame.

gender_data_no_na.columns

Index(['country_name', 'country_code', 'gdp_us_billion', 'mat_mort_ratio',
       'population'],
      dtype='object')

Notice from the display above that Pandas wraps up the column names in their own value, of type Index.

type(gender_data_no_na.columns)

pandas.core.indexes.base.Index

The Index type is Pandas' way of storing a sequence of labels, in this case the column labels.
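An Index behaves much like other Python sequences. As a small sketch, assuming a made-up two-column data frame called small_data, you can ask for its length, fetch a label by position, and test whether a label is present:

```python
import pandas as pd

# Hypothetical small data frame, for illustration only.
small_data = pd.DataFrame({
    'country_code': ['AFG', 'AGO'],
    'population': [32.715838, 26.937545],
})
labels = small_data.columns

# Number of labels, which is the number of columns.
print(len(labels))

# Fetch a label by position, as for a list.
print(labels[0])

# Test whether a label is among the column labels.
print('population' in labels)
print('gdp_us_billion' in labels)
```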

You can get the column labels (names) as a list of strings by calling list on the .columns attribute, like this:

# Get column names from .columns attribute.
list(gender_data_no_na.columns)

['country_name',
 'country_code',
 'gdp_us_billion',
 'mat_mort_ratio',
 'population']

In fact, there is a shortcut for doing that, which is to apply list to the DataFrame itself. In that case, the DataFrame takes you to be asking for the column names:

# Get column names from DataFrame directly.
list(gender_data_no_na)

['country_name',
 'country_code',
 'gdp_us_billion',
 'mat_mort_ratio',
 'population']
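The shortcut works because stepping through a DataFrame, for example with a for loop, gives the column labels one at a time. Here is a minimal sketch, again assuming a made-up data frame called small_data:

```python
import pandas as pd

# Hypothetical small data frame, for illustration only.
small_data = pd.DataFrame({
    'country_code': ['AFG', 'AGO'],
    'population': [32.715838, 26.937545],
})

# Looping over a data frame visits the column labels, not the rows.
column_names = []
for name in small_data:
    column_names.append(name)
print(column_names)

# This gives the same result as list() on the .columns attribute.
print(column_names == list(small_data.columns))
```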