Introduction to data frames
Pandas is a Python package that implements data frames, and functions that operate on data frames.
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)
We will also use the usual Numpy array library:
# Load the Numpy array library, call it 'np'
import numpy as np
Loading a data frame from a file
We start by loading data from a Comma Separated Value (CSV) file. If you are
running on your laptop, you should download the gender_stats_min.csv file to
the same directory as this notebook.
See the gender statistics description page for more detail on the dataset.
# Load the data file
gender_data = pd.read_csv('gender_stats_min.csv')
This is our usual assignment statement. The Left Hand Side (LHS) is
gender_data, the variable name. The Right Hand Side (RHS) is an expression
that returns a value.
What type of value does it return?
type(gender_data)
pandas.core.frame.DataFrame
This tells us that the gender_data value (or object) is of type DataFrame.
We can also see that this type belongs to (is defined by) the Pandas library.
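If you do not have the data file to hand, you can try the same loading step on a small piece of CSV text held in memory. This is only a sketch: the name small_data and the three-row text are our own stand-ins, with values copied as this page displays them, and are not part of the original dataset file.

```python
import io

import pandas as pd

# A tiny stand-in for gender_stats_min.csv (an assumption for illustration).
# The empty fields in the Aruba row become missing values (NaN).
csv_text = """country_name,country_code,gdp_us_billion,mat_mort_ratio,population
Aruba,ABW,,,0.103744
Afghanistan,AFG,19.961015,444.00,32.715838
Angola,AGO,111.936542,501.25,26.937545
"""

# read_csv accepts a file-like object as well as a filename.
small_data = pd.read_csv(io.StringIO(csv_text))

# The result is a DataFrame, just as for the full file.
print(type(small_data))
print(small_data)
```

As with the full file, the right hand side returns a DataFrame, and the assignment attaches the name small_data to it.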
A DataFrame is a value that contains a table of data. The table has rows and columns.
Pandas integrates with the Notebook, so if you display a data frame in the notebook, you get a nicely formatted table:
gender_data
| | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|---|
0 | Aruba | ABW | NaN | NaN | 0.103744 |
1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
4 | Andorra | AND | 3.197538 | NaN | 0.079547 |
... | ... | ... | ... | ... | ... |
211 | Kosovo | XKX | 6.804620 | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 36.819337 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 345.209888 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 24.280990 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 15.495514 | 398.00 | 15.420964 |
216 rows × 5 columns
This default display for the DataFrame shows you the first five rows, then a
row of ... and then the last five rows. The row of ... shows you that there
are more rows, between the first five and the last five, that the display
does not show.
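If you want to choose which rows to look at, rather than rely on the default display, one option is the .head and .tail methods of the DataFrame. The sketch below uses a made-up data frame called example, so that it runs without the data file.

```python
import numpy as np
import pandas as pd

# A made-up data frame with 100 rows, only for illustration.
example = pd.DataFrame({'row_number': np.arange(100),
                        'squared': np.arange(100) ** 2})

# .head() gives the first five rows by default.
first_five = example.head()
# .tail(3) gives the last three rows.
last_three = example.tail(3)

print(first_five)
print(last_three)
```

Both methods return a new, smaller DataFrame, which the Notebook displays in the same way as the full one.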
What does the data mean?
In order to interpret these data, you need more information about what these column names refer to. This information is sometimes called the data dictionary. Here are the longer descriptions from the original data source (link above):
gdp_us_billion: GDP (in current US $ billions).
mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).
population: Population, total (millions).
Missing values
Notice the NaN at the top of the GDP column. This is a missing value. We
will come to these later, when we cover missing values.
For the moment, we will do something quick and dirty, which is to drop all the missing values from the data frame. Be careful - this is rarely the right thing to do, without a lot of investigation as to why the values are missing.
# Drop all missing values. Be careful, this is rarely the right thing to do.
gender_data_no_na = gender_data.dropna()
gender_data_no_na
| | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|---|
1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
5 | United Arab Emirates | ARE | 375.027082 | 6.00 | 9.080299 |
6 | Argentina | ARG | 550.980968 | 53.75 | 42.976675 |
... | ... | ... | ... | ... | ... |
210 | Samoa | WSM | 0.799887 | 54.75 | 0.192225 |
212 | Yemen, Rep. | YEM | 36.819337 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 345.209888 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 24.280990 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 15.495514 | 398.00 | 15.420964 |
179 rows × 5 columns
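Before dropping rows, it can help to count how many values are missing in each column. Here is a minimal sketch on a four-row data frame, with values copied from the first display above. The names small, n_missing and small_no_na are ours, for illustration only.

```python
import numpy as np
import pandas as pd

# Four rows with values as displayed above; np.nan marks a missing value.
small = pd.DataFrame({
    'country_code': ['ABW', 'AFG', 'AGO', 'AND'],
    'gdp_us_billion': [np.nan, 19.961015, 111.936542, 3.197538],
    'mat_mort_ratio': [np.nan, 444.00, 501.25, np.nan],
})

# Count the missing values in each column before dropping anything.
n_missing = small.isna().sum()
print(n_missing)

# dropna returns a new data frame with only the complete rows.
# The original data frame is unchanged.
small_no_na = small.dropna()
print(small_no_na)
```

Notice that the surviving rows keep their original row labels (here 1 and 2), which is why the labels in the display above skip from 3 to 5.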
Attributes
Like other Python objects (values), the DataFrame has attributes.
An attribute is some named value attached to another value. You can think
of it as a variable attached to a value. You can fetch the attached value
using the <value>.<attribute_name> syntax. For example, one attribute of the
data frame is its shape. The <value> in our case is gender_data_no_na, and
the <attribute_name> is shape.
gender_data_no_na.shape
(179, 5)
Notice that the .shape attribute is a sequence of two values. The first is
the number of rows and the second is the number of columns.
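Because .shape is a sequence of two values, you can unpack it into two variables in a single assignment. The sketch below uses a small made-up data frame, and the names n_rows and n_cols are our own.

```python
import pandas as pd

# A made-up data frame with three rows and two columns.
example = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Unpack the two values in .shape into separate variables.
n_rows, n_cols = example.shape
print('Rows:', n_rows)
print('Columns:', n_cols)

# len on a data frame gives the number of rows.
print(len(example) == n_rows)
```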
Columns
As you would expect for a table, the DataFrame has columns, and the columns
have labels. You can see the column labels in the display above, but you
can also get the column labels using the .columns attribute of the DataFrame.
gender_data_no_na.columns
Index(['country_name', 'country_code', 'gdp_us_billion', 'mat_mort_ratio',
'population'],
dtype='object')
Notice from the display above that Pandas wraps up the column names in their
own value, of type Index.
type(gender_data_no_na.columns)
pandas.core.indexes.base.Index
The Index type is Pandas' way of storing a sequence of labels, in this case
column labels.
You can get the column labels (names) as a list of strings by applying list
to the .columns attribute, like this:
# Get column names from .columns attribute.
list(gender_data_no_na.columns)
['country_name',
'country_code',
'gdp_us_billion',
'mat_mort_ratio',
'population']
In fact, there is a shortcut for doing that, which is to apply list to the
DataFrame itself. In that case, Pandas takes you to be asking for the column
names:
# Get column names from DataFrame directly.
list(gender_data_no_na)
['country_name',
'country_code',
'gdp_us_billion',
'mat_mort_ratio',
'population']
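The Index value behaves like a sequence in other ways too. As a small sketch, on a made-up two-column data frame, you can fetch a label by position and check whether a label is present. The variable names here are ours, for illustration.

```python
import pandas as pd

# A made-up data frame with two of the columns from above.
example = pd.DataFrame({'country_code': ['AFG', 'AGO'],
                        'population': [32.715838, 26.937545]})

labels = example.columns

# Fetch the first label by position, as for a list.
first_label = labels[0]
print(first_label)

# Check whether a particular label is among the column labels.
has_population = 'population' in labels
print(has_population)

# list on the data frame gives the same result as list on .columns.
print(list(example) == list(labels))
```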