Selecting columns from DataFrames

Selecting columns from DataFrames#

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')

Let us return to the DataFrame we had at the end of the previous notebook The data frame index.

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data_no_na = gender_data.dropna()
labeled_gdata = gender_data_no_na.set_index('country_code')
# Show the result
labeled_gdata

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
AFG	Afghanistan	19.961015	444.00	32.715838
AGO	Angola	111.936542	501.25	26.937545
ALB	Albania	12.327586	29.25	2.888280
ARE	United Arab Emirates	375.027082	6.00	9.080299
ARG	Argentina	550.980968	53.75	42.976675
...	...	...	...	...
WSM	Samoa	0.799887	54.75	0.192225
YEM	Yemen, Rep.	36.819337	399.75	26.246608
ZAF	South Africa	345.209888	143.75	54.177209
ZMB	Zambia	24.280990	233.75	15.633220
ZWE	Zimbabwe	15.495514	398.00	15.420964

179 rows × 4 columns

Selecting columns#

You have seen array indexing (in Selecting with arrays. You remember that array indexing uses square brackets. Indexing is the technical term for getting data from a value using square brackets. Data frames also allow indexing. For example, we often want to get all the data for a single column of the data frame. To do this, we use the same square bracket notation as we use for array indexing, with the name of the column inside the square brackets.

gdp = labeled_gdata['gdp_us_billion']

Call this column label indexing.

We all also call this direct indexing with column labels or DICL.

By direct indexing we mean indexing a DataFrame using square brackets after the DataFrame value — the kind of indexing you will be familiar with from arrays. Later we will come across another form of indexing, called indirect indexing, but that is for another time.

What type of value came back from this direct indexing with column labels?

type(gdp)

pandas.core.series.Series

Pandas sent back a Series value from our selection of the column. The Series is a Pandas’ way of representing a column of data. Here is the default display for the gdp Series:

gdp

country_code
AFG     19.961015
AGO    111.936542
ALB     12.327586
ARE    375.027082
ARG    550.980968
          ...    
WSM      0.799887
YEM     36.819337
ZAF    345.209888
ZMB     24.280990
ZWE     15.495514
Name: gdp_us_billion, Length: 179, dtype: float64

The display shows the values from the column, that you saw previously in the DataFrame gdp_us_billion column. You see the values printed on the right. Notice too that the Series has kept the row labels from the DataFrame, so each value now has it’s corresponding row label. The index values are on the left.

Notice that, if your string specifying the column name does not match a column name exactly, you will get a long error. This gives you some practice in reading long error messages - skip to the end first, you will often see the most helpful information there.

# The correct column name is in lower case.
labeled_gdata['GDP_US_BILLION']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'GDP_US_BILLION'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 2
      1 # The correct column name is in lower case.
----> 2 labeled_gdata['GDP_US_BILLION']

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'GDP_US_BILLION'

GDP and Maternal Mortality Rate#

There are two ways of getting plots from data in data frames. Here we use the most basic method, that you have already seen. Soon, we will get onto a more elegant plotting method.

The gdp variable is a sequence of values, so we can do a histogram on these values, as we have done histograms on arrays.

plt.hist(gdp)
plt.title('Gross Domestic Product (US billion)');

../_images/1dbb25312c841207c168cba09c509c8db068584a11ae2e29f263c8fd1b04ae24.png

Now we have had a look at the GDP values, we will look at the values for the Maternal Mortality Rate (MMR). The MMR is the number of women who die in childbirth for every 100,000 births. These are the numbers in the mat_mort_ratio column.

mmr = labeled_gdata['mat_mort_ratio']
mmr

country_code
AFG    444.00
AGO    501.25
ALB     29.25
ARE      6.00
ARG     53.75
        ...  
WSM     54.75
YEM    399.75
ZAF    143.75
ZMB    233.75
ZWE    398.00
Name: mat_mort_ratio, Length: 179, dtype: float64

Notice the row labels on the left, and the values from the mat_mort_ratio column on the right.

plt.hist(mmr)
plt.title('Maternal Mortality Rate');

../_images/2d7094ca572e744dbe348ca4c03fd1946637736b43c66b1b4d9f47e0a63144b1.png

Selecting more than one column#

Let us say that we would like to avoid distraction by restricting our data frame to the columns we are interested in.

There are our columns labels:

col_labels = list(labeled_gdata)
col_labels

['country_name', 'gdp_us_billion', 'mat_mort_ratio', 'population']

Let us say that, for now, we only want the first three columns. We can do that with direct indexing with column labels (DICL) too. To do that, we put a list of strings between the square brackets, instead of the string with the column name.

Here are the column labels we want:

# Select all labels but the last.
col_labels_we_want = col_labels[:-1]
col_labels_we_want

['country_name', 'gdp_us_billion', 'mat_mort_ratio']

We can select only these columns using direct indexing:

# Direct indexing with a list of column labels
thinner_gdata = labeled_gdata[col_labels_we_want]
thinner_gdata

	country_name	gdp_us_billion	mat_mort_ratio
country_code
AFG	Afghanistan	19.961015	444.00
AGO	Angola	111.936542	501.25
ALB	Albania	12.327586	29.25
ARE	United Arab Emirates	375.027082	6.00
ARG	Argentina	550.980968	53.75
...	...	...	...
WSM	Samoa	0.799887	54.75
YEM	Yemen, Rep.	36.819337	399.75
ZAF	South Africa	345.209888	143.75
ZMB	Zambia	24.280990	233.75
ZWE	Zimbabwe	15.495514	398.00

179 rows × 3 columns

Dropping columns#

We have just used DICL to select the columns we want. In the case above, we selected three out of the four column labels, and our DICL operation had the effect of dropping the last column.

Another way of dropping a column is to use the DataFrame .drop method.

In fact the .drop method allows us to drop columns and rows. To tell .drop we want to drop columns rather than rows, we use the column= keyword argument to .drop, like this:

# Drop the population column.
thinner_again = labeled_gdata.drop(columns=['population'])
thinner_again

	country_name	gdp_us_billion	mat_mort_ratio
country_code
AFG	Afghanistan	19.961015	444.00
AGO	Angola	111.936542	501.25
ALB	Albania	12.327586	29.25
ARE	United Arab Emirates	375.027082	6.00
ARG	Argentina	550.980968	53.75
...	...	...	...
WSM	Samoa	0.799887	54.75
YEM	Yemen, Rep.	36.819337	399.75
ZAF	South Africa	345.209888	143.75
ZMB	Zambia	24.280990	233.75
ZWE	Zimbabwe	15.495514	398.00

179 rows × 3 columns

Back to GDP and MMR#

We are interested in the relationship of gpp and mmr. Maybe richer countries have better health care, and fewer maternal deaths.

Here is a plot, using the standard Matplotlib scatter function.

plt.scatter(gdp, mmr);
plt.title('Maternal Mortality Rate as a function of GDP');

../_images/9412964f2acad7006a3003d6be55e8d6daaca4000204396edb2e836dabe8670a.png

The scatterplot is not convincing in showing a relationship between GDP and the MMR. That is puzzling, given the assumption that higher GDP tends to be associated with better healthcare.

We need to explore some more. For example, could it be that, beyond a certain level of GDP, healthcare is good enough to prevent most maternal deaths? In that case, we might still expect a relationship between GDP and MMR in lower-income countries, but a much weaker relationship for lower-income countries.

To go further, we will see what we can do by sorting the DataFrame, and looking at the first and last rows.