Selecting rows from DataFrames

Selecting rows from DataFrames#

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')

We return again to the DataFrame we were working on in basic column indexing.

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
labeled_gdata = gender_data.dropna().set_index('country_code')
# Show the result
labeled_gdata
country_name gdp_us_billion mat_mort_ratio population
country_code
AFG Afghanistan 19.961015 444.00 32.715838
AGO Angola 111.936542 501.25 26.937545
ALB Albania 12.327586 29.25 2.888280
ARE United Arab Emirates 375.027082 6.00 9.080299
ARG Argentina 550.980968 53.75 42.976675
... ... ... ... ...
WSM Samoa 0.799887 54.75 0.192225
YEM Yemen, Rep. 36.819337 399.75 26.246608
ZAF South Africa 345.209888 143.75 54.177209
ZMB Zambia 24.280990 233.75 15.633220
ZWE Zimbabwe 15.495514 398.00 15.420964

179 rows × 4 columns

Thus far we have used DataFrame indexing to select columns from DataFrames.

Remember, indexing is where we use square brackets following a value in order to select data from inside that value.

We also often want to select rows from DataFrames. You have already seen the .head and .tail methods, for selecting rows.

This page covers a more general, and more useful way of selecting rows, using direct indexing with a Boolean Series (DIBS).

Direct indexing with a Boolean Series#

We often want to select rows from the data frame that match some criterion.

For example, let us return to the problem of comparing richer and poorer countries. So far we have sorted the DataFrame with .sort_values and then selected the first \(n\) or last \(n\) values, using .head and .tail. But what if we wanted to select the rows corresponding the countries with a GDP above some threshold.

Here’s are the GDP values as a Series. We are using direct indexing with column labels (DICL) on the DataFrame, to get the column as a Series:

# Direct indexing with column labels (DICL).
gdp = labeled_gdata['gdp_us_billion']
gdp
country_code
AFG     19.961015
AGO    111.936542
ALB     12.327586
ARE    375.027082
ARG    550.980968
          ...    
WSM      0.799887
YEM     36.819337
ZAF    345.209888
ZMB     24.280990
ZWE     15.495514
Name: gdp_us_billion, Length: 179, dtype: float64

Here’s the histogram of GDP again:

plt.hist(gdp)
(array([169.,   6.,   2.,   0.,   0.,   1.,   0.,   0.,   0.,   1.]),
 array([1.77430636e-01, 1.73707215e+03, 3.47396686e+03, 5.21086158e+03,
        6.94775630e+03, 8.68465102e+03, 1.04215457e+04, 1.21584404e+04,
        1.38953352e+04, 1.56322299e+04, 1.73691246e+04]),
 <BarContainer object of 10 artists>)
../_images/0b61674bba388afe565cd756740f07ab70d5203969fbc959ab6827c9788cbbf0.png

Looking at the histogram, we could try 1000 (billion US dollars) as a threshold to identify high GDP countries.

As you have found in the Series are like arrays page, Series are like arrays in many ways. As you remember, if you do a comparison on an array, you get a Boolean array, where each value in the Boolean array is the True or False answer to the comparison question for the corresponding element in the original array. If you do a comparison on a Series, you get a Boolean Series, where each value in the Series is the True or False answer to the comparison question for the corresponding element in the original Series.

For example, here we do a comparison on the gdp series, and we get a Boolean Series.

gdp_gt_1000 = gdp > 1000
gdp_gt_1000
country_code
AFG    False
AGO    False
ALB    False
ARE    False
ARG    False
       ...  
WSM    False
YEM    False
ZAF    False
ZMB    False
ZWE    False
Name: gdp_us_billion, Length: 179, dtype: bool

Notice that the Boolean Series has the same row labels (index) as the original Series (here, gdp).

The Boolean Series is — a Series:

type(gdp_gt_1000)
pandas.core.series.Series

We can use this Boolean series to select rows from the DataFrame, by Boolean indexingdirect indexing with Boolean Series (DIBS).

When we index using the Boolean Series inside the square brackets, it works like as it does when we index an array with a Boolean array:

rich_gender_data = labeled_gdata[gdp_gt_1000]
rich_gender_data
country_name gdp_us_billion mat_mort_ratio population
country_code
AUS Australia 1422.994116 6.00 23.444560
BRA Brazil 2198.765606 49.50 204.159544
CAN Canada 1708.473627 7.25 35.517119
CHN China 10182.790479 28.75 1364.446000
DEU Germany 3601.226158 6.25 81.281645
ESP Spain 1299.724261 5.00 46.553128
FRA France 2647.649725 8.75 66.302099
GBR United Kingdom 2768.864417 9.25 64.641557
IND India 2019.005411 185.25 1293.742537
ITA Italy 2005.983980 4.00 60.378795
JPN Japan 5106.024760 5.75 127.297102
KOR Korea, Rep. 1346.751162 12.00 50.727212
MEX Mexico 1188.802780 40.00 124.203450
RUS Russian Federation 1822.691700 25.25 143.793504
USA United States 17369.124600 14.00 318.558175
type(rich_gender_data)
pandas.core.frame.DataFrame

rich_gender_data is a new data frame, that is a subset of the original labeled_gdata frame. It contains only the rows where the GDP value is greater than 1000 billion dollars. Check the display of rich_gender_data above to confirm that the values in the gdp_us_billion column are all greater than 1000.

We can do a scatter plot of GDP values against maternal mortality rate, and we find again that, for rich countries, there is little relationship between GDP and maternal mortality.

plt.scatter(rich_gender_data['gdp_us_billion'],
            rich_gender_data['mat_mort_ratio'])
<matplotlib.collections.PathCollection at 0x7f66191ab890>
../_images/17ef1b5d3713b254d2979ac52d58a2f4e19a6a4a6399961c6a19544058f74e1e.png