Pandas and indexing expressions#
The subexpressions page discussed the idea of expressions that get evaluated in the course of evaluating a larger expression.
In that page you saw examples of expressions and sub-expressions that were:
Floating point literal expressions (e.g.
10.50
).Variables (e.g.
a
).Call expressions (e.g.
cos(0)
).
This page introduces indexing expressions in Numpy and Pandas, and the way you can build up indexing expressions and sub-expressions to select data.
First we start with our usual imports.
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
Numpy indexing expressions#
Let us make an array:
a = np.arange(10, 20, 2)
The first line is an assignment statements, with the variable name (a
) on
the Left Hand Side (LHS) of the =
, and an expression (code that returns a
value) on the Right Hand Side (RHS).
The RHS expression is np.arange(10, 20, 2)
. This is a call expression,
where we call the np.arange
function with three arguments. Evaluating the
expression generates an array with values [10, 12, 14, 16, 18]
:
a
array([10, 12, 14, 16, 18])
Now consider indexing into the array. Indexing occurs where we follow a value by square brackets, with something in between.
Here we follow the expression a
(referring to our array value) by [2]
(the expression 2
between square brackets:
# In indexing expression.
a[2]
14
We can see this is an expression because Jupyter shows a value when we execute
the cell, meaning, that a[2]
is something that returns a value — an
expression.
a[2]
means:
fetch the element at position 2 from the array
a
.
We have found that we can index arrays by putting three types of things between the square brackets attached to an array:
Integers, to specify an element at a given position in the array.
Slices. Slices are expressions containing colons of
start:stop:step
.Boolean arrays.
The next three code cells are examples of indexing expressions:
a[2] # Indexing with integers
14
a[1:-1:2] # Indexing with a slice
array([12, 16])
b = np.array([True, False, True, True, False])
a[b] # Indexing with Boolean array
array([10, 14, 16])
Pandas indexing expressions#
Here is our familiar data frame of data from the World Bank:
df = pd.read_csv('gender_stats.csv')
df
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
211 | Kosovo | XKX | 2.14250 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
216 rows × 9 columns
In terms of expressions, the first line is an assignment statement, where the
RHS is, as ever, an expression. In this case, it’s a call expression in
which we call the pd.read_csv
function with the argument gender_stats.csv
,
returning a DataFrame value.
The second line is an expression that evaluates to that DataFrame value, so
Jupyter shows us the DataFrame attached to the name df
.
Now let us consider Pandas direct indexing with column labels (DICL).
col_name = 'gdp_us_billion'
col_name
'gdp_us_billion'
The assignment statement RHS expression is a string literal expression, asking Pandas to make a str
value.
Next we extract the named column with DICL into the DataFrame:
# DICL
gdp = df[col_name]
gdp
0 NaN
1 19.961015
2 111.936542
3 12.327586
4 3.197538
...
211 6.804620
212 36.819337
213 345.209888
214 24.280990
215 15.495514
Name: gdp_us_billion, Length: 216, dtype: float64
The assignment statement RHS above is a Pandas indexing expression, with df
(resulting in the DataFrame value), followed by square brackets, with an
expression inside. In this case the expression gives a str
value indicating
the column name. When we pass an expression giving a string inside the indexing
square brackets, this causes Pandas to select the column with the given
label.
The result, as usual, is Pandas Series
— the structure that Python uses to
store a column of data.
type(gdp)
pandas.core.series.Series
DICL is the first type of Pandas indexing expression for DataFrames. In DICL, we put a column label inside the square brackets.
The other type of Pandas indexing expression that we have seen is direct indexing with Boolean Series. In this form of indexing, we put a Boolean Series inside the square brackets.
We can form a Boolean Series by evaluating a comparison expression on a Series.
# The RHS is a comparison expression on a Series
gdp_gt_1000 = gdp > 1000
# Show the result.
gdp_gt_1000
0 False
1 False
2 False
3 False
4 False
...
211 False
212 False
213 False
214 False
215 False
Name: gdp_us_billion, Length: 216, dtype: bool
DIBS occurs where we put the Boolean Series inside the square brackets following a DataFrame value:
# DICS indexing expression
df[gdp_gt_1000]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
26 | Brazil | BRA | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
58 | Spain | ESP | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
88 | India | IND | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
104 | Korea, Rep. | KOR | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
124 | Mexico | MEX | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
164 | Russian Federation | RUS | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
Notice that this has the effect of selecting rows in df
for which the GDP
values were greater than 1000, returning a new DataFrame, containing only those
rows that have corresponding True values in the Boolean Series.
The form above is a little long-winded:
We first make a variable containing the Boolean Series
We then use the variable as the input to the DIBS indexing expression.
To be more compact, we often find ourselves calculating the Boolean Series on the fly, like this.
# DIBS indexing expression, calculating the BS inside the brackets
df[gdp > 1000]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
26 | Brazil | BRA | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
58 | Spain | ESP | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
88 | India | IND | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
104 | Korea, Rep. | KOR | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
124 | Mexico | MEX | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
164 | Russian Federation | RUS | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
In the expression above, Python first evaluates the expression inside the
square brackets: gdp > 1000
. Evaluating this expression gives a value, which
is a Boolean Series — the same Boolean Series we made before, and attached to
the variable gdp_gt_1000
. Because the value is inside square brackets, the
DataFrame will get this to do indexing, and again, will select the rows with
True for the gdp > 1000
comparison.
Notice that, in order to run the code above, we had to first make the gdp
variable containing the Series.
Sometimes we want to get the same result by generating the Series, and then the Boolean Series, using expressions inside the square brackets. For example, consider the following code:
df[df['gdp_us_billion'] > 1000]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
26 | Brazil | BRA | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
58 | Spain | ESP | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
88 | India | IND | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
104 | Korea, Rep. | KOR | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
124 | Mexico | MEX | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
164 | Russian Federation | RUS | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
Notice that this has the same effect as our previous DIBS indexing expressions, because it ends up doing the same indexing.
Read the expression above as a series of sub-expressions:
'gdp_us_billion'
, a string literal expression, giving a string1000
, a integer literal expression, giving an integer.df['gdp_us_billion']
, a DICL indexing expression, giving a Series,df['gdp_us_billion'] > 1000
, a comparison expression, comparing the Series from 3. above to 1000 from 2, and resulting in a Boolean Series.
The whole expression is then a DIBS expression, indexing into a Pandas DataFrame with a Boolean Series.