Pandas and indexing expressions

Pandas and indexing expressions#

The subexpressions page discussed the idea of expressions that get evaluated in the course of evaluating a larger expression.

In that page you saw examples of expressions and sub-expressions that were:

  • Floating point literal expressions (e.g. 10.50).

  • Variables (e.g. a).

  • Call expressions (e.g. cos(0)).

This page introduces indexing expressions in Numpy and Pandas, and the way you can build up indexing expressions and sub-expressions to select data.

First we start with our usual imports.

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

Numpy indexing expressions#

Let us make an array:

a = np.arange(10, 20, 2)

The first line is an assignment statements, with the variable name (a) on the Left Hand Side (LHS) of the =, and an expression (code that returns a value) on the Right Hand Side (RHS).

The RHS expression is np.arange(10, 20, 2). This is a call expression, where we call the np.arange function with three arguments. Evaluating the expression generates an array with values [10, 12, 14, 16, 18]:

a
array([10, 12, 14, 16, 18])

Now consider indexing into the array. Indexing occurs where we follow a value by square brackets, with something in between.

Here we follow the expression a (referring to our array value) by [2] (the expression 2 between square brackets:

# In indexing expression.
a[2]
14

We can see this is an expression because Jupyter shows a value when we execute the cell, meaning, that a[2] is something that returns a value — an expression.

a[2] means:

fetch the element at position 2 from the array a.

We have found that we can index arrays by putting three types of things between the square brackets attached to an array:

  • Integers, to specify an element at a given position in the array.

  • Slices. Slices are expressions containing colons of start:stop:step.

  • Boolean arrays.

The next three code cells are examples of indexing expressions:

a[2]  # Indexing with integers
14
a[1:-1:2]  # Indexing with a slice
array([12, 16])
b = np.array([True, False, True, True, False])
a[b]  # Indexing with Boolean array
array([10, 14, 16])

Pandas indexing expressions#

Here is our familiar data frame of data from the World Bank:

df = pd.read_csv('gender_stats.csv')
df
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
0 Aruba ABW 1.66325 NaN NaN NaN 48.721939 NaN 0.103744
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
3 Albania ALB 1.76925 12.327586 574.202694 2.836021 47.201082 29.25 2.888280
4 Andorra AND NaN 3.197538 4421.224933 7.260281 47.123345 NaN 0.079547
... ... ... ... ... ... ... ... ... ...
211 Kosovo XKX 2.14250 6.804620 NaN NaN NaN NaN 1.813820
212 Yemen, Rep. YEM 4.22575 36.819337 207.949700 1.417836 44.470076 399.75 26.246608
213 South Africa ZAF 2.37525 345.209888 1123.142656 4.241441 48.516298 143.75 54.177209
214 Zambia ZMB 5.39425 24.280990 185.556359 2.687290 49.934484 233.75 15.633220
215 Zimbabwe ZWE 3.94300 15.495514 115.519881 2.695188 49.529875 398.00 15.420964

216 rows × 9 columns

In terms of expressions, the first line is an assignment statement, where the RHS is, as ever, an expression. In this case, it’s a call expression in which we call the pd.read_csv function with the argument gender_stats.csv, returning a DataFrame value.

The second line is an expression that evaluates to that DataFrame value, so Jupyter shows us the DataFrame attached to the name df.

Now let us consider Pandas direct indexing with column labels (DICL).

col_name = 'gdp_us_billion'
col_name
'gdp_us_billion'

The assignment statement RHS expression is a string literal expression, asking Pandas to make a str value.

Next we extract the named column with DICL into the DataFrame:

# DICL
gdp = df[col_name]
gdp
0             NaN
1       19.961015
2      111.936542
3       12.327586
4        3.197538
          ...    
211      6.804620
212     36.819337
213    345.209888
214     24.280990
215     15.495514
Name: gdp_us_billion, Length: 216, dtype: float64

The assignment statement RHS above is a Pandas indexing expression, with df (resulting in the DataFrame value), followed by square brackets, with an expression inside. In this case the expression gives a str value indicating the column name. When we pass an expression giving a string inside the indexing square brackets, this causes Pandas to select the column with the given label.

The result, as usual, is Pandas Series — the structure that Python uses to store a column of data.

type(gdp)
pandas.core.series.Series

DICL is the first type of Pandas indexing expression for DataFrames. In DICL, we put a column label inside the square brackets.

The other type of Pandas indexing expression that we have seen is direct indexing with Boolean Series. In this form of indexing, we put a Boolean Series inside the square brackets.

We can form a Boolean Series by evaluating a comparison expression on a Series.

# The RHS is a comparison expression on a Series
gdp_gt_1000 = gdp > 1000
# Show the result.
gdp_gt_1000
0      False
1      False
2      False
3      False
4      False
       ...  
211    False
212    False
213    False
214    False
215    False
Name: gdp_us_billion, Length: 216, dtype: bool

DIBS occurs where we put the Boolean Series inside the square brackets following a DataFrame value:

# DICS indexing expression
df[gdp_gt_1000]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
26 Brazil BRA 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
35 China CHN 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
58 Spain ESP 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
88 India IND 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
104 Korea, Rep. KOR 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
124 Mexico MEX 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450
164 Russian Federation RUS 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175

Notice that this has the effect of selecting rows in df for which the GDP values were greater than 1000, returning a new DataFrame, containing only those rows that have corresponding True values in the Boolean Series.

The form above is a little long-winded:

  • We first make a variable containing the Boolean Series

  • We then use the variable as the input to the DIBS indexing expression.

To be more compact, we often find ourselves calculating the Boolean Series on the fly, like this.

# DIBS indexing expression, calculating the BS inside the brackets
df[gdp > 1000]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
26 Brazil BRA 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
35 China CHN 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
58 Spain ESP 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
88 India IND 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
104 Korea, Rep. KOR 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
124 Mexico MEX 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450
164 Russian Federation RUS 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175

In the expression above, Python first evaluates the expression inside the square brackets: gdp > 1000. Evaluating this expression gives a value, which is a Boolean Series — the same Boolean Series we made before, and attached to the variable gdp_gt_1000. Because the value is inside square brackets, the DataFrame will get this to do indexing, and again, will select the rows with True for the gdp > 1000 comparison.

Notice that, in order to run the code above, we had to first make the gdp variable containing the Series.

Sometimes we want to get the same result by generating the Series, and then the Boolean Series, using expressions inside the square brackets. For example, consider the following code:

df[df['gdp_us_billion'] > 1000]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
26 Brazil BRA 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
35 China CHN 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
58 Spain ESP 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
88 India IND 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
104 Korea, Rep. KOR 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
124 Mexico MEX 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450
164 Russian Federation RUS 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175

Notice that this has the same effect as our previous DIBS indexing expressions, because it ends up doing the same indexing.

Read the expression above as a series of sub-expressions:

  1. 'gdp_us_billion', a string literal expression, giving a string

  2. 1000, a integer literal expression, giving an integer.

  3. df['gdp_us_billion'], a DICL indexing expression, giving a Series,

  4. df['gdp_us_billion'] > 1000, a comparison expression, comparing the Series from 3. above to 1000 from 2, and resulting in a Boolean Series.

The whole expression is then a DIBS expression, indexing into a Pandas DataFrame with a Boolean Series.