Pandas and indexing expressions

Pandas and indexing expressions#

The subexpressions page discussed the idea of expressions that get evaluated in the course of evaluating a larger expression.

In that page you saw examples of expressions and sub-expressions that were:

Floating point literal expressions (e.g. 10.50).
Variables (e.g. a).
Call expressions (e.g. cos(0)).

This page introduces indexing expressions in Numpy and Pandas, and the way you can build up indexing expressions and sub-expressions to select data.

First we start with our usual imports.

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

Numpy indexing expressions#

Let us make an array:

a = np.arange(10, 20, 2)

The first line is an assignment statements, with the variable name (a) on the Left Hand Side (LHS) of the =, and an expression (code that returns a value) on the Right Hand Side (RHS).

The RHS expression is np.arange(10, 20, 2). This is a call expression, where we call the np.arange function with three arguments. Evaluating the expression generates an array with values [10, 12, 14, 16, 18]:

array([10, 12, 14, 16, 18])

Now consider indexing into the array. Indexing occurs where we follow a value by square brackets, with something in between.

Here we follow the expression a (referring to our array value) by [2] (the expression 2 between square brackets:

# In indexing expression.
a[2]

We can see this is an expression because Jupyter shows a value when we execute the cell, meaning, that a[2] is something that returns a value — an expression.

a[2] means:

fetch the element at position 2 from the array a.

We have found that we can index arrays by putting three types of things between the square brackets attached to an array:

Integers, to specify an element at a given position in the array.
Slices. Slices are expressions containing colons of start:stop:step.
Boolean arrays.

The next three code cells are examples of indexing expressions:

a[2]  # Indexing with integers

a[1:-1:2]  # Indexing with a slice

array([12, 16])

b = np.array([True, False, True, True, False])
a[b]  # Indexing with Boolean array

array([10, 14, 16])

Pandas indexing expressions#

Here is our familiar data frame of data from the World Bank:

df = pd.read_csv('gender_stats.csv')
df

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.66325	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.76925	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
...	...	...	...	...	...	...	...	...	...
211	Kosovo	XKX	2.14250	6.804620	NaN	NaN	NaN	NaN	1.813820
212	Yemen, Rep.	YEM	4.22575	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.37525	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.39425	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.94300	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

216 rows × 9 columns

In terms of expressions, the first line is an assignment statement, where the RHS is, as ever, an expression. In this case, it’s a call expression in which we call the pd.read_csv function with the argument gender_stats.csv, returning a DataFrame value.

The second line is an expression that evaluates to that DataFrame value, so Jupyter shows us the DataFrame attached to the name df.

Now let us consider Pandas direct indexing with column labels (DICL).

col_name = 'gdp_us_billion'
col_name

'gdp_us_billion'

The assignment statement RHS expression is a string literal expression, asking Pandas to make a str value.

Next we extract the named column with DICL into the DataFrame:

# DICL
gdp = df[col_name]
gdp

           NaN
     19.961015
    111.936542
     12.327586
      3.197538
          ...    
    6.804620
   36.819337
  345.209888
   24.280990
   15.495514
Name: gdp_us_billion, Length: 216, dtype: float64

The assignment statement RHS above is a Pandas indexing expression, with df (resulting in the DataFrame value), followed by square brackets, with an expression inside. In this case the expression gives a str value indicating the column name. When we pass an expression giving a string inside the indexing square brackets, this causes Pandas to select the column with the given label.

The result, as usual, is Pandas Series — the structure that Python uses to store a column of data.

type(gdp)

pandas.core.series.Series

DICL is the first type of Pandas indexing expression for DataFrames. In DICL, we put a column label inside the square brackets.

The other type of Pandas indexing expression that we have seen is direct indexing with Boolean Series. In this form of indexing, we put a Boolean Series inside the square brackets.

We can form a Boolean Series by evaluating a comparison expression on a Series.

# The RHS is a comparison expression on a Series
gdp_gt_1000 = gdp > 1000
# Show the result.
gdp_gt_1000

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: gdp_us_billion, Length: 216, dtype: bool

DIBS occurs where we put the Boolean Series inside the square brackets following a DataFrame value:

# DICS indexing expression
df[gdp_gt_1000]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
58	Spain	ESP	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
88	India	IND	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
104	Korea, Rep.	KOR	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
124	Mexico	MEX	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450
164	Russian Federation	RUS	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175

Notice that this has the effect of selecting rows in df for which the GDP values were greater than 1000, returning a new DataFrame, containing only those rows that have corresponding True values in the Boolean Series.

The form above is a little long-winded:

We first make a variable containing the Boolean Series
We then use the variable as the input to the DIBS indexing expression.

To be more compact, we often find ourselves calculating the Boolean Series on the fly, like this.

# DIBS indexing expression, calculating the BS inside the brackets
df[gdp > 1000]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
58	Spain	ESP	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
88	India	IND	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
104	Korea, Rep.	KOR	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
124	Mexico	MEX	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450
164	Russian Federation	RUS	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175

In the expression above, Python first evaluates the expression inside the square brackets: gdp > 1000. Evaluating this expression gives a value, which is a Boolean Series — the same Boolean Series we made before, and attached to the variable gdp_gt_1000. Because the value is inside square brackets, the DataFrame will get this to do indexing, and again, will select the rows with True for the gdp > 1000 comparison.

Notice that, in order to run the code above, we had to first make the gdp variable containing the Series.

Sometimes we want to get the same result by generating the Series, and then the Boolean Series, using expressions inside the square brackets. For example, consider the following code:

df[df['gdp_us_billion'] > 1000]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
58	Spain	ESP	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
88	India	IND	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
104	Korea, Rep.	KOR	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
124	Mexico	MEX	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450
164	Russian Federation	RUS	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175

Notice that this has the same effect as our previous DIBS indexing expressions, because it ends up doing the same indexing.

Read the expression above as a series of sub-expressions:

'gdp_us_billion', a string literal expression, giving a string
1000, a integer literal expression, giving an integer.
df['gdp_us_billion'], a DICL indexing expression, giving a Series,
df['gdp_us_billion'] > 1000, a comparison expression, comparing the Series from 3. above to 1000 from 2, and resulting in a Boolean Series.

The whole expression is then a DIBS expression, indexing into a Pandas DataFrame with a Boolean Series.

Pandas and indexing expressions

Contents

Pandas and indexing expressions#

Numpy indexing expressions#

Pandas indexing expressions#