Indexing in Pandas

Indexing in Pandas#

Way back in the basic row indexing, we showed row selection by indexing into data frames with Boolean vectors, such as Pandas Series. We called this direct indexing with Boolean Series (DIBS). We also covered selecting columns by indexing into data frames with column labels. We called this direct indexing with column labels (DICL).

Simple indexing and the golden rule#

We will use the term simple indexing to cover both of DIBS and DICL.

Simple Indexing

Direct indexing with either Boolean Series (DIBS) or column labels (DICL).

Here follow some examples.

We start with our usual imports:

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

We return to a version of our first data frame, containing statistics on development and gender from the World Bank.

This is the gender_stats.csv file. See the gender stats data page for some more detail.

Download that file to the same directory as this notebook, if you are running on your own computer.

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Show the first 10 rows
gender_data.head(10)

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.66325	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.76925	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
5	United Arab Emirates	ARE	1.79300	375.027082	2202.407569	2.581168	48.789260	6.00	9.080299
6	Argentina	ARG	2.32800	550.980968	1148.256142	2.782216	48.915810	53.75	42.976675
7	Armenia	ARM	1.54550	10.885362	348.663884	1.916016	46.782180	27.25	2.904683
8	American Samoa	ASM	NaN	0.640500	NaN	NaN	NaN	NaN	0.055422
9	Antigua and Barbuda	ATG	2.08200	1.298213	1152.493656	3.676514	48.291463	NaN	0.098872

Direct Indexing with Column Labels (DICL)#

Direct Indexing with Column Labels (DICL) happens when we do direct indexing (square brackets immediately follow the data frame variable), where one or more column labels go inside the square brackets.

For example, here we use DICL to get the values for one column as a Pandas Series:

# DICL with a single column name
gender_data['mat_mort_ratio']

       NaN
    444.00
    501.25
     29.25
       NaN
        ...  
     NaN
  399.75
  143.75
  233.75
  398.00
Name: mat_mort_ratio, Length: 216, dtype: float64

We can use more than one column label by putting the column labels in a list:

# DICL with a two column names in a list.
cols_we_want = ['gdp_us_billion', 'mat_mort_ratio']
gender_data[cols_we_want]

	gdp_us_billion	mat_mort_ratio
0	NaN	NaN
1	19.961015	444.00
2	111.936542	501.25
3	12.327586	29.25
4	3.197538	NaN
...	...	...
211	6.804620	NaN
212	36.819337	399.75
213	345.209888	143.75
214	24.280990	233.75
215	15.495514	398.00

216 rows × 2 columns

# DICL with a two column names in a list.
# Here we create the list inside the indexing brackets, instead
# of first creating it and storing it in a variable.
# Notice the two uses of square brackets - following "gender_data" to
# do indexing, and around the column labels to create a list.
gender_data[['gdp_us_billion', 'mat_mort_ratio']]

	gdp_us_billion	mat_mort_ratio
0	NaN	NaN
1	19.961015	444.00
2	111.936542	501.25
3	12.327586	29.25
4	3.197538	NaN
...	...	...
211	6.804620	NaN
212	36.819337	399.75
213	345.209888	143.75
214	24.280990	233.75
215	15.495514	398.00

216 rows × 2 columns

DICL gives a single column (Series) with a single label, or a subset of columns (a new data frame) when we pass a list of column labels.

Direct Indexing with Boolean Series (DIBS)#

Direct Indexing with a Boolean Series (DIBS) happens when we put a Boolean Series within the indexing brackets.

DIBS selects a subset of the rows. Specifically, it selects the rows where the Boolean Series has True. It returns a subset of the original dataframe.

For example, let us say we wanted to select the rows in the gender_data data frame where the mat_mort_ratio value was greater than 100.

We first create a Boolean Series, with True for rows where mat_mort_ratio is greater than 100, and False otherwise.

# DICL to get the mat_mort_ratio column as a Series.
mmr = gender_data['mat_mort_ratio']
# Make a Boolean Series with True for > 100, False otherwise.
mmr_gt_100 = mmr > 100

We index with the Boolean Series (DIBS), to select only those rows where mat_mort_ratio was greater than 100. Notice that you get a data frame back.

# DIBS
gender_data[mmr_gt_100]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
13	Burundi	BDI	5.99225	2.876978	59.693830	4.433749	50.481971	747.25	9.907015
15	Benin	BEN	4.80675	8.778151	83.726190	2.206916	47.211127	417.50	10.293715
16	Burkina Faso	BFA	5.56450	11.753054	87.143009	3.021246	48.217520	384.25	17.597395
...	...	...	...	...	...	...	...	...	...
199	Uganda	UGA	5.82250	25.941461	132.892684	2.014349	50.099485	366.50	38.865339
212	Yemen, Rep.	YEM	4.22575	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.37525	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.39425	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.94300	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

72 rows × 9 columns

Here we created the Boolean Series in two steps, and then used it to do the indexing. It’s more common to create the Boolean Series with an expression inside the indexing brackets:

# DIBS, creating the BS with an expression inside the indexing brackets.
gender_data[gender_data['mat_mort_ratio'] > 100]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
13	Burundi	BDI	5.99225	2.876978	59.693830	4.433749	50.481971	747.25	9.907015
15	Benin	BEN	4.80675	8.778151	83.726190	2.206916	47.211127	417.50	10.293715
16	Burkina Faso	BFA	5.56450	11.753054	87.143009	3.021246	48.217520	384.25	17.597395
...	...	...	...	...	...	...	...	...	...
199	Uganda	UGA	5.82250	25.941461	132.892684	2.014349	50.099485	366.50	38.865339
212	Yemen, Rep.	YEM	4.22575	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.37525	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.39425	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.94300	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

72 rows × 9 columns

Notice that this is exactly the same as the indexing above; Python first sees and evaluates the expression inside the square brackets: gender_data['mat_mort_ratio'] > 100, to make the Boolean Series, and then passes the resulting series to the data frame indexing machinery to select the rows.

Indirect indexing#

The rest of this page covers more sophisticated indexing in Pandas - indirect indexing.

Why don’t we use the term direct indexing instead of simple indexing? Because there are various other things you can put inside the square brackets using direct indexing. We won’t go into other uses of direct indexing here, because we strongly recommend that you restrict yourself to DIBS and DICL for direct indexing — because other uses of direct indexing can quickly become confusing and ambiguous, and because you can do all the indexing you need using the indirect indexing techniques in this page. As you will see, indirect indexing uses the .loc and .ilocattributes of the DataFrame.

Here is the rule we propose.

The Golden Rule of Pandas Indexing

Only use Boolean Series or column labels for direct indexing. For any other type of indexing, use indirect indexing with .loc and .iloc.

We do not just advocate this rule for beginners. We (your humble authors) also stick to this rule, not only in this textbook, but in our daily work. By sticking to this rule, you will avoid some nasty and confusing errors, and you will find it easier to think clearly about indexing.

Indirect indexing#

To understand indirect indexing, we return to the index of data frames and Series. The index contains the row labels for a data frame or series.

We start with our usual imports:

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

We return to a version of our first data frame, containing statistics on development and gender from the World Bank.

This is the gender_stats.csv file. See the gender stats data page for some more detail.

Download that file to the same directory as this notebook, if you are running on your own computer.

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Show the first 10 rows
gender_data.head(10)

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.66325	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.76925	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
5	United Arab Emirates	ARE	1.79300	375.027082	2202.407569	2.581168	48.789260	6.00	9.080299
6	Argentina	ARG	2.32800	550.980968	1148.256142	2.782216	48.915810	53.75	42.976675
7	Armenia	ARM	1.54550	10.885362	348.663884	1.916016	46.782180	27.25	2.904683
8	American Samoa	ASM	NaN	0.640500	NaN	NaN	NaN	NaN	0.055422
9	Antigua and Barbuda	ATG	2.08200	1.298213	1152.493656	3.676514	48.291463	NaN	0.098872

As you know, the names at the top, in bold typeface, are the names of the columns. We can see these names using the columns attribute of the data frame:

gender_data.columns

Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

These can also be called the column labels.

Here is a summary of the meaning of most of the columns:

fert_rate: Fertility rate, total (births per woman).
gdp_us_billion: GDP (in current US $ billions).
health_exp_per_cap: Health expenditure per capita, PPP (constant 2011 international $).
health_exp_pub: Health expenditure, public (% of GDP).
prim_ed_girls: Primary education, pupils (% female).
mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).
population: Population, total (millions).

We have not paid attention so far to the numbers at the left of the table display, also in bold typeface.

In this case, they are ascending integers, starting at 0, with value 0 for the first row, 1 for the second row, and so on.

These numbers are the index of the data frame. The index contains the row labels.

We can see the row labels / index with the index attribute of the data frame.

gender_data.index

RangeIndex(start=0, stop=216, step=1)

This is saying we have a simple index, that is like a np.arange, starting at 0, and continuing in steps of 1 up to, but not including 216.

So far, the index looks like row numbers, with 0 for the first row, 1 for the second row. We will soon see that the index is more than that.

For example, let us look at the last 15 rows. We can get these with the tail method of the data frame.

last_15 = gender_data.tail(15)
last_15

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
201	Uruguay	URY	2.027000	54.345132	1721.507752	6.044403	48.295555	15.50	3.419977
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
203	Uzbekistan	UZB	2.372750	61.340649	334.476754	3.118842	48.387434	37.00	30.784500
204	St. Vincent and the Grenadines	VCT	1.986000	0.730107	775.803386	4.365757	48.536415	45.75	0.109421
205	Venezuela, RB	VEN	2.378250	376.146268	896.815314	1.587088	48.400934	97.00	30.734524
206	British Virgin Islands	VGB	NaN	NaN	NaN	NaN	47.581520	NaN	0.029585
207	Virgin Islands (U.S.)	VIR	1.760000	3.812000	NaN	NaN	NaN	NaN	0.104141
208	Vietnam	VNM	1.959500	181.820736	368.374550	3.779501	48.021053	54.75	90.742400
209	Vanuatu	VUT	3.364750	0.782876	125.568712	3.689874	47.301617	82.50	0.258896
210	Samoa	WSM	4.118250	0.799887	366.353096	5.697059	48.350049	54.75	0.192225
211	Kosovo	XKX	2.142500	6.804620	NaN	NaN	NaN	NaN	1.813820
212	Yemen, Rep.	YEM	4.225750	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.375250	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.394250	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.943000	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

Here is the index of this new data frame.

last_15.index

RangeIndex(start=201, stop=216, step=1)

Notice that the first row of this new data frame no longer has the label 0 - it has label 201. These numbers are the labels from the original data frame, from which we selected the rows.

Notice too that the second row in this table is the USA, and it has the label \202.

Now imagine that I sort the original data frame by Gross Domestic Product, and select the top 15 rows. By doing this, I select the top 15 richest countries.

# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 15 rows.
richest_15 = df_by_gdp.head(15)
richest_15

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
26	Brazil	BRA	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
88	India	IND	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
164	Russian Federation	RUS	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
104	Korea, Rep.	KOR	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
58	Spain	ESP	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
124	Mexico	MEX	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450

Now notice that the index values have followed their respective rows. The row for the USA is now first, but it has kept its original index value (row label) - 202. The index value identifies the row.

Here is the index for this small table:

richest_15.index

Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Now let us say that I wanted to get the eleventh row of this table.

I would be tempted to do something like richest_15[10] - but then - what would Pandas give me? Does 10 mean the row label for “Australia”? Or does it mean the eleventh row of the table, as it would for an array, and therefore the row for “Canada”? That is, will the indexing using the labels (index values)? Or will it use the positions (as for indexing in Numpy arrays)?

The answer is, it is ambiguous, and when things are ambiguous, Pandas has to guess. That’s a bad idea, because then we have to guess what Pandas has guessed, and it is starting to get really confusing.

For that reason, we careful Pandas users do not index rows with code like richest_15[10]. When we are not using Booleans to select rows, we always use one of the two indexing attributes that Pandas provides. Each does a different type of indexing.

The loc (label locate) attribute indexes using the labels (row index).
The iloc (position locate) attribute indexes using the positions.

Here they are in action:

# Get the row with label 10
richest_15.loc[10]

country_name            Australia
country_code                  AUS
fert_rate                  1.8615
gdp_us_billion        1422.994116
health_exp_per_cap    4256.058988
health_exp_pub           6.292381
prim_ed_girls           48.576707
mat_mort_ratio                6.0
population               23.44456
Name: 10, dtype: object

# Get the row at position / offset 10
richest_15.iloc[10]

country_name               Canada
country_code                  CAN
fert_rate                  1.6003
gdp_us_billion        1708.473627
health_exp_per_cap    4616.539134
health_exp_pub           7.546247
prim_ed_girls           48.808926
mat_mort_ratio               7.25
population              35.517119
Name: 32, dtype: object

In more details, what is happening in the first cell above is that richest_15.loc gives us a special value (AKA object) that:

knows about the richest_15 data frame;
will give us rows and columns, if we use [ something ] on it.

So, read richest_15.loc[10] as:

Get the richest_15 value (the data frame);
Get the loc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at labels;
Use this object to fetch the row labeled 10, by using indexing ([10])

Likewise, read richest_15.iloc[10] as:

Get the richest_15 value (the data frame);
Get the iloc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at positions (offsets);
Use this object to fetch the row at offset 10, by using indexing ([10]).

When you select rows and columns from a data frame, consider carefully what you want.

Do you want to select rows and columns by label (by index, in the rows). Then you want .loc.

Do you want to select rows and columns by position (offset). Then you want .iloc.

indexing and slices#

.loc indexing gets you rows and columns by label.

One form of indexing you have seen for arrays, is labeling by slice, to get a sequence of values.

A slice is an expression with a colon (:) in it somewhere. Here’s slice indexing into an array:

my_array = np.array(['one', 'two', 'three', 'four', 'five'])
# Give array of values from position 2 up to, not including position 4
my_array[2:4]

array(['three', 'four'], dtype='<U5')

The same sort of indexing works using both .iloc and .loc. For .iloc, the parallel is obvious - because we specify positions in exactly the same way:

# Rows from position 2, up to, not including, position 4
richest_15.iloc[2:4]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
97	Japan	JPN	1.43	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
49	Germany	DEU	1.45	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645

You can also do this kind of indexing with .loc, but it works differently. To make this clear, let’s start with .loc indexing where the row labels are strings. We set the row labels to be the country codes.

richest_15_with_code = richest_15.set_index('country_code')
richest_15_with_code

	country_name	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
country_code
USA	United States	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
CHN	China	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
JPN	Japan	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
DEU	Germany	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
GBR	United Kingdom	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
FRA	France	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
BRA	Brazil	1.795250	2198.765606	1303.199104	3.773473	47.784577	49.50	204.159544
IND	India	2.449250	2019.005411	241.572477	1.292666	49.497234	185.25	1293.742537
ITA	Italy	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
RUS	Russian Federation	1.724500	1822.691700	1755.506635	3.731354	48.968070	25.25	143.793504
CAN	Canada	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
AUS	Australia	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560
KOR	Korea, Rep.	1.232000	1346.751162	2385.447251	3.915606	48.023388	12.00	50.727212
ESP	Spain	1.307500	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
MEX	Mexico	2.257000	1188.802780	1081.208948	3.225839	48.906296	40.00	124.203450

As we would expect, we will get an error if we try to use numbers for .loc row indexing, because .loc indexing only uses labels, and the labels are now all strings:

richest_15_with_code.loc[2:4]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 richest_15_with_code.loc[2:4]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
   6790 original_label = label
   6792 # For datetime indices label may be a string that has to be converted
   6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
   6796 # we need to look up the label
   6797 try:

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6727, in Index._maybe_cast_slice_bound(self, label, side)
   6725 # reject them, if index does not contain label
   6726 if (is_float(label) or is_integer(label)) and label not in self:
-> 6727     self._raise_invalid_indexer("slice", label)
   6729 return label

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:4301, in Index._raise_invalid_indexer(self, form, key, reraise)
   4299 if reraise is not lib.no_default:
   4300     raise TypeError(msg) from reraise
-> 4301 raise TypeError(msg)

TypeError: cannot do slice indexing on Index with these indexers [2] of type int

In fact we can use slices, but the value in front of the colon (if any) is the label of the first row we want, and the value after the colon is the label of the last row we want.

Notice that this is different from position indexing for arrays, or using .iloc. In those cases, the value after the colon gives the stop position. We return values from positions up to but not including the stop position. As a reminder:

# From position 2 up to, but not including, position 4
richest_15_with_code.iloc[2:4]

	country_name	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
country_code
JPN	Japan	1.43	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
DEU	Germany	1.45	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645

Now consider this .loc indexing with a slice. Here the value after the colon is not a stop value, but an end label. We do return the row corresponding to the end label:

# All countries starting at label 'JPN' and ending (including)
# row labeled 'FRA'
richest_15_with_code.loc['JPN':'FRA']

	country_name	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
country_code
JPN	Japan	1.4300	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
DEU	Germany	1.4500	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
GBR	United Kingdom	1.8425	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
FRA	France	2.0050	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099

For reflection - why did Pandas choose to make label indexing work this way? Have a look at the example above.

Loc and iloc indexing work on columns too#

Consider this. We use .loc and the label for the row, followed by the label for the column.

richest_15.loc[10, 'gdp_us_billion']

1422.994116449394

This selects the row labeled 10 (the USA row) and returns the value from the gdp_us_billion column.

Conversely, iloc only understands positions, and so, integers. It does not understand or use labels:

# We can't specify columns by label, using iloc.
richest_15.iloc[10, 'gdp_us_billion']

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
    965 try:
--> 966     self._validate_key(k, i)
    967 except ValueError as err:

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1614, in _iLocIndexer._validate_key(self, key, axis)
   1613 else:
-> 1614     raise ValueError(f"Can only index by location with a [{self._valid_types}]")

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[26], line 2
      1 # We can't specify columns by label, using iloc.
----> 2 richest_15.iloc[10, 'gdp_us_billion']

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
   1689 def _getitem_tuple(self, tup: tuple):
-> 1690     tup = self._validate_tuple_indexer(tup)
   1691     with suppress(IndexingError):
   1692         return self._getitem_lowerdim(tup)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:968, in _LocationIndexer._validate_tuple_indexer(self, key)
    966         self._validate_key(k, i)
    967     except ValueError as err:
--> 968         raise ValueError(
    969             "Location based indexing can only have "
    970             f"[{self._valid_types}] types"
    971         ) from err
    972 return key

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Notice the somewhat informative error right at the bottom of the error message.

iloc can get columns by their position. For example, fertility_rate is the third column, and so we can get the fertility rate by asking iloc for the eleventh row, third column, like so:

richest_15.iloc[10, 2]

1.6002999999999998

This is the value from the row at offset 10 and the column at offset 2.

Series have an index too#

Data frames have columns (and column labels), and rows (with row labels — the index).

Series also have element labels — they also have an index.

Consider this column, of the fertility rates of the richest countries:

# Get the fert_rate column from the data frame
rich_ferts = richest_15['fert_rate']
rich_ferts.head()

  1.860875
   1.558750
   1.430000
   1.450000
   1.842500
Name: fert_rate, dtype: float64

As y’all know, this is a Series:

type(rich_ferts)

pandas.core.series.Series

You probably also spotted the numbers to the left of the values. Yes, the row labels have come across from the data frame into the Series. We can see that the first element has row label 202, and we remember that this is the row label (index value) for the USA.

As for the data frames, here is the index for the series:

rich_ferts.index

Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Like data frames, we can index by label, with loc, and by position, with iloc:

# The value for row labeled 10 - Australia.
rich_ferts.loc[10]

1.8615

# The value for the row at position 10 - Canada
rich_ferts.iloc[10]

1.6002999999999998

Rows from data frames are Series#

Here we select the row at position 0 (the first row):

first_row = richest_15.iloc[0]
first_row

country_name          United States
country_code                    USA
fert_rate                  1.860875
gdp_us_billion           17369.1246
health_exp_per_cap      9060.068657
health_exp_pub             8.121961
prim_ed_girls              48.75883
mat_mort_ratio                 14.0
population               318.558175
Name: 202, dtype: object

This is a Series:

type(first_row)

pandas.core.series.Series

Notice that the Series has 9 elements, one for each column in the data frame.

As you remember, we can see the number of rows and columns in a data frame with:

richest_15.shape

(15, 9)

Remember, the columns have names, which we can also call labels.

When we select a row out of a data frame, we have one element per column, and the labels for the columns now become the labels for the rows.

first_row.index

Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

As for any other Series, we can get elements by label, with loc:

first_row.loc['fert_rate']

1.860875

We can also get elements by position (offset), with iloc:

first_row.iloc[0]

'United States'

You can index without loc and iloc, but please don’t, except …#

In the pages following the data frame introduction, you saw that we can index into data frame directly, using [ something ] after the data frame value, without using loc or iloc. Call this direct indexing. This is a very reasonable thing to do if:

the something is a Boolean Series (DIBS). This is not ambiguous.
the something is a column label or list of labels (DICL). This isn’t ambiguous either.

Both of these are entirely reasonable uses of direct indexing:

# Direct indexing with a Boolean Series (DIBS).
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
is_big_spender = richest_15['health_exp_per_cap'] > 3000  # Boolean Series
big_health_spenders = richest_15[is_big_spender]
big_health_spenders

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560

# Direct indexing with column labels (DICL).
# Select the 'country_code' column.
country_code = richest_15['country_code']
country_code

  USA
   CHN
   JPN
   DEU
   GBR
   FRA
   BRA
   IND
   ITA
  RUS
   CAN
   AUS
  KOR
   ESP
  MEX
Name: country_code, dtype: object

Those uses of direct indexing are entirely sensible. Other uses of direct indexing are not sensible, and we strongly encourage you not to use them.

For example, Pandas will also allow you to this, with the Series we just made:

# Nooooooo !
country_code[10]

'AUS'

Please don’t do that. This is ambiguous. Did you really mean the country with label 10? Or did you mean the country at position 10? Pandas will guess for you. Don’t let it. If you are not selecting rows / elements with Booleans, always do this:

# By label
country_code.loc[10]

'AUS'

or this:

# By position
country_code.iloc[10]

'CAN'

You can use Boolean indexing with `.loc`#

It is often convenient to use direct indexing with Booleans, as we have just seen:

# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_15['health_exp_per_cap'] > 3000
is_big_spender

   True
   False
    True
    True
    True
    True
   False
   False
    True
  False
    True
    True
  False
   False
  False
Name: health_exp_per_cap, dtype: bool

# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
big_health_spenders = richest_15[is_big_spender]
big_health_spenders

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557
63	France	FRA	2.005000	2647.649725	4387.835406	8.920420	48.772050	8.75	66.302099
94	Italy	ITA	1.390000	2005.983980	3266.984094	6.984374	48.407573	4.00	60.378795
32	Canada	CAN	1.600300	1708.473627	4616.539134	7.546247	48.808926	7.25	35.517119
10	Australia	AUS	1.861500	1422.994116	4256.058988	6.292381	48.576707	6.00	23.444560

We can also use Booleans with loc indexing. This allows us to do things such as selecting rows with Booleans, and columns with labels, in one shot, like this:

# Boolean indexing for rows, with "loc"
# Get "country_name" column values for big health spenders:
big_spender_names = richest_15.loc[is_big_spender, 'country_name']
big_spender_names

   United States
            Japan
          Germany
   United Kingdom
           France
            Italy
           Canada
        Australia
Name: country_name, dtype: object

In fact you can use some Boolean sequences with iloc as well, but that needs a little more explanation.

Indexing in Pandas

Contents

Indexing in Pandas#

Simple indexing and the golden rule#

Direct Indexing with Column Labels (DICL)#

Direct Indexing with Boolean Series (DIBS)#

Indirect indexing#

Indirect indexing#

indexing and slices#

Loc and iloc indexing work on columns too#

Series have an index too#

Rows from data frames are Series#

You can index without loc and iloc, but please don’t, except …#

You can use Boolean indexing with .loc#

You can use Boolean indexing with `.loc`#