Indexing in Pandas#

Way back in the basic row indexing, we showed row selection by indexing into data frames with Boolean vectors, such as Pandas Series. We called this direct indexing with Boolean Series (DIBS). We also covered selecting columns by indexing into data frames with column labels. We called this direct indexing with column labels (DICL).

Simple indexing and the golden rule#

We will use the term simple indexing to cover both of DIBS and DICL.

Simple Indexing

Direct indexing with either Boolean Series (DIBS) or column labels (DICL).

Here follow some examples.

We start with our usual imports:

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

We return to a version of our first data frame, containing statistics on development and gender from the World Bank.

This is the gender_stats.csv file. See the gender stats data page for some more detail.

Download that file to the same directory as this notebook, if you are running on your own computer.

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Show the first 10 rows
gender_data.head(10)
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
0 Aruba ABW 1.66325 NaN NaN NaN 48.721939 NaN 0.103744
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
3 Albania ALB 1.76925 12.327586 574.202694 2.836021 47.201082 29.25 2.888280
4 Andorra AND NaN 3.197538 4421.224933 7.260281 47.123345 NaN 0.079547
5 United Arab Emirates ARE 1.79300 375.027082 2202.407569 2.581168 48.789260 6.00 9.080299
6 Argentina ARG 2.32800 550.980968 1148.256142 2.782216 48.915810 53.75 42.976675
7 Armenia ARM 1.54550 10.885362 348.663884 1.916016 46.782180 27.25 2.904683
8 American Samoa ASM NaN 0.640500 NaN NaN NaN NaN 0.055422
9 Antigua and Barbuda ATG 2.08200 1.298213 1152.493656 3.676514 48.291463 NaN 0.098872

Direct Indexing with Column Labels (DICL)#

Direct Indexing with Column Labels (DICL) happens when we do direct indexing (square brackets immediately follow the data frame variable), where one or more column labels go inside the square brackets.

For example, here we use DICL to get the values for one column as a Pandas Series:

# DICL with a single column name
gender_data['mat_mort_ratio']
0         NaN
1      444.00
2      501.25
3       29.25
4         NaN
        ...  
211       NaN
212    399.75
213    143.75
214    233.75
215    398.00
Name: mat_mort_ratio, Length: 216, dtype: float64

We can use more than one column label by putting the column labels in a list:

# DICL with a two column names in a list.
cols_we_want = ['gdp_us_billion', 'mat_mort_ratio']
gender_data[cols_we_want]
gdp_us_billion mat_mort_ratio
0 NaN NaN
1 19.961015 444.00
2 111.936542 501.25
3 12.327586 29.25
4 3.197538 NaN
... ... ...
211 6.804620 NaN
212 36.819337 399.75
213 345.209888 143.75
214 24.280990 233.75
215 15.495514 398.00

216 rows × 2 columns

# DICL with a two column names in a list.
# Here we create the list inside the indexing brackets, instead
# of first creating it and storing it in a variable.
# Notice the two uses of square brackets - following "gender_data" to
# do indexing, and around the column labels to create a list.
gender_data[['gdp_us_billion', 'mat_mort_ratio']]
gdp_us_billion mat_mort_ratio
0 NaN NaN
1 19.961015 444.00
2 111.936542 501.25
3 12.327586 29.25
4 3.197538 NaN
... ... ...
211 6.804620 NaN
212 36.819337 399.75
213 345.209888 143.75
214 24.280990 233.75
215 15.495514 398.00

216 rows × 2 columns

DICL gives a single column (Series) with a single label, or a subset of columns (a new data frame) when we pass a list of column labels.

Direct Indexing with Boolean Series (DIBS)#

Direct Indexing with a Boolean Series (DIBS) happens when we put a Boolean Series within the indexing brackets.

DIBS selects a subset of the rows. Specifically, it selects the rows where the Boolean Series has True. It returns a subset of the original dataframe.

For example, let us say we wanted to select the rows in the gender_data data frame where the mat_mort_ratio value was greater than 100.

We first create a Boolean Series, with True for rows where mat_mort_ratio is greater than 100, and False otherwise.

# DICL to get the mat_mort_ratio column as a Series.
mmr = gender_data['mat_mort_ratio']
# Make a Boolean Series with True for > 100, False otherwise.
mmr_gt_100 = mmr > 100

We index with the Boolean Series (DIBS), to select only those rows where mat_mort_ratio was greater than 100. Notice that you get a data frame back.

# DIBS
gender_data[mmr_gt_100]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
13 Burundi BDI 5.99225 2.876978 59.693830 4.433749 50.481971 747.25 9.907015
15 Benin BEN 4.80675 8.778151 83.726190 2.206916 47.211127 417.50 10.293715
16 Burkina Faso BFA 5.56450 11.753054 87.143009 3.021246 48.217520 384.25 17.597395
... ... ... ... ... ... ... ... ... ...
199 Uganda UGA 5.82250 25.941461 132.892684 2.014349 50.099485 366.50 38.865339
212 Yemen, Rep. YEM 4.22575 36.819337 207.949700 1.417836 44.470076 399.75 26.246608
213 South Africa ZAF 2.37525 345.209888 1123.142656 4.241441 48.516298 143.75 54.177209
214 Zambia ZMB 5.39425 24.280990 185.556359 2.687290 49.934484 233.75 15.633220
215 Zimbabwe ZWE 3.94300 15.495514 115.519881 2.695188 49.529875 398.00 15.420964

72 rows × 9 columns

Here we created the Boolean Series in two steps, and then used it to do the indexing. It’s more common to create the Boolean Series with an expression inside the indexing brackets:

# DIBS, creating the BS with an expression inside the indexing brackets.
gender_data[gender_data['mat_mort_ratio'] > 100]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
13 Burundi BDI 5.99225 2.876978 59.693830 4.433749 50.481971 747.25 9.907015
15 Benin BEN 4.80675 8.778151 83.726190 2.206916 47.211127 417.50 10.293715
16 Burkina Faso BFA 5.56450 11.753054 87.143009 3.021246 48.217520 384.25 17.597395
... ... ... ... ... ... ... ... ... ...
199 Uganda UGA 5.82250 25.941461 132.892684 2.014349 50.099485 366.50 38.865339
212 Yemen, Rep. YEM 4.22575 36.819337 207.949700 1.417836 44.470076 399.75 26.246608
213 South Africa ZAF 2.37525 345.209888 1123.142656 4.241441 48.516298 143.75 54.177209
214 Zambia ZMB 5.39425 24.280990 185.556359 2.687290 49.934484 233.75 15.633220
215 Zimbabwe ZWE 3.94300 15.495514 115.519881 2.695188 49.529875 398.00 15.420964

72 rows × 9 columns

Notice that this is exactly the same as the indexing above; Python first sees and evaluates the expression inside the square brackets: gender_data['mat_mort_ratio'] > 100, to make the Boolean Series, and then passes the resulting series to the data frame indexing machinery to select the rows.

Indirect indexing#

The rest of this page covers more sophisticated indexing in Pandas - indirect indexing.

Why don’t we use the term direct indexing instead of simple indexing? Because there are various other things you can put inside the square brackets using direct indexing. We won’t go into other uses of direct indexing here, because we strongly recommend that you restrict yourself to DIBS and DICL for direct indexing — because other uses of direct indexing can quickly become confusing and ambiguous, and because you can do all the indexing you need using the indirect indexing techniques in this page. As you will see, indirect indexing uses the .loc and .ilocattributes of the DataFrame.

Here is the rule we propose.

The Golden Rule of Pandas Indexing

Only use Boolean Series or column labels for direct indexing. For any other type of indexing, use indirect indexing with .loc and .iloc.

We do not just advocate this rule for beginners. We (your humble authors) also stick to this rule, not only in this textbook, but in our daily work. By sticking to this rule, you will avoid some nasty and confusing errors, and you will find it easier to think clearly about indexing.

Indirect indexing#

To understand indirect indexing, we return to the index of data frames and Series. The index contains the row labels for a data frame or series.

We start with our usual imports:

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

We return to a version of our first data frame, containing statistics on development and gender from the World Bank.

This is the gender_stats.csv file. See the gender stats data page for some more detail.

Download that file to the same directory as this notebook, if you are running on your own computer.

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Show the first 10 rows
gender_data.head(10)
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
0 Aruba ABW 1.66325 NaN NaN NaN 48.721939 NaN 0.103744
1 Afghanistan AFG 4.95450 19.961015 161.138034 2.834598 40.109708 444.00 32.715838
2 Angola AGO 6.12300 111.936542 254.747970 2.447546 NaN 501.25 26.937545
3 Albania ALB 1.76925 12.327586 574.202694 2.836021 47.201082 29.25 2.888280
4 Andorra AND NaN 3.197538 4421.224933 7.260281 47.123345 NaN 0.079547
5 United Arab Emirates ARE 1.79300 375.027082 2202.407569 2.581168 48.789260 6.00 9.080299
6 Argentina ARG 2.32800 550.980968 1148.256142 2.782216 48.915810 53.75 42.976675
7 Armenia ARM 1.54550 10.885362 348.663884 1.916016 46.782180 27.25 2.904683
8 American Samoa ASM NaN 0.640500 NaN NaN NaN NaN 0.055422
9 Antigua and Barbuda ATG 2.08200 1.298213 1152.493656 3.676514 48.291463 NaN 0.098872

As you know, the names at the top, in bold typeface, are the names of the columns. We can see these names using the columns attribute of the data frame:

gender_data.columns
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

These can also be called the column labels.

Here is a summary of the meaning of most of the columns:

  • fert_rate: Fertility rate, total (births per woman).

  • gdp_us_billion: GDP (in current US $ billions).

  • health_exp_per_cap: Health expenditure per capita, PPP (constant 2011 international $).

  • health_exp_pub: Health expenditure, public (% of GDP).

  • prim_ed_girls: Primary education, pupils (% female).

  • mat_mort_ratio: Maternal mortality ratio (modeled estimate, per 100,000 live births).

  • population: Population, total (millions).

We have not paid attention so far to the numbers at the left of the table display, also in bold typeface.

In this case, they are ascending integers, starting at 0, with value 0 for the first row, 1 for the second row, and so on.

These numbers are the index of the data frame. The index contains the row labels.

We can see the row labels / index with the index attribute of the data frame.

gender_data.index
RangeIndex(start=0, stop=216, step=1)

This is saying we have a simple index, that is like a np.arange, starting at 0, and continuing in steps of 1 up to, but not including 216.

So far, the index looks like row numbers, with 0 for the first row, 1 for the second row. We will soon see that the index is more than that.

For example, let us look at the last 15 rows. We can get these with the tail method of the data frame.

last_15 = gender_data.tail(15)
last_15
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
201 Uruguay URY 2.027000 54.345132 1721.507752 6.044403 48.295555 15.50 3.419977
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
203 Uzbekistan UZB 2.372750 61.340649 334.476754 3.118842 48.387434 37.00 30.784500
204 St. Vincent and the Grenadines VCT 1.986000 0.730107 775.803386 4.365757 48.536415 45.75 0.109421
205 Venezuela, RB VEN 2.378250 376.146268 896.815314 1.587088 48.400934 97.00 30.734524
206 British Virgin Islands VGB NaN NaN NaN NaN 47.581520 NaN 0.029585
207 Virgin Islands (U.S.) VIR 1.760000 3.812000 NaN NaN NaN NaN 0.104141
208 Vietnam VNM 1.959500 181.820736 368.374550 3.779501 48.021053 54.75 90.742400
209 Vanuatu VUT 3.364750 0.782876 125.568712 3.689874 47.301617 82.50 0.258896
210 Samoa WSM 4.118250 0.799887 366.353096 5.697059 48.350049 54.75 0.192225
211 Kosovo XKX 2.142500 6.804620 NaN NaN NaN NaN 1.813820
212 Yemen, Rep. YEM 4.225750 36.819337 207.949700 1.417836 44.470076 399.75 26.246608
213 South Africa ZAF 2.375250 345.209888 1123.142656 4.241441 48.516298 143.75 54.177209
214 Zambia ZMB 5.394250 24.280990 185.556359 2.687290 49.934484 233.75 15.633220
215 Zimbabwe ZWE 3.943000 15.495514 115.519881 2.695188 49.529875 398.00 15.420964

Here is the index of this new data frame.

last_15.index
RangeIndex(start=201, stop=216, step=1)

Notice that the first row of this new data frame no longer has the label 0 - it has label 201. These numbers are the labels from the original data frame, from which we selected the rows.

Notice too that the second row in this table is the USA, and it has the label \202.

Now imagine that I sort the original data frame by Gross Domestic Product, and select the top 15 rows. By doing this, I select the top 15 richest countries.

# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 15 rows.
richest_15 = df_by_gdp.head(15)
richest_15
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
35 China CHN 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
26 Brazil BRA 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
88 India IND 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
164 Russian Federation RUS 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
104 Korea, Rep. KOR 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
58 Spain ESP 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
124 Mexico MEX 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450

Now notice that the index values have followed their respective rows. The row for the USA is now first, but it has kept its original index value (row label) - 202. The index value identifies the row.

Here is the index for this small table:

richest_15.index
Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Now let us say that I wanted to get the eleventh row of this table.

I would be tempted to do something like richest_15[10] - but then - what would Pandas give me? Does 10 mean the row label for “Australia”? Or does it mean the eleventh row of the table, as it would for an array, and therefore the row for “Canada”? That is, will the indexing using the labels (index values)? Or will it use the positions (as for indexing in Numpy arrays)?

The answer is, it is ambiguous, and when things are ambiguous, Pandas has to guess. That’s a bad idea, because then we have to guess what Pandas has guessed, and it is starting to get really confusing.

For that reason, we careful Pandas users do not index rows with code like richest_15[10]. When we are not using Booleans to select rows, we always use one of the two indexing attributes that Pandas provides. Each does a different type of indexing.

  • The loc (label locate) attribute indexes using the labels (row index).

  • The iloc (position locate) attribute indexes using the positions.

Here they are in action:

# Get the row with label 10
richest_15.loc[10]
country_name            Australia
country_code                  AUS
fert_rate                  1.8615
gdp_us_billion        1422.994116
health_exp_per_cap    4256.058988
health_exp_pub           6.292381
prim_ed_girls           48.576707
mat_mort_ratio                6.0
population               23.44456
Name: 10, dtype: object
# Get the row at position / offset 10
richest_15.iloc[10]
country_name               Canada
country_code                  CAN
fert_rate                  1.6003
gdp_us_billion        1708.473627
health_exp_per_cap    4616.539134
health_exp_pub           7.546247
prim_ed_girls           48.808926
mat_mort_ratio               7.25
population              35.517119
Name: 32, dtype: object

In more details, what is happening in the first cell above is that richest_15.loc gives us a special value (AKA object) that:

  • knows about the richest_15 data frame;

  • will give us rows and columns, if we use [ something ] on it.

So, read richest_15.loc[10] as:

  • Get the richest_15 value (the data frame);

  • Get the loc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at labels;

  • Use this object to fetch the row labeled 10, by using indexing ([10])

Likewise, read richest_15.iloc[10] as:

  • Get the richest_15 value (the data frame);

  • Get the iloc attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at positions (offsets);

  • Use this object to fetch the row at offset 10, by using indexing ([10]).

When you select rows and columns from a data frame, consider carefully what you want.

Do you want to select rows and columns by label (by index, in the rows). Then you want .loc.

Do you want to select rows and columns by position (offset). Then you want .iloc.

indexing and slices#

.loc indexing gets you rows and columns by label.

One form of indexing you have seen for arrays, is labeling by slice, to get a sequence of values.

A slice is an expression with a colon (:) in it somewhere. Here’s slice indexing into an array:

my_array = np.array(['one', 'two', 'three', 'four', 'five'])
# Give array of values from position 2 up to, not including position 4
my_array[2:4]
array(['three', 'four'], dtype='<U5')

The same sort of indexing works using both .iloc and .loc. For .iloc, the parallel is obvious - because we specify positions in exactly the same way:

# Rows from position 2, up to, not including, position 4
richest_15.iloc[2:4]
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
97 Japan JPN 1.43 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.45 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645

You can also do this kind of indexing with .loc, but it works differently. To make this clear, let’s start with .loc indexing where the row labels are strings. We set the row labels to be the country codes.

richest_15_with_code = richest_15.set_index('country_code')
richest_15_with_code
country_name fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
country_code
USA United States 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
CHN China 1.558750 10182.790479 657.748859 3.015530 46.297964 28.75 1364.446000
JPN Japan 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
DEU Germany 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
GBR United Kingdom 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
FRA France 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
BRA Brazil 1.795250 2198.765606 1303.199104 3.773473 47.784577 49.50 204.159544
IND India 2.449250 2019.005411 241.572477 1.292666 49.497234 185.25 1293.742537
ITA Italy 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
RUS Russian Federation 1.724500 1822.691700 1755.506635 3.731354 48.968070 25.25 143.793504
CAN Canada 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
AUS Australia 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
KOR Korea, Rep. 1.232000 1346.751162 2385.447251 3.915606 48.023388 12.00 50.727212
ESP Spain 1.307500 1299.724261 2963.832825 6.545739 48.722231 5.00 46.553128
MEX Mexico 2.257000 1188.802780 1081.208948 3.225839 48.906296 40.00 124.203450

As we would expect, we will get an error if we try to use numbers for .loc row indexing, because .loc indexing only uses labels, and the labels are now all strings:

richest_15_with_code.loc[2:4]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 richest_15_with_code.loc[2:4]

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
   6790 original_label = label
   6792 # For datetime indices label may be a string that has to be converted
   6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
   6796 # we need to look up the label
   6797 try:

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6727, in Index._maybe_cast_slice_bound(self, label, side)
   6725 # reject them, if index does not contain label
   6726 if (is_float(label) or is_integer(label)) and label not in self:
-> 6727     self._raise_invalid_indexer("slice", label)
   6729 return label

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:4301, in Index._raise_invalid_indexer(self, form, key, reraise)
   4299 if reraise is not lib.no_default:
   4300     raise TypeError(msg) from reraise
-> 4301 raise TypeError(msg)

TypeError: cannot do slice indexing on Index with these indexers [2] of type int

In fact we can use slices, but the value in front of the colon (if any) is the label of the first row we want, and the value after the colon is the label of the last row we want.

Notice that this is different from position indexing for arrays, or using .iloc. In those cases, the value after the colon gives the stop position. We return values from positions up to but not including the stop position. As a reminder:

# From position 2 up to, but not including, position 4
richest_15_with_code.iloc[2:4]
country_name fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
country_code
JPN Japan 1.43 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
DEU Germany 1.45 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645

Now consider this .loc indexing with a slice. Here the value after the colon is not a stop value, but an end label. We do return the row corresponding to the end label:

# All countries starting at label 'JPN' and ending (including)
# row labeled 'FRA'
richest_15_with_code.loc['JPN':'FRA']
country_name fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
country_code
JPN Japan 1.4300 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
DEU Germany 1.4500 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
GBR United Kingdom 1.8425 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
FRA France 2.0050 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099

For reflection - why did Pandas choose to make label indexing work this way? Have a look at the example above.

Loc and iloc indexing work on columns too#

Consider this. We use .loc and the label for the row, followed by the label for the column.

richest_15.loc[10, 'gdp_us_billion']
1422.994116449394

This selects the row labeled 10 (the USA row) and returns the value from the gdp_us_billion column.

Conversely, iloc only understands positions, and so, integers. It does not understand or use labels:

# We can't specify columns by label, using iloc.
richest_15.iloc[10, 'gdp_us_billion']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
    965 try:
--> 966     self._validate_key(k, i)
    967 except ValueError as err:

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1614, in _iLocIndexer._validate_key(self, key, axis)
   1613 else:
-> 1614     raise ValueError(f"Can only index by location with a [{self._valid_types}]")

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[26], line 2
      1 # We can't specify columns by label, using iloc.
----> 2 richest_15.iloc[10, 'gdp_us_billion']

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
   1689 def _getitem_tuple(self, tup: tuple):
-> 1690     tup = self._validate_tuple_indexer(tup)
   1691     with suppress(IndexingError):
   1692         return self._getitem_lowerdim(tup)

File /opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pandas/core/indexing.py:968, in _LocationIndexer._validate_tuple_indexer(self, key)
    966         self._validate_key(k, i)
    967     except ValueError as err:
--> 968         raise ValueError(
    969             "Location based indexing can only have "
    970             f"[{self._valid_types}] types"
    971         ) from err
    972 return key

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Notice the somewhat informative error right at the bottom of the error message.

iloc can get columns by their position. For example, fertility_rate is the third column, and so we can get the fertility rate by asking iloc for the eleventh row, third column, like so:

richest_15.iloc[10, 2]
1.6002999999999998

This is the value from the row at offset 10 and the column at offset 2.

Series have an index too#

Data frames have columns (and column labels), and rows (with row labels — the index).

Series also have element labels — they also have an index.

Consider this column, of the fertility rates of the richest countries:

# Get the fert_rate column from the data frame
rich_ferts = richest_15['fert_rate']
rich_ferts.head()
202    1.860875
35     1.558750
97     1.430000
49     1.450000
67     1.842500
Name: fert_rate, dtype: float64

As y’all know, this is a Series:

type(rich_ferts)
pandas.core.series.Series

You probably also spotted the numbers to the left of the values. Yes, the row labels have come across from the data frame into the Series. We can see that the first element has row label 202, and we remember that this is the row label (index value) for the USA.

As for the data frames, here is the index for the series:

rich_ferts.index
Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')

Like data frames, we can index by label, with loc, and by position, with iloc:

# The value for row labeled 10 - Australia.
rich_ferts.loc[10]
1.8615
# The value for the row at position 10 - Canada
rich_ferts.iloc[10]
1.6002999999999998

Rows from data frames are Series#

Here we select the row at position 0 (the first row):

first_row = richest_15.iloc[0]
first_row
country_name          United States
country_code                    USA
fert_rate                  1.860875
gdp_us_billion           17369.1246
health_exp_per_cap      9060.068657
health_exp_pub             8.121961
prim_ed_girls              48.75883
mat_mort_ratio                 14.0
population               318.558175
Name: 202, dtype: object

This is a Series:

type(first_row)
pandas.core.series.Series

Notice that the Series has 9 elements, one for each column in the data frame.

As you remember, we can see the number of rows and columns in a data frame with:

richest_15.shape
(15, 9)

Remember, the columns have names, which we can also call labels.

When we select a row out of a data frame, we have one element per column, and the labels for the columns now become the labels for the rows.

first_row.index
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
       'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
       'mat_mort_ratio', 'population'],
      dtype='object')

As for any other Series, we can get elements by label, with loc:

first_row.loc['fert_rate']
1.860875

We can also get elements by position (offset), with iloc:

first_row.iloc[0]
'United States'

You can index without loc and iloc, but please don’t, except …#

In the pages following the data frame introduction, you saw that we can index into data frame directly, using [ something ] after the data frame value, without using loc or iloc. Call this direct indexing. This is a very reasonable thing to do if:

  • the something is a Boolean Series (DIBS). This is not ambiguous.

  • the something is a column label or list of labels (DICL). This isn’t ambiguous either.

Both of these are entirely reasonable uses of direct indexing:

# Direct indexing with a Boolean Series (DIBS).
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
is_big_spender = richest_15['health_exp_per_cap'] > 3000  # Boolean Series
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560
# Direct indexing with column labels (DICL).
# Select the 'country_code' column.
country_code = richest_15['country_code']
country_code
202    USA
35     CHN
97     JPN
49     DEU
67     GBR
63     FRA
26     BRA
88     IND
94     ITA
164    RUS
32     CAN
10     AUS
104    KOR
58     ESP
124    MEX
Name: country_code, dtype: object

Those uses of direct indexing are entirely sensible. Other uses of direct indexing are not sensible, and we strongly encourage you not to use them.

For example, Pandas will also allow you to this, with the Series we just made:

# Nooooooo !
country_code[10]
'AUS'

Please don’t do that. This is ambiguous. Did you really mean the country with label 10? Or did you mean the country at position 10? Pandas will guess for you. Don’t let it. If you are not selecting rows / elements with Booleans, always do this:

# By label
country_code.loc[10]
'AUS'

or this:

# By position
country_code.iloc[10]
'CAN'

You can use Boolean indexing with .loc#

It is often convenient to use direct indexing with Booleans, as we have just seen:

# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_15['health_exp_per_cap'] > 3000
is_big_spender
202     True
35     False
97      True
49      True
67      True
63      True
26     False
88     False
94      True
164    False
32      True
10      True
104    False
58     False
124    False
Name: health_exp_per_cap, dtype: bool
# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name country_code fert_rate gdp_us_billion health_exp_per_cap health_exp_pub prim_ed_girls mat_mort_ratio population
202 United States USA 1.860875 17369.124600 9060.068657 8.121961 48.758830 14.00 318.558175
97 Japan JPN 1.430000 5106.024760 3687.126279 8.496074 48.744199 5.75 127.297102
49 Germany DEU 1.450000 3601.226158 4909.659884 8.542615 48.568695 6.25 81.281645
67 United Kingdom GBR 1.842500 2768.864417 3357.983675 7.720655 48.791809 9.25 64.641557
63 France FRA 2.005000 2647.649725 4387.835406 8.920420 48.772050 8.75 66.302099
94 Italy ITA 1.390000 2005.983980 3266.984094 6.984374 48.407573 4.00 60.378795
32 Canada CAN 1.600300 1708.473627 4616.539134 7.546247 48.808926 7.25 35.517119
10 Australia AUS 1.861500 1422.994116 4256.058988 6.292381 48.576707 6.00 23.444560

We can also use Booleans with loc indexing. This allows us to do things such as selecting rows with Booleans, and columns with labels, in one shot, like this:

# Boolean indexing for rows, with "loc"
# Get "country_name" column values for big health spenders:
big_spender_names = richest_15.loc[is_big_spender, 'country_name']
big_spender_names
202     United States
97              Japan
49            Germany
67     United Kingdom
63             France
94              Italy
32             Canada
10          Australia
Name: country_name, dtype: object

In fact you can use some Boolean sequences with iloc as well, but that needs a little more explanation.