Indexing in Pandas#
Way back in the basic row indexing, we showed row selection by indexing into data frames with Boolean vectors, such as Pandas Series. We called this direct indexing with Boolean Series (DIBS). We also covered selecting columns by indexing into data frames with column labels. We called this direct indexing with column labels (DICL).
Simple indexing and the golden rule#
We will use the term simple indexing to cover both of DIBS and DICL.
Simple Indexing
Direct indexing with either Boolean Series (DIBS) or column labels (DICL).
Here follow some examples.
We start with our usual imports:
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
We return to a version of our first data frame, containing statistics on development and gender from the World Bank.
This is the gender_stats.csv
file.
See the gender stats data
page for
some more detail.
Download that file to the same directory as this notebook, if you are running on your own computer.
# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')
# Show the first 10 rows
gender_data.head(10)
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
5 | United Arab Emirates | ARE | 1.79300 | 375.027082 | 2202.407569 | 2.581168 | 48.789260 | 6.00 | 9.080299 |
6 | Argentina | ARG | 2.32800 | 550.980968 | 1148.256142 | 2.782216 | 48.915810 | 53.75 | 42.976675 |
7 | Armenia | ARM | 1.54550 | 10.885362 | 348.663884 | 1.916016 | 46.782180 | 27.25 | 2.904683 |
8 | American Samoa | ASM | NaN | 0.640500 | NaN | NaN | NaN | NaN | 0.055422 |
9 | Antigua and Barbuda | ATG | 2.08200 | 1.298213 | 1152.493656 | 3.676514 | 48.291463 | NaN | 0.098872 |
Direct Indexing with Column Labels (DICL)#
Direct Indexing with Column Labels (DICL) happens when we do direct indexing (square brackets immediately follow the data frame variable), where one or more column labels go inside the square brackets.
For example, here we use DICL to get the values for one column as a Pandas Series:
# DICL with a single column name
gender_data['mat_mort_ratio']
0 NaN
1 444.00
2 501.25
3 29.25
4 NaN
...
211 NaN
212 399.75
213 143.75
214 233.75
215 398.00
Name: mat_mort_ratio, Length: 216, dtype: float64
We can use more than one column label by putting the column labels in a list:
# DICL with a two column names in a list.
cols_we_want = ['gdp_us_billion', 'mat_mort_ratio']
gender_data[cols_we_want]
gdp_us_billion | mat_mort_ratio | |
---|---|---|
0 | NaN | NaN |
1 | 19.961015 | 444.00 |
2 | 111.936542 | 501.25 |
3 | 12.327586 | 29.25 |
4 | 3.197538 | NaN |
... | ... | ... |
211 | 6.804620 | NaN |
212 | 36.819337 | 399.75 |
213 | 345.209888 | 143.75 |
214 | 24.280990 | 233.75 |
215 | 15.495514 | 398.00 |
216 rows × 2 columns
# DICL with a two column names in a list.
# Here we create the list inside the indexing brackets, instead
# of first creating it and storing it in a variable.
# Notice the two uses of square brackets - following "gender_data" to
# do indexing, and around the column labels to create a list.
gender_data[['gdp_us_billion', 'mat_mort_ratio']]
gdp_us_billion | mat_mort_ratio | |
---|---|---|
0 | NaN | NaN |
1 | 19.961015 | 444.00 |
2 | 111.936542 | 501.25 |
3 | 12.327586 | 29.25 |
4 | 3.197538 | NaN |
... | ... | ... |
211 | 6.804620 | NaN |
212 | 36.819337 | 399.75 |
213 | 345.209888 | 143.75 |
214 | 24.280990 | 233.75 |
215 | 15.495514 | 398.00 |
216 rows × 2 columns
DICL gives a single column (Series) with a single label, or a subset of columns (a new data frame) when we pass a list of column labels.
Direct Indexing with Boolean Series (DIBS)#
Direct Indexing with a Boolean Series (DIBS) happens when we put a Boolean Series within the indexing brackets.
DIBS selects a subset of the rows. Specifically, it selects the rows where the Boolean Series has True. It returns a subset of the original dataframe.
For example, let us say we wanted to select the rows in the gender_data
data frame where the mat_mort_ratio
value was greater than 100.
We first create a Boolean Series, with True for rows where mat_mort_ratio
is
greater than 100, and False otherwise.
# DICL to get the mat_mort_ratio column as a Series.
mmr = gender_data['mat_mort_ratio']
# Make a Boolean Series with True for > 100, False otherwise.
mmr_gt_100 = mmr > 100
We index with the Boolean Series (DIBS), to select only those rows where mat_mort_ratio
was greater than 100. Notice that you get a data frame back.
# DIBS
gender_data[mmr_gt_100]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
13 | Burundi | BDI | 5.99225 | 2.876978 | 59.693830 | 4.433749 | 50.481971 | 747.25 | 9.907015 |
15 | Benin | BEN | 4.80675 | 8.778151 | 83.726190 | 2.206916 | 47.211127 | 417.50 | 10.293715 |
16 | Burkina Faso | BFA | 5.56450 | 11.753054 | 87.143009 | 3.021246 | 48.217520 | 384.25 | 17.597395 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
199 | Uganda | UGA | 5.82250 | 25.941461 | 132.892684 | 2.014349 | 50.099485 | 366.50 | 38.865339 |
212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
72 rows × 9 columns
Here we created the Boolean Series in two steps, and then used it to do the indexing. It’s more common to create the Boolean Series with an expression inside the indexing brackets:
# DIBS, creating the BS with an expression inside the indexing brackets.
gender_data[gender_data['mat_mort_ratio'] > 100]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
13 | Burundi | BDI | 5.99225 | 2.876978 | 59.693830 | 4.433749 | 50.481971 | 747.25 | 9.907015 |
15 | Benin | BEN | 4.80675 | 8.778151 | 83.726190 | 2.206916 | 47.211127 | 417.50 | 10.293715 |
16 | Burkina Faso | BFA | 5.56450 | 11.753054 | 87.143009 | 3.021246 | 48.217520 | 384.25 | 17.597395 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
199 | Uganda | UGA | 5.82250 | 25.941461 | 132.892684 | 2.014349 | 50.099485 | 366.50 | 38.865339 |
212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
72 rows × 9 columns
Notice that this is exactly the same as the indexing above; Python first sees
and evaluates the expression inside the square brackets:
gender_data['mat_mort_ratio'] > 100
, to make the Boolean Series, and then
passes the resulting series to the data frame indexing machinery to select the
rows.
Indirect indexing#
The rest of this page covers more sophisticated indexing in Pandas - indirect indexing.
Why don’t we use the term direct indexing instead of simple indexing?
Because there are various other things you can put inside the square brackets
using direct indexing. We won’t go into other uses of direct indexing here,
because we strongly recommend that you restrict yourself to DIBS and DICL for
direct indexing — because other uses of direct indexing can quickly become
confusing and ambiguous, and because you can do all the indexing you need using
the indirect indexing techniques in this page. As you will see, indirect
indexing uses the .loc
and .iloc
attributes of the DataFrame.
Here is the rule we propose.
The Golden Rule of Pandas Indexing
Only use Boolean Series or column labels for direct indexing. For any other type of indexing, use indirect indexing with
.loc
and.iloc
.
We do not just advocate this rule for beginners. We (your humble authors) also stick to this rule, not only in this textbook, but in our daily work. By sticking to this rule, you will avoid some nasty and confusing errors, and you will find it easier to think clearly about indexing.
Indirect indexing#
To understand indirect indexing, we return to the index of data frames and Series. The index contains the row labels for a data frame or series.
We start with our usual imports:
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
We return to a version of our first data frame, containing statistics on development and gender from the World Bank.
This is the gender_stats.csv
file.
See the gender stats data
page for
some more detail.
Download that file to the same directory as this notebook, if you are running on your own computer.
# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')
# Show the first 10 rows
gender_data.head(10)
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
5 | United Arab Emirates | ARE | 1.79300 | 375.027082 | 2202.407569 | 2.581168 | 48.789260 | 6.00 | 9.080299 |
6 | Argentina | ARG | 2.32800 | 550.980968 | 1148.256142 | 2.782216 | 48.915810 | 53.75 | 42.976675 |
7 | Armenia | ARM | 1.54550 | 10.885362 | 348.663884 | 1.916016 | 46.782180 | 27.25 | 2.904683 |
8 | American Samoa | ASM | NaN | 0.640500 | NaN | NaN | NaN | NaN | 0.055422 |
9 | Antigua and Barbuda | ATG | 2.08200 | 1.298213 | 1152.493656 | 3.676514 | 48.291463 | NaN | 0.098872 |
As you know, the names at the top, in bold typeface, are the names of the
columns. We can see these names using the columns
attribute of the data
frame:
gender_data.columns
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
'mat_mort_ratio', 'population'],
dtype='object')
These can also be called the column labels.
Here is a summary of the meaning of most of the columns:
fert_rate
: Fertility rate, total (births per woman).gdp_us_billion
: GDP (in current US $ billions).health_exp_per_cap
: Health expenditure per capita, PPP (constant 2011 international $).health_exp_pub
: Health expenditure, public (% of GDP).prim_ed_girls
: Primary education, pupils (% female).mat_mort_ratio
: Maternal mortality ratio (modeled estimate, per 100,000 live births).population
: Population, total (millions).
We have not paid attention so far to the numbers at the left of the table display, also in bold typeface.
In this case, they are ascending integers, starting at 0, with value 0 for the first row, 1 for the second row, and so on.
These numbers are the index of the data frame. The index contains the row labels.
We can see the row labels / index with the index
attribute of the data frame.
gender_data.index
RangeIndex(start=0, stop=216, step=1)
This is saying we have a simple index, that is like a np.arange
, starting at
0, and continuing in steps of 1 up to, but not including 216.
So far, the index looks like row numbers, with 0 for the first row, 1 for the second row. We will soon see that the index is more than that.
For example, let us look at the last 15 rows. We can get these with the tail
method of the data frame.
last_15 = gender_data.tail(15)
last_15
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
201 | Uruguay | URY | 2.027000 | 54.345132 | 1721.507752 | 6.044403 | 48.295555 | 15.50 | 3.419977 |
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
203 | Uzbekistan | UZB | 2.372750 | 61.340649 | 334.476754 | 3.118842 | 48.387434 | 37.00 | 30.784500 |
204 | St. Vincent and the Grenadines | VCT | 1.986000 | 0.730107 | 775.803386 | 4.365757 | 48.536415 | 45.75 | 0.109421 |
205 | Venezuela, RB | VEN | 2.378250 | 376.146268 | 896.815314 | 1.587088 | 48.400934 | 97.00 | 30.734524 |
206 | British Virgin Islands | VGB | NaN | NaN | NaN | NaN | 47.581520 | NaN | 0.029585 |
207 | Virgin Islands (U.S.) | VIR | 1.760000 | 3.812000 | NaN | NaN | NaN | NaN | 0.104141 |
208 | Vietnam | VNM | 1.959500 | 181.820736 | 368.374550 | 3.779501 | 48.021053 | 54.75 | 90.742400 |
209 | Vanuatu | VUT | 3.364750 | 0.782876 | 125.568712 | 3.689874 | 47.301617 | 82.50 | 0.258896 |
210 | Samoa | WSM | 4.118250 | 0.799887 | 366.353096 | 5.697059 | 48.350049 | 54.75 | 0.192225 |
211 | Kosovo | XKX | 2.142500 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 4.225750 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 2.375250 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 5.394250 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 3.943000 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
Here is the index
of this new data frame.
last_15.index
RangeIndex(start=201, stop=216, step=1)
Notice that the first row of this new data frame no longer has the label 0 - it has label 201. These numbers are the labels from the original data frame, from which we selected the rows.
Notice too that the second row in this table is the USA, and it has the label \202.
Now imagine that I sort the original data frame by Gross Domestic Product, and select the top 15 rows. By doing this, I select the top 15 richest countries.
# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 15 rows.
richest_15 = df_by_gdp.head(15)
richest_15
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
35 | China | CHN | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
26 | Brazil | BRA | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
88 | India | IND | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
164 | Russian Federation | RUS | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
104 | Korea, Rep. | KOR | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
58 | Spain | ESP | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
124 | Mexico | MEX | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
Now notice that the index values have followed their respective rows. The row for the USA is now first, but it has kept its original index value (row label) - 202. The index value identifies the row.
Here is the index for this small table:
richest_15.index
Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')
Now let us say that I wanted to get the eleventh row of this table.
I would be tempted to do something like richest_15[10]
- but then - what
would Pandas give me? Does 10
mean the row label for “Australia”? Or does
it mean the eleventh row of the table, as it would for an array, and therefore
the row for “Canada”? That is, will the indexing using the labels (index
values)? Or will it use the positions (as for indexing in Numpy arrays)?
The answer is, it is ambiguous, and when things are ambiguous, Pandas has to guess. That’s a bad idea, because then we have to guess what Pandas has guessed, and it is starting to get really confusing.
For that reason, we careful Pandas users do not index rows with code like
richest_15[10]
. When we are not using Booleans to select rows, we always use
one of the two indexing attributes that Pandas provides. Each does a different
type of indexing.
The
loc
(label locate) attribute indexes using the labels (row index).The
iloc
(position locate) attribute indexes using the positions.
Here they are in action:
# Get the row with label 10
richest_15.loc[10]
country_name Australia
country_code AUS
fert_rate 1.8615
gdp_us_billion 1422.994116
health_exp_per_cap 4256.058988
health_exp_pub 6.292381
prim_ed_girls 48.576707
mat_mort_ratio 6.0
population 23.44456
Name: 10, dtype: object
# Get the row at position / offset 10
richest_15.iloc[10]
country_name Canada
country_code CAN
fert_rate 1.6003
gdp_us_billion 1708.473627
health_exp_per_cap 4616.539134
health_exp_pub 7.546247
prim_ed_girls 48.808926
mat_mort_ratio 7.25
population 35.517119
Name: 32, dtype: object
In more details, what is happening in the first cell above is that
richest_15.loc
gives us a special value (AKA object) that:
knows about the
richest_15
data frame;will give us rows and columns, if we use
[ something ]
on it.
So, read richest_15.loc[10]
as:
Get the
richest_15
value (the data frame);Get the
loc
attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at labels;Use this object to fetch the row labeled 10, by using indexing (
[10]
)
Likewise, read richest_15.iloc[10]
as:
Get the
richest_15
value (the data frame);Get the
iloc
attribute of this data frame. This evaluates to a special object (value) that can give us rows and columns from the data frame, by looking at positions (offsets);Use this object to fetch the row at offset 10, by using indexing (
[10]
).
When you select rows and columns from a data frame, consider carefully what you want.
Do you want to select rows and columns by label (by index, in the rows).
Then you want .loc
.
Do you want to select rows and columns by position (offset). Then you want
.iloc
.
indexing and slices#
.loc
indexing gets you rows and columns by label.
One form of indexing you have seen for arrays, is labeling by slice, to get a sequence of values.
A slice is an expression with a colon (:
) in it somewhere. Here’s slice
indexing into an array:
my_array = np.array(['one', 'two', 'three', 'four', 'five'])
# Give array of values from position 2 up to, not including position 4
my_array[2:4]
array(['three', 'four'], dtype='<U5')
The same sort of indexing works using both .iloc
and .loc
. For .iloc
,
the parallel is obvious - because we specify positions in exactly the same way:
# Rows from position 2, up to, not including, position 4
richest_15.iloc[2:4]
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
97 | Japan | JPN | 1.43 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.45 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
You can also do this kind of indexing with .loc
, but it works differently.
To make this clear, let’s start with .loc
indexing where the row labels are
strings. We set the row labels to be the country codes.
richest_15_with_code = richest_15.set_index('country_code')
richest_15_with_code
country_name | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|
country_code | ||||||||
USA | United States | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
CHN | China | 1.558750 | 10182.790479 | 657.748859 | 3.015530 | 46.297964 | 28.75 | 1364.446000 |
JPN | Japan | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
DEU | Germany | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
GBR | United Kingdom | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
FRA | France | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
BRA | Brazil | 1.795250 | 2198.765606 | 1303.199104 | 3.773473 | 47.784577 | 49.50 | 204.159544 |
IND | India | 2.449250 | 2019.005411 | 241.572477 | 1.292666 | 49.497234 | 185.25 | 1293.742537 |
ITA | Italy | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
RUS | Russian Federation | 1.724500 | 1822.691700 | 1755.506635 | 3.731354 | 48.968070 | 25.25 | 143.793504 |
CAN | Canada | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
AUS | Australia | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
KOR | Korea, Rep. | 1.232000 | 1346.751162 | 2385.447251 | 3.915606 | 48.023388 | 12.00 | 50.727212 |
ESP | Spain | 1.307500 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
MEX | Mexico | 2.257000 | 1188.802780 | 1081.208948 | 3.225839 | 48.906296 | 40.00 | 124.203450 |
As we would expect, we will get an error if we try to use numbers for .loc
row indexing, because .loc
indexing only uses labels, and the labels are now
all strings:
richest_15_with_code.loc[2:4]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 richest_15_with_code.loc[2:4]
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
6790 original_label = label
6792 # For datetime indices label may be a string that has to be converted
6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
6796 # we need to look up the label
6797 try:
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:6727, in Index._maybe_cast_slice_bound(self, label, side)
6725 # reject them, if index does not contain label
6726 if (is_float(label) or is_integer(label)) and label not in self:
-> 6727 self._raise_invalid_indexer("slice", label)
6729 return label
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:4301, in Index._raise_invalid_indexer(self, form, key, reraise)
4299 if reraise is not lib.no_default:
4300 raise TypeError(msg) from reraise
-> 4301 raise TypeError(msg)
TypeError: cannot do slice indexing on Index with these indexers [2] of type int
In fact we can use slices, but the value in front of the colon (if any) is the label of the first row we want, and the value after the colon is the label of the last row we want.
Notice that this is different from position indexing for arrays, or using
.iloc
. In those cases, the value after the colon gives the stop position.
We return values from positions up to but not including the stop position.
As a reminder:
# From position 2 up to, but not including, position 4
richest_15_with_code.iloc[2:4]
country_name | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|
country_code | ||||||||
JPN | Japan | 1.43 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
DEU | Germany | 1.45 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
Now consider this .loc
indexing with a slice. Here the value after the colon
is not a stop value, but an end label. We do return the row
corresponding to the end label:
# All countries starting at label 'JPN' and ending (including)
# row labeled 'FRA'
richest_15_with_code.loc['JPN':'FRA']
country_name | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|
country_code | ||||||||
JPN | Japan | 1.4300 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
DEU | Germany | 1.4500 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
GBR | United Kingdom | 1.8425 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
FRA | France | 2.0050 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
For reflection - why did Pandas choose to make label indexing work this way? Have a look at the example above.
Loc and iloc indexing work on columns too#
Consider this. We use .loc
and the label for the row, followed by the
label for the column.
richest_15.loc[10, 'gdp_us_billion']
1422.994116449394
This selects the row labeled 10 (the USA row) and returns the value from the
gdp_us_billion
column.
Conversely, iloc
only understands positions, and so, integers. It does not
understand or use labels:
# We can't specify columns by label, using iloc.
richest_15.iloc[10, 'gdp_us_billion']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
965 try:
--> 966 self._validate_key(k, i)
967 except ValueError as err:
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1614, in _iLocIndexer._validate_key(self, key, axis)
1613 else:
-> 1614 raise ValueError(f"Can only index by location with a [{self._valid_types}]")
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[26], line 2
1 # We can't specify columns by label, using iloc.
----> 2 richest_15.iloc[10, 'gdp_us_billion']
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
1689 def _getitem_tuple(self, tup: tuple):
-> 1690 tup = self._validate_tuple_indexer(tup)
1691 with suppress(IndexingError):
1692 return self._getitem_lowerdim(tup)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:968, in _LocationIndexer._validate_tuple_indexer(self, key)
966 self._validate_key(k, i)
967 except ValueError as err:
--> 968 raise ValueError(
969 "Location based indexing can only have "
970 f"[{self._valid_types}] types"
971 ) from err
972 return key
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Notice the somewhat informative error right at the bottom of the error message.
iloc
can get columns by their position. For example, fertility_rate
is the
third column, and so we can get the fertility rate by asking iloc
for the
eleventh row, third column, like so:
richest_15.iloc[10, 2]
1.6002999999999998
This is the value from the row at offset 10 and the column at offset 2.
Series have an index too#
Data frames have columns (and column labels), and rows (with row labels — the index).
Series also have element labels — they also have an index.
Consider this column, of the fertility rates of the richest countries:
# Get the fert_rate column from the data frame
rich_ferts = richest_15['fert_rate']
rich_ferts.head()
202 1.860875
35 1.558750
97 1.430000
49 1.450000
67 1.842500
Name: fert_rate, dtype: float64
As y’all know, this is a Series:
type(rich_ferts)
pandas.core.series.Series
You probably also spotted the numbers to the left of the values. Yes, the row labels have come across from the data frame into the Series. We can see that the first element has row label 202, and we remember that this is the row label (index value) for the USA.
As for the data frames, here is the index for the series:
rich_ferts.index
Index([202, 35, 97, 49, 67, 63, 26, 88, 94, 164, 32, 10, 104, 58, 124], dtype='int64')
Like data frames, we can index by label, with loc
, and by position, with
iloc
:
# The value for row labeled 10 - Australia.
rich_ferts.loc[10]
1.8615
# The value for the row at position 10 - Canada
rich_ferts.iloc[10]
1.6002999999999998
Rows from data frames are Series#
Here we select the row at position 0 (the first row):
first_row = richest_15.iloc[0]
first_row
country_name United States
country_code USA
fert_rate 1.860875
gdp_us_billion 17369.1246
health_exp_per_cap 9060.068657
health_exp_pub 8.121961
prim_ed_girls 48.75883
mat_mort_ratio 14.0
population 318.558175
Name: 202, dtype: object
This is a Series:
type(first_row)
pandas.core.series.Series
Notice that the Series has 9 elements, one for each column in the data frame.
As you remember, we can see the number of rows and columns in a data frame with:
richest_15.shape
(15, 9)
Remember, the columns have names, which we can also call labels.
When we select a row out of a data frame, we have one element per column, and the labels for the columns now become the labels for the rows.
first_row.index
Index(['country_name', 'country_code', 'fert_rate', 'gdp_us_billion',
'health_exp_per_cap', 'health_exp_pub', 'prim_ed_girls',
'mat_mort_ratio', 'population'],
dtype='object')
As for any other Series, we can get elements by label, with loc
:
first_row.loc['fert_rate']
1.860875
We can also get elements by position (offset), with iloc
:
first_row.iloc[0]
'United States'
You can index without loc and iloc, but please don’t, except …#
In the pages following the data frame
introduction, you saw that we can index into
data frame directly, using [ something ]
after the data frame value, without
using loc
or iloc
. Call this direct indexing. This is a very reasonable
thing to do if:
the
something
is a Boolean Series (DIBS). This is not ambiguous.the
something
is a column label or list of labels (DICL). This isn’t ambiguous either.
Both of these are entirely reasonable uses of direct indexing:
# Direct indexing with a Boolean Series (DIBS).
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
is_big_spender = richest_15['health_exp_per_cap'] > 3000 # Boolean Series
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
# Direct indexing with column labels (DICL).
# Select the 'country_code' column.
country_code = richest_15['country_code']
country_code
202 USA
35 CHN
97 JPN
49 DEU
67 GBR
63 FRA
26 BRA
88 IND
94 ITA
164 RUS
32 CAN
10 AUS
104 KOR
58 ESP
124 MEX
Name: country_code, dtype: object
Those uses of direct indexing are entirely sensible. Other uses of direct indexing are not sensible, and we strongly encourage you not to use them.
For example, Pandas will also allow you to this, with the Series we just made:
# Nooooooo !
country_code[10]
'AUS'
Please don’t do that. This is ambiguous. Did you really mean the country with label 10? Or did you mean the country at position 10? Pandas will guess for you. Don’t let it. If you are not selecting rows / elements with Booleans, always do this:
# By label
country_code.loc[10]
'AUS'
or this:
# By position
country_code.iloc[10]
'CAN'
You can use Boolean indexing with .loc
#
It is often convenient to use direct indexing with Booleans, as we have just seen:
# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_15['health_exp_per_cap'] > 3000
is_big_spender
202 True
35 False
97 True
49 True
67 True
63 True
26 False
88 False
94 True
164 False
32 True
10 True
104 False
58 False
124 False
Name: health_exp_per_cap, dtype: bool
# Direct Boolean indexing:
# Select rows where 'health_exp_per_cap' is greater than 3000 dollars.
big_health_spenders = richest_15[is_big_spender]
big_health_spenders
country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population | |
---|---|---|---|---|---|---|---|---|---|
202 | United States | USA | 1.860875 | 17369.124600 | 9060.068657 | 8.121961 | 48.758830 | 14.00 | 318.558175 |
97 | Japan | JPN | 1.430000 | 5106.024760 | 3687.126279 | 8.496074 | 48.744199 | 5.75 | 127.297102 |
49 | Germany | DEU | 1.450000 | 3601.226158 | 4909.659884 | 8.542615 | 48.568695 | 6.25 | 81.281645 |
67 | United Kingdom | GBR | 1.842500 | 2768.864417 | 3357.983675 | 7.720655 | 48.791809 | 9.25 | 64.641557 |
63 | France | FRA | 2.005000 | 2647.649725 | 4387.835406 | 8.920420 | 48.772050 | 8.75 | 66.302099 |
94 | Italy | ITA | 1.390000 | 2005.983980 | 3266.984094 | 6.984374 | 48.407573 | 4.00 | 60.378795 |
32 | Canada | CAN | 1.600300 | 1708.473627 | 4616.539134 | 7.546247 | 48.808926 | 7.25 | 35.517119 |
10 | Australia | AUS | 1.861500 | 1422.994116 | 4256.058988 | 6.292381 | 48.576707 | 6.00 | 23.444560 |
We can also use Booleans with loc
indexing. This allows us to do things such
as selecting rows with Booleans, and columns with labels, in one shot, like
this:
# Boolean indexing for rows, with "loc"
# Get "country_name" column values for big health spenders:
big_spender_names = richest_15.loc[is_big_spender, 'country_name']
big_spender_names
202 United States
97 Japan
49 Germany
67 United Kingdom
63 France
94 Italy
32 Canada
10 Australia
Name: country_name, dtype: object
In fact you can use some Boolean sequences with iloc
as well, but that needs
a little more explanation.