The doctrine of Pandas indexing#
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
# Read the CSV to a DataFrame.
ratings = pd.read_csv('rate_my_course.csv')
# Change the default numerical row labels to be the discipline name.
named_ratings = ratings.set_index('Discipline')
named_ratings.head()
Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|
Discipline | |||||
English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
The catechism#
A catechism is an exposition of doctrine. Here we exposit the doctrine of Pandas indexing. Pronounce catechism as kata-kism.
There are two routes to Pandas indexing:
Direct indexing (square brackets follow the value).
Indirect indexing (square brackets follow
.loc
or.iloc
.
Direct indexing - keep it Simple#
When using direct indexing you should restrict yourself to simple indexing. Simple indexing is the name we use when you use only one of two types of things within the square brackets:
Column labels (direct indexing with column labels - DICL).
A Boolean Series (direct indexing with Boolean Series - DIBS).
Direct indexing with Column Labels#
# DICL - one column label between square brackets gives a Series.
easiness = named_ratings['Easiness']
easiness
Discipline
English 3.162754
Mathematics 3.063322
Biology 2.710459
Psychology 3.316210
History 3.053803
...
Anatomy 2.863504
Earth Science 3.106727
Linguistics 3.309636
Mechanical Engineering 2.799135
Medicine 3.109118
Name: Easiness, Length: 75, dtype: float64
# DICL again - more than one column label gives a DataFrame
column_labels = ['Clarity', 'Easiness']
thinner_ratings = named_ratings[column_labels]
thinner_ratings
Clarity | Easiness | |
---|---|---|
Discipline | ||
English | 3.756147 | 3.162754 |
Mathematics | 3.487379 | 3.063322 |
Biology | 3.608331 | 2.710459 |
Psychology | 3.909520 | 3.316210 |
History | 3.788818 | 3.053803 |
... | ... | ... |
Anatomy | 3.932991 | 2.863504 |
Earth Science | 3.636182 | 3.106727 |
Linguistics | 3.749000 | 3.309636 |
Mechanical Engineering | 3.441923 | 2.799135 |
Medicine | 3.927255 | 3.109118 |
75 rows × 2 columns
Direct indexing with Boolean Series#
# Create a Boolean Series by doing a comparison on a Series
is_easy = easiness > 3.5
is_easy
Discipline
English False
Mathematics False
Biology False
Psychology False
History False
...
Anatomy False
Earth Science False
Linguistics False
Mechanical Engineering False
Medicine False
Name: Easiness, Length: 75, dtype: bool
# DIBS selects *rows*, giving a DataFrame with one row per True
# element in the Boolean Series
easy_ratings = named_ratings[is_easy]
easy_ratings
Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|
Discipline | |||||
Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
Kinesiology | 132 | 3.995000 | 3.972879 | 3.988712 | 3.541439 |
Indirect indexing - two indexing targets#
You can target (look at) the row and column labels using .loc
(think L for Label indexing).
You can target (look at) the row and column positions using .iloc
(think I for Integer indexing).
.loc
gives indirect indexing targeting labels#
# Select a single row by label, to get a Series.
named_ratings.loc['English']
Number of Professors 23343.000000
Clarity 3.756147
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: English, dtype: float64
# Select a single value by specifying the row and column label.
named_ratings.loc['English', 'Easiness']
3.16275414471149
# Select a column by label, asking for all rows (the ':').
named_ratings.loc[:, 'Easiness']
Discipline
English 3.162754
Mathematics 3.063322
Biology 2.710459
Psychology 3.316210
History 3.053803
...
Anatomy 2.863504
Earth Science 3.106727
Linguistics 3.309636
Mechanical Engineering 2.799135
Medicine 3.109118
Name: Easiness, Length: 75, dtype: float64
You can select values with Boolean Series using indirect indexing:
# Select rows where Easiness > 3.5 using .loc with Boolean Series.
named_ratings.loc[is_easy]
Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|
Discipline | |||||
Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
Kinesiology | 132 | 3.995000 | 3.972879 | 3.988712 | 3.541439 |
# Select Clarity column from rows where Easiness > 3.5
named_ratings.loc[is_easy, 'Clarity']
Discipline
Music 3.844509
Theater 3.876633
Physical Education 4.078698
Speech 4.133191
Child Development 3.950585
Reading 4.159392
Kinesiology 3.995000
Name: Clarity, dtype: float64
.iloc
gives indirect indexing targeting positions#
# Select a single row by position, to get a Series.
# The first row (by position) is English.
named_ratings.iloc[0]
Number of Professors 23343.000000
Clarity 3.756147
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: English, dtype: float64
# Select a single value by specifying the row and column positions.
# Easiness is the last column (position -1).
named_ratings.iloc[0, -1]
3.16275414471149
# Select a column by position, asking for all rows (the ':').
named_ratings.iloc[:, -1]
Discipline
English 3.162754
Mathematics 3.063322
Biology 2.710459
Psychology 3.316210
History 3.053803
...
Anatomy 2.863504
Earth Science 3.106727
Linguistics 3.309636
Mechanical Engineering 2.799135
Medicine 3.109118
Name: Easiness, Length: 75, dtype: float64
In fact, you can’t use Boolean Series for indexing with .iloc
, because Pandas
won’t let you use a thing with labels for iloc
indexing.
# You can't use Boolean Series for iloc indexing.
named_ratings.iloc[is_easy]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[15], line 2
1 # You can't use Boolean Series for iloc indexing.
----> 2 named_ratings.iloc[is_easy]
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1738, in _iLocIndexer._getitem_axis(self, key, axis)
1735 key = np.asarray(key)
1737 if com.is_bool_indexer(key):
-> 1738 self._validate_key(key, axis)
1739 return self._getbool_axis(key, axis=axis)
1741 # a list of integers
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1583, in _iLocIndexer._validate_key(self, key, axis)
1577 if key.index.inferred_type == "integer":
1578 raise NotImplementedError(
1579 "iLocation based boolean "
1580 "indexing on an integer type "
1581 "is not available"
1582 )
-> 1583 raise ValueError(
1584 "iLocation based boolean indexing cannot use "
1585 "an indexable as a mask"
1586 )
1587 return
1589 if isinstance(key, slice):
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
If you really want to do Boolean indexing with .iloc
, strip the labels (the
index
) off the thing you are using for indexing:
is_easy_arr = np.array(is_easy) # Strip off the labels.
named_ratings.iloc[is_easy_arr]
Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|
Discipline | |||||
Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
Kinesiology | 132 | 3.995000 | 3.972879 | 3.988712 | 3.541439 |
Trap! Alarm! The row labels can be integers#
Unless you tell it otherwise, Pandas will default to giving rows labels that are integers, rather than names. Therefore your row labels will often be integers. That can be confusing, because the labels can look like positions, when they are not.
# The default row labels are sequential integers.
ratings = pd.read_csv('rate_my_course.csv')
ratings
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
... | ... | ... | ... | ... | ... | ... |
70 | Anatomy | 117 | 3.932991 | 3.974701 | 3.954188 | 2.863504 |
71 | Earth Science | 110 | 3.636182 | 3.671364 | 3.655091 | 3.106727 |
72 | Linguistics | 110 | 3.749000 | 3.834545 | 3.798182 | 3.309636 |
73 | Mechanical Engineering | 104 | 3.441923 | 3.531154 | 3.489327 | 2.799135 |
74 | Medicine | 102 | 3.927255 | 3.934216 | 3.929118 | 3.109118 |
75 rows × 6 columns
The numbers are labels and not positions, but we can’t see that here, because the labels and the positions are the same. You can show the numbers are labels by selecting rows, or sorting the DataFrame. Let’s sort the DataFrame by ‘Easiness’ to demonstrate.
# Notice that row labels are not sequential any more.
# The labels stick to the row they started with.
ratings_by_easy = ratings.sort_values('Easiness')
ratings_by_easy
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
40 | Nursing | 641 | 3.613385 | 3.608081 | 3.613573 | 2.676069 |
33 | Finance | 1042 | 3.346536 | 3.443301 | 3.396823 | 2.693512 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
13 | Accounting | 3895 | 3.484249 | 3.572727 | 3.531315 | 2.784706 |
... | ... | ... | ... | ... | ... | ... |
31 | Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
61 | Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
36 | Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
75 rows × 6 columns
Distinguish between label and position using .loc
and .iloc
:
# The row *labeled* with the number 2.
ratings_by_easy.loc[2]
Discipline Biology
Number of Professors 11774
Clarity 3.608331
Helpfulness 3.70153
Overall Quality 3.657641
Easiness 2.710459
Name: 2, dtype: object
# The row at *position* 2 (the third row).
ratings_by_easy.iloc[2]
Discipline Finance
Number of Professors 1042
Clarity 3.346536
Helpfulness 3.443301
Overall Quality 3.396823
Easiness 2.693512
Name: 33, dtype: object
Use indirect indexing to select values from Series#
You can also index into Series values. Use indirect indexing to do this.
# Select by label.
easiness.loc['English']
3.16275414471149
# Select using Boolean Series.
easiness.loc[is_easy]
Discipline
Music 3.542273
Theater 3.584508
Physical Education 3.832250
Speech 3.674701
Child Development 3.606082
Reading 3.882635
Kinesiology 3.541439
Name: Easiness, dtype: float64
# Select by position.
# English is the first value (at position 0).
easiness.iloc[0]
3.16275414471149