The doctrine of Pandas indexing#

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
# Read the CSV to a DataFrame.
ratings = pd.read_csv('rate_my_course.csv')
# Change the default numerical row labels to be the discipline name.
named_ratings = ratings.set_index('Discipline')
named_ratings.head()
Number of Professors Clarity Helpfulness Overall Quality Easiness
Discipline
English 23343 3.756147 3.821866 3.791364 3.162754
Mathematics 22394 3.487379 3.641526 3.566867 3.063322
Biology 11774 3.608331 3.701530 3.657641 2.710459
Psychology 11179 3.909520 3.887536 3.900949 3.316210
History 11145 3.788818 3.753642 3.773746 3.053803

The catechism#

A catechism is an exposition of doctrine. Here we exposit the doctrine of Pandas indexing. Pronounce catechism as kata-kism.

There are two routes to Pandas indexing:

  • Direct indexing (square brackets follow the value).

  • Indirect indexing (square brackets follow .loc or .iloc.

Direct indexing - keep it Simple#

When using direct indexing you should restrict yourself to simple indexing. Simple indexing is the name we use when you use only one of two types of things within the square brackets:

  • Column labels (direct indexing with column labels - DICL).

  • A Boolean Series (direct indexing with Boolean Series - DIBS).

Direct indexing with Column Labels#

# DICL - one column label between square brackets gives a Series.
easiness = named_ratings['Easiness']
easiness
Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64
# DICL again - more than one column label gives a DataFrame
column_labels = ['Clarity', 'Easiness']
thinner_ratings = named_ratings[column_labels]
thinner_ratings
Clarity Easiness
Discipline
English 3.756147 3.162754
Mathematics 3.487379 3.063322
Biology 3.608331 2.710459
Psychology 3.909520 3.316210
History 3.788818 3.053803
... ... ...
Anatomy 3.932991 2.863504
Earth Science 3.636182 3.106727
Linguistics 3.749000 3.309636
Mechanical Engineering 3.441923 2.799135
Medicine 3.927255 3.109118

75 rows × 2 columns

Direct indexing with Boolean Series#

# Create a Boolean Series by doing a comparison on a Series
is_easy = easiness > 3.5
is_easy
Discipline
English                   False
Mathematics               False
Biology                   False
Psychology                False
History                   False
                          ...  
Anatomy                   False
Earth Science             False
Linguistics               False
Mechanical Engineering    False
Medicine                  False
Name: Easiness, Length: 75, dtype: bool
# DIBS selects *rows*, giving a DataFrame with one row per True
# element in the Boolean Series
easy_ratings = named_ratings[is_easy]
easy_ratings
Number of Professors Clarity Helpfulness Overall Quality Easiness
Discipline
Music 2455 3.844509 3.787804 3.818114 3.542273
Theater 1078 3.876633 3.821503 3.851837 3.584508
Physical Education 991 4.078698 4.030797 4.057719 3.832250
Speech 351 4.133191 4.101197 4.119345 3.674701
Child Development 171 3.950585 4.008070 3.979766 3.606082
Reading 148 4.159392 4.188919 4.177500 3.882635
Kinesiology 132 3.995000 3.972879 3.988712 3.541439

Indirect indexing - two indexing targets#

You can target (look at) the row and column labels using .loc (think L for Label indexing).

You can target (look at) the row and column positions using .iloc (think I for Integer indexing).

.loc gives indirect indexing targeting labels#

# Select a single row by label, to get a Series.
named_ratings.loc['English']
Number of Professors    23343.000000
Clarity                     3.756147
Helpfulness                 3.821866
Overall Quality             3.791364
Easiness                    3.162754
Name: English, dtype: float64
# Select a single value by specifying the row and column label.
named_ratings.loc['English', 'Easiness']
3.16275414471149
# Select a column by label, asking for all rows (the ':').
named_ratings.loc[:, 'Easiness']
Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64

You can select values with Boolean Series using indirect indexing:

# Select rows where Easiness > 3.5 using .loc with Boolean Series.
named_ratings.loc[is_easy]
Number of Professors Clarity Helpfulness Overall Quality Easiness
Discipline
Music 2455 3.844509 3.787804 3.818114 3.542273
Theater 1078 3.876633 3.821503 3.851837 3.584508
Physical Education 991 4.078698 4.030797 4.057719 3.832250
Speech 351 4.133191 4.101197 4.119345 3.674701
Child Development 171 3.950585 4.008070 3.979766 3.606082
Reading 148 4.159392 4.188919 4.177500 3.882635
Kinesiology 132 3.995000 3.972879 3.988712 3.541439
# Select Clarity column from rows where Easiness > 3.5
named_ratings.loc[is_easy, 'Clarity']
Discipline
Music                 3.844509
Theater               3.876633
Physical Education    4.078698
Speech                4.133191
Child Development     3.950585
Reading               4.159392
Kinesiology           3.995000
Name: Clarity, dtype: float64

.iloc gives indirect indexing targeting positions#

# Select a single row by position, to get a Series.
# The first row (by position) is English.
named_ratings.iloc[0]
Number of Professors    23343.000000
Clarity                     3.756147
Helpfulness                 3.821866
Overall Quality             3.791364
Easiness                    3.162754
Name: English, dtype: float64
# Select a single value by specifying the row and column positions.
# Easiness is the last column (position -1).
named_ratings.iloc[0, -1]
3.16275414471149
# Select a column by position, asking for all rows (the ':').
named_ratings.iloc[:, -1]
Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64

In fact, you can’t use Boolean Series for indexing with .iloc, because Pandas won’t let you use a thing with labels for iloc indexing.

# You can't use Boolean Series for iloc indexing.
named_ratings.iloc[is_easy]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 2
      1 # You can't use Boolean Series for iloc indexing.
----> 2 named_ratings.iloc[is_easy]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1738, in _iLocIndexer._getitem_axis(self, key, axis)
   1735     key = np.asarray(key)
   1737 if com.is_bool_indexer(key):
-> 1738     self._validate_key(key, axis)
   1739     return self._getbool_axis(key, axis=axis)
   1741 # a list of integers

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1583, in _iLocIndexer._validate_key(self, key, axis)
   1577         if key.index.inferred_type == "integer":
   1578             raise NotImplementedError(
   1579                 "iLocation based boolean "
   1580                 "indexing on an integer type "
   1581                 "is not available"
   1582             )
-> 1583         raise ValueError(
   1584             "iLocation based boolean indexing cannot use "
   1585             "an indexable as a mask"
   1586         )
   1587     return
   1589 if isinstance(key, slice):

ValueError: iLocation based boolean indexing cannot use an indexable as a mask

If you really want to do Boolean indexing with .iloc, strip the labels (the index) off the thing you are using for indexing:

is_easy_arr = np.array(is_easy)   # Strip off the labels.
named_ratings.iloc[is_easy_arr]
Number of Professors Clarity Helpfulness Overall Quality Easiness
Discipline
Music 2455 3.844509 3.787804 3.818114 3.542273
Theater 1078 3.876633 3.821503 3.851837 3.584508
Physical Education 991 4.078698 4.030797 4.057719 3.832250
Speech 351 4.133191 4.101197 4.119345 3.674701
Child Development 171 3.950585 4.008070 3.979766 3.606082
Reading 148 4.159392 4.188919 4.177500 3.882635
Kinesiology 132 3.995000 3.972879 3.988712 3.541439

Trap! Alarm! The row labels can be integers#

Unless you tell it otherwise, Pandas will default to giving rows labels that are integers, rather than names. Therefore your row labels will often be integers. That can be confusing, because the labels can look like positions, when they are not.

# The default row labels are sequential integers.
ratings = pd.read_csv('rate_my_course.csv')
ratings
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
0 English 23343 3.756147 3.821866 3.791364 3.162754
1 Mathematics 22394 3.487379 3.641526 3.566867 3.063322
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
3 Psychology 11179 3.909520 3.887536 3.900949 3.316210
4 History 11145 3.788818 3.753642 3.773746 3.053803
... ... ... ... ... ... ...
70 Anatomy 117 3.932991 3.974701 3.954188 2.863504
71 Earth Science 110 3.636182 3.671364 3.655091 3.106727
72 Linguistics 110 3.749000 3.834545 3.798182 3.309636
73 Mechanical Engineering 104 3.441923 3.531154 3.489327 2.799135
74 Medicine 102 3.927255 3.934216 3.929118 3.109118

75 rows × 6 columns

The numbers are labels and not positions, but we can’t see that here, because the labels and the positions are the same. You can show the numbers are labels by selecting rows, or sorting the DataFrame. Let’s sort the DataFrame by ‘Easiness’ to demonstrate.

# Notice that row labels are not sequential any more.
# The labels stick to the row they started with.
ratings_by_easy = ratings.sort_values('Easiness')
ratings_by_easy
Discipline Number of Professors Clarity Helpfulness Overall Quality Easiness
5 Chemistry 7346 3.387174 3.538980 3.465485 2.652054
40 Nursing 641 3.613385 3.608081 3.613573 2.676069
33 Finance 1042 3.346536 3.443301 3.396823 2.693512
2 Biology 11774 3.608331 3.701530 3.657641 2.710459
13 Accounting 3895 3.484249 3.572727 3.531315 2.784706
... ... ... ... ... ... ...
31 Theater 1078 3.876633 3.821503 3.851837 3.584508
61 Child Development 171 3.950585 4.008070 3.979766 3.606082
49 Speech 351 4.133191 4.101197 4.119345 3.674701
36 Physical Education 991 4.078698 4.030797 4.057719 3.832250
64 Reading 148 4.159392 4.188919 4.177500 3.882635

75 rows × 6 columns

Distinguish between label and position using .loc and .iloc:

# The row *labeled* with the number 2.
ratings_by_easy.loc[2]
Discipline               Biology
Number of Professors       11774
Clarity                 3.608331
Helpfulness              3.70153
Overall Quality         3.657641
Easiness                2.710459
Name: 2, dtype: object
# The row at *position* 2 (the third row).
ratings_by_easy.iloc[2]
Discipline               Finance
Number of Professors        1042
Clarity                 3.346536
Helpfulness             3.443301
Overall Quality         3.396823
Easiness                2.693512
Name: 33, dtype: object

Use indirect indexing to select values from Series#

You can also index into Series values. Use indirect indexing to do this.

# Select by label.
easiness.loc['English']
3.16275414471149
# Select using Boolean Series.
easiness.loc[is_easy]
Discipline
Music                 3.542273
Theater               3.584508
Physical Education    3.832250
Speech                3.674701
Child Development     3.606082
Reading               3.882635
Kinesiology           3.541439
Name: Easiness, dtype: float64
# Select by position.
# English is the first value (at position 0).
easiness.iloc[0]
3.16275414471149