The doctrine of Pandas indexing

The doctrine of Pandas indexing#

# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

# Read the CSV to a DataFrame.
ratings = pd.read_csv('rate_my_course.csv')
# Change the default numerical row labels to be the discipline name.
named_ratings = ratings.set_index('Discipline')
named_ratings.head()

	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
Discipline
English	23343	3.756147	3.821866	3.791364	3.162754
Mathematics	22394	3.487379	3.641526	3.566867	3.063322
Biology	11774	3.608331	3.701530	3.657641	2.710459
Psychology	11179	3.909520	3.887536	3.900949	3.316210
History	11145	3.788818	3.753642	3.773746	3.053803

The catechism#

A catechism is an exposition of doctrine. Here we exposit the doctrine of Pandas indexing. Pronounce catechism as kata-kism.

There are two routes to Pandas indexing:

Direct indexing (square brackets follow the value).
Indirect indexing (square brackets follow .loc or .iloc.

Direct indexing - keep it Simple#

When using direct indexing you should restrict yourself to simple indexing. Simple indexing is the name we use when you use only one of two types of things within the square brackets:

Column labels (direct indexing with column labels - DICL).
A Boolean Series (direct indexing with Boolean Series - DIBS).

Direct indexing with Column Labels#

# DICL - one column label between square brackets gives a Series.
easiness = named_ratings['Easiness']
easiness

Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64

# DICL again - more than one column label gives a DataFrame
column_labels = ['Clarity', 'Easiness']
thinner_ratings = named_ratings[column_labels]
thinner_ratings

	Clarity	Easiness
Discipline
English	3.756147	3.162754
Mathematics	3.487379	3.063322
Biology	3.608331	2.710459
Psychology	3.909520	3.316210
History	3.788818	3.053803
...	...	...
Anatomy	3.932991	2.863504
Earth Science	3.636182	3.106727
Linguistics	3.749000	3.309636
Mechanical Engineering	3.441923	2.799135
Medicine	3.927255	3.109118

75 rows × 2 columns

Direct indexing with Boolean Series#

# Create a Boolean Series by doing a comparison on a Series
is_easy = easiness > 3.5
is_easy

Discipline
English                   False
Mathematics               False
Biology                   False
Psychology                False
History                   False
                          ...  
Anatomy                   False
Earth Science             False
Linguistics               False
Mechanical Engineering    False
Medicine                  False
Name: Easiness, Length: 75, dtype: bool

# DIBS selects *rows*, giving a DataFrame with one row per True
# element in the Boolean Series
easy_ratings = named_ratings[is_easy]
easy_ratings

	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
Discipline
Music	2455	3.844509	3.787804	3.818114	3.542273
Theater	1078	3.876633	3.821503	3.851837	3.584508
Physical Education	991	4.078698	4.030797	4.057719	3.832250
Speech	351	4.133191	4.101197	4.119345	3.674701
Child Development	171	3.950585	4.008070	3.979766	3.606082
Reading	148	4.159392	4.188919	4.177500	3.882635
Kinesiology	132	3.995000	3.972879	3.988712	3.541439

Indirect indexing - two indexing targets#

You can target (look at) the row and column labels using .loc (think L for Label indexing).

You can target (look at) the row and column positions using .iloc (think I for Integer indexing).

`.loc` gives indirect indexing targeting labels#

# Select a single row by label, to get a Series.
named_ratings.loc['English']

Number of Professors    23343.000000
Clarity                     3.756147
Helpfulness                 3.821866
Overall Quality             3.791364
Easiness                    3.162754
Name: English, dtype: float64

# Select a single value by specifying the row and column label.
named_ratings.loc['English', 'Easiness']

3.16275414471149

# Select a column by label, asking for all rows (the ':').
named_ratings.loc[:, 'Easiness']

Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64

You can select values with Boolean Series using indirect indexing:

# Select rows where Easiness > 3.5 using .loc with Boolean Series.
named_ratings.loc[is_easy]

	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
Discipline
Music	2455	3.844509	3.787804	3.818114	3.542273
Theater	1078	3.876633	3.821503	3.851837	3.584508
Physical Education	991	4.078698	4.030797	4.057719	3.832250
Speech	351	4.133191	4.101197	4.119345	3.674701
Child Development	171	3.950585	4.008070	3.979766	3.606082
Reading	148	4.159392	4.188919	4.177500	3.882635
Kinesiology	132	3.995000	3.972879	3.988712	3.541439

# Select Clarity column from rows where Easiness > 3.5
named_ratings.loc[is_easy, 'Clarity']

Discipline
Music                 3.844509
Theater               3.876633
Physical Education    4.078698
Speech                4.133191
Child Development     3.950585
Reading               4.159392
Kinesiology           3.995000
Name: Clarity, dtype: float64

`.iloc` gives indirect indexing targeting positions#

# Select a single row by position, to get a Series.
# The first row (by position) is English.
named_ratings.iloc[0]

Number of Professors    23343.000000
Clarity                     3.756147
Helpfulness                 3.821866
Overall Quality             3.791364
Easiness                    3.162754
Name: English, dtype: float64

# Select a single value by specifying the row and column positions.
# Easiness is the last column (position -1).
named_ratings.iloc[0, -1]

3.16275414471149

# Select a column by position, asking for all rows (the ':').
named_ratings.iloc[:, -1]

Discipline
English                   3.162754
Mathematics               3.063322
Biology                   2.710459
Psychology                3.316210
History                   3.053803
                            ...   
Anatomy                   2.863504
Earth Science             3.106727
Linguistics               3.309636
Mechanical Engineering    2.799135
Medicine                  3.109118
Name: Easiness, Length: 75, dtype: float64

In fact, you can’t use Boolean Series for indexing with .iloc, because Pandas won’t let you use a thing with labels for iloc indexing.

# You can't use Boolean Series for iloc indexing.
named_ratings.iloc[is_easy]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 2
      1 # You can't use Boolean Series for iloc indexing.
----> 2 named_ratings.iloc[is_easy]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1738, in _iLocIndexer._getitem_axis(self, key, axis)
   1735     key = np.asarray(key)
   1737 if com.is_bool_indexer(key):
-> 1738     self._validate_key(key, axis)
   1739     return self._getbool_axis(key, axis=axis)
   1741 # a list of integers

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1583, in _iLocIndexer._validate_key(self, key, axis)
   1577         if key.index.inferred_type == "integer":
   1578             raise NotImplementedError(
   1579                 "iLocation based boolean "
   1580                 "indexing on an integer type "
   1581                 "is not available"
   1582             )
-> 1583         raise ValueError(
   1584             "iLocation based boolean indexing cannot use "
   1585             "an indexable as a mask"
   1586         )
   1587     return
   1589 if isinstance(key, slice):

ValueError: iLocation based boolean indexing cannot use an indexable as a mask

If you really want to do Boolean indexing with .iloc, strip the labels (the index) off the thing you are using for indexing:

is_easy_arr = np.array(is_easy)   # Strip off the labels.
named_ratings.iloc[is_easy_arr]

	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
Discipline
Music	2455	3.844509	3.787804	3.818114	3.542273
Theater	1078	3.876633	3.821503	3.851837	3.584508
Physical Education	991	4.078698	4.030797	4.057719	3.832250
Speech	351	4.133191	4.101197	4.119345	3.674701
Child Development	171	3.950585	4.008070	3.979766	3.606082
Reading	148	4.159392	4.188919	4.177500	3.882635
Kinesiology	132	3.995000	3.972879	3.988712	3.541439

Trap! Alarm! The row labels can be integers#

Unless you tell it otherwise, Pandas will default to giving rows labels that are integers, rather than names. Therefore your row labels will often be integers. That can be confusing, because the labels can look like positions, when they are not.

# The default row labels are sequential integers.
ratings = pd.read_csv('rate_my_course.csv')
ratings

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
0	English	23343	3.756147	3.821866	3.791364	3.162754
1	Mathematics	22394	3.487379	3.641526	3.566867	3.063322
2	Biology	11774	3.608331	3.701530	3.657641	2.710459
3	Psychology	11179	3.909520	3.887536	3.900949	3.316210
4	History	11145	3.788818	3.753642	3.773746	3.053803
...	...	...	...	...	...	...
70	Anatomy	117	3.932991	3.974701	3.954188	2.863504
71	Earth Science	110	3.636182	3.671364	3.655091	3.106727
72	Linguistics	110	3.749000	3.834545	3.798182	3.309636
73	Mechanical Engineering	104	3.441923	3.531154	3.489327	2.799135
74	Medicine	102	3.927255	3.934216	3.929118	3.109118

75 rows × 6 columns

The numbers are labels and not positions, but we can’t see that here, because the labels and the positions are the same. You can show the numbers are labels by selecting rows, or sorting the DataFrame. Let’s sort the DataFrame by ‘Easiness’ to demonstrate.

# Notice that row labels are not sequential any more.
# The labels stick to the row they started with.
ratings_by_easy = ratings.sort_values('Easiness')
ratings_by_easy

	Discipline	Number of Professors	Clarity	Helpfulness	Overall Quality	Easiness
5	Chemistry	7346	3.387174	3.538980	3.465485	2.652054
40	Nursing	641	3.613385	3.608081	3.613573	2.676069
33	Finance	1042	3.346536	3.443301	3.396823	2.693512
2	Biology	11774	3.608331	3.701530	3.657641	2.710459
13	Accounting	3895	3.484249	3.572727	3.531315	2.784706
...	...	...	...	...	...	...
31	Theater	1078	3.876633	3.821503	3.851837	3.584508
61	Child Development	171	3.950585	4.008070	3.979766	3.606082
49	Speech	351	4.133191	4.101197	4.119345	3.674701
36	Physical Education	991	4.078698	4.030797	4.057719	3.832250
64	Reading	148	4.159392	4.188919	4.177500	3.882635

75 rows × 6 columns

Distinguish between label and position using .loc and .iloc:

# The row *labeled* with the number 2.
ratings_by_easy.loc[2]

Discipline               Biology
Number of Professors       11774
Clarity                 3.608331
Helpfulness              3.70153
Overall Quality         3.657641
Easiness                2.710459
Name: 2, dtype: object

# The row at *position* 2 (the third row).
ratings_by_easy.iloc[2]

Discipline               Finance
Number of Professors        1042
Clarity                 3.346536
Helpfulness             3.443301
Overall Quality         3.396823
Easiness                2.693512
Name: 33, dtype: object

Use indirect indexing to select values from Series#

You can also index into Series values. Use indirect indexing to do this.

# Select by label.
easiness.loc['English']

3.16275414471149

# Select using Boolean Series.
easiness.loc[is_easy]

Discipline
Music                 3.542273
Theater               3.584508
Physical Education    3.832250
Speech                3.674701
Child Development     3.606082
Reading               3.882635
Kinesiology           3.541439
Name: Easiness, dtype: float64

# Select by position.
# English is the first value (at position 0).
easiness.iloc[0]

3.16275414471149