Booleans and labels

Booleans and labels#

We have already used Boolean series to index data frames and other series.

This page gives a little more details about how that works, in order to explain some subtleties in results.

We return to the World Bank statistics on development and gender in gender_stats.csv.

Download that file to the same directory as this notebook, if you are running on your own computer.

Here we pull out the rows for the countries with the highest Gross Domestic Product. This may be familiar from the data frame intro and the Pandas indexing page.

import numpy as np
import pandas as pd
# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

# Load the data file as a data frame.
gender_data = pd.read_csv('gender_stats.csv')

# Sort by GDP.
df_by_gdp = gender_data.sort_values('gdp_us_billion', ascending=False)
# Take the top 5 rows.
richest_5 = df_by_gdp.head(5)
richest_5

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
202	United States	USA	1.860875	17369.124600	9060.068657	8.121961	48.758830	14.00	318.558175
35	China	CHN	1.558750	10182.790479	657.748859	3.015530	46.297964	28.75	1364.446000
97	Japan	JPN	1.430000	5106.024760	3687.126279	8.496074	48.744199	5.75	127.297102
49	Germany	DEU	1.450000	3601.226158	4909.659884	8.542615	48.568695	6.25	81.281645
67	United Kingdom	GBR	1.842500	2768.864417	3357.983675	7.720655	48.791809	9.25	64.641557

Consider the index (row labels) of this 5-row data frame:

richest_5.index

Index([202, 35, 97, 49, 67], dtype='int64')

Now let us say we want to select some of these rows with Boolean indexing. Here is a Boolean series with True for rows where health_exp_per_cap is greater than 3500 dollars, False otherwise.

# Create a Boolean series with True for big spender rows, False otherwise.
is_big_spender = richest_5['health_exp_per_cap'] > 3500
is_big_spender

   True
   False
    True
    True
   False
Name: health_exp_per_cap, dtype: bool

Notice that this Series, like all Series, has an index, which is the same as the index from the Series we used to make it (richest_5['health_exp_per_cap']), and therefore, the same as the index for the data frame (richest_5).

is_big_spender.index

Index([202, 35, 97, 49, 67], dtype='int64')

Let’s say we are interested in the country_name for countries spending more than 3500 dollars per capita on health care. We can do this with, for example, loc:

richest_5.loc[is_big_spender, 'country_name']

  United States
           Japan
         Germany
Name: country_name, dtype: object

loc gives us the rows for which corresponding elements in is_big_spender are True.

But what does corresponding mean? Does it mean corresponding in terms of position? Or in terms of index (row / element labels)?

Here’s the Boolean series again:

is_big_spender

   True
   False
    True
    True
   False
Name: health_exp_per_cap, dtype: bool

If corresponding means corresponding in terms of position then we would take the first row of richest_5 (because the first element in is_big_spender is True) but not the second row (because the second element in is_big_spender is False) and so on.

If corresponding means corresponding in terms of labels then we would take the row from richest_5 with label 202 (because the element labeled 202 in is_big_spender is True), but we do not take the row labeled 35 (because the element labeled 35 in is_big_spender is False), and so on.

We cannot distinguish between these alternatives at the moment, because the order of the labels is the same in the data frame richest_5 and the series is_big_spender, so positions and labels will give the same answer.

We can distinguish when the order of the labels is not the same in the Boolean series and the data frame. For example, we can sort the series, like this:

sorted_is_big_spender = is_big_spender.sort_values()
sorted_is_big_spender

   False
   False
   True
    True
    True
Name: health_exp_per_cap, dtype: bool

Now the order of the row labels is different in the series (above) and the data frame (below):

richest_5.index

Index([202, 35, 97, 49, 67], dtype='int64')

If we index with this sorted series, and indexing uses the position of the True and False values, we would expect to see the last three rows, because the values in the last three positions are True (and the others are False).

If indexing uses labels, then we expect to get the same answer we got before, because the relationship between the True / False values and labels hasn’t changed - the elements labeled 202, 97 and 49 have True values, so we expect to see those rows in the result. So (drum roll):

richest_5.loc[sorted_is_big_spender, 'country_name']

  United States
           Japan
         Germany
Name: country_name, dtype: object

The result is the same as for the not-sorted series - and we have found that loc indexing uses the labels rather than the positions when indexing with a Boolean series.

On reflection, maybe that is not surprising - after all, we already know that loc indexing does indexing by label.

We usually will not have to worry about the difference between labels and positions in Boolean Series, because, usually, the row labels and positions are the same in the thing we index (data frame, or Series) and the Boolean series we use for indexing. This was the case at the top of this page, and in all previous times you have seen Boolean indexing on data frames or series.

What about iloc?#

iloc does indexing by position - so what approach does iloc take to the labels on a Boolean series?

We guess that iloc does not use the labels on a Boolean series, but what does it do? Does it throw away the labels and just look at the positions of the True and False values? Or something else?

# iloc indexing with a Boolean series.
# The 'country_name' column is the first column,
# so it is at offset 0 in the columns.
richest_5.iloc[sorted_is_big_spender, 0]

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[11], line 4
      1 # iloc indexing with a Boolean series.
      2 # The 'country_name' column is the first column,
      3 # so it is at offset 0 in the columns.
----> 4 richest_5.iloc[sorted_is_big_spender, 0]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
   1689 def _getitem_tuple(self, tup: tuple):
-> 1690     tup = self._validate_tuple_indexer(tup)
   1691     with suppress(IndexingError):
   1692         return self._getitem_lowerdim(tup)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
    964 for i, k in enumerate(key):
    965     try:
--> 966         self._validate_key(k, i)
    967     except ValueError as err:
    968         raise ValueError(
    969             "Location based indexing can only have "
    970             f"[{self._valid_types}] types"
    971         ) from err

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1578, in _iLocIndexer._validate_key(self, key, axis)
   1576 if hasattr(key, "index") and isinstance(key.index, Index):
   1577     if key.index.inferred_type == "integer":
-> 1578         raise NotImplementedError(
   1579             "iLocation based boolean "
   1580             "indexing on an integer type "
   1581             "is not available"
   1582         )
   1583     raise ValueError(
   1584         "iLocation based boolean indexing cannot use "
   1585         "an indexable as a mask"
   1586     )
   1587 return

NotImplementedError: iLocation based boolean indexing on an integer type is not available

It does something else - it gives an error, telling us that it will not use the index. This is so whatever the index order - even if the index and the positions give the same answer:

# iloc with the original Boolean series, where positions
# and labels match.
richest_5.iloc[is_big_spender, 0]

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[12], line 3
      1 # iloc with the original Boolean series, where positions
      2 # and labels match.
----> 3 richest_5.iloc[is_big_spender, 0]

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
   1689 def _getitem_tuple(self, tup: tuple):
-> 1690     tup = self._validate_tuple_indexer(tup)
   1691     with suppress(IndexingError):
   1692         return self._getitem_lowerdim(tup)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
    964 for i, k in enumerate(key):
    965     try:
--> 966         self._validate_key(k, i)
    967     except ValueError as err:
    968         raise ValueError(
    969             "Location based indexing can only have "
    970             f"[{self._valid_types}] types"
    971         ) from err

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py:1578, in _iLocIndexer._validate_key(self, key, axis)
   1576 if hasattr(key, "index") and isinstance(key.index, Index):
   1577     if key.index.inferred_type == "integer":
-> 1578         raise NotImplementedError(
   1579             "iLocation based boolean "
   1580             "indexing on an integer type "
   1581             "is not available"
   1582         )
   1583     raise ValueError(
   1584         "iLocation based boolean indexing cannot use "
   1585         "an indexable as a mask"
   1586     )
   1587 return

NotImplementedError: iLocation based boolean indexing on an integer type is not available

If you want to use Boolean indexing with iloc, you can use a Boolean sequence without row / element labels, such as a Numpy array. We can achieve that by converting the Boolean Series to a Numpy array, and therefore, throwing away the index (element labels):

sorted_is_big_spender_arr = np.array(sorted_is_big_spender)
sorted_is_big_spender_arr

array([False, False,  True,  True,  True])

This Boolean sequence will work with iloc, because it has no index (element labels) to trip it up:

# You can use a Boolean array to index rows, with "iloc".
richest_5.iloc[sorted_is_big_spender_arr, 0]

           Japan
         Germany
  United Kingdom
Name: country_name, dtype: object

Notice this does gives us the rows where the matching position has True.

Booleans and labels

Contents

Booleans and labels#

What about iloc?#