Combining boolean arrays

Sometimes we want to be able to combine several different criteria to select elements from arrays or tables.

So far we have used boolean Series and arrays to select rows. This works fine when we have some simple criterion, such as whether the value in the column or array is greater than 10.

For example, consider the students' ratings dataset. Download the data file: rate_my_course.csv.

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# Read the data file
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
    Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
0      English                 23343  3.756147     3.821866         3.791364  3.162754
1  Mathematics                 22394  3.487379     3.641526         3.566867  3.063322
2      Biology                 11774  3.608331     3.701530         3.657641  2.710459
3   Psychology                 11179  3.909520     3.887536         3.900949  3.316210
4      History                 11145  3.788818     3.753642         3.773746  3.053803

We can select the rows from this table where the Easiness rating was above the median, using a boolean series:

easiness = ratings['Easiness']
is_gt_median = easiness > np.median(easiness)
is_gt_median.head()
0    False
1    False
2    False
3     True
4    False
Name: Easiness, dtype: bool
above_median = ratings[is_gt_median]
above_median.head()
        Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
 3      Psychology                 11179  3.909520     3.887536         3.900949  3.316210
 6  Communications                  6940  3.867349     3.878602         3.875019  3.379829
11       Sociology                  4839  3.740980     3.748169         3.746962  3.395819
14       Languages                  3867  3.772780     3.917949         3.846951  3.277406
17    Anthropology                  2598  3.693222     3.704761         3.701674  3.248045

What if we wanted to select the rows with Easiness values between the 25th and 75th percentiles? Here is how to get the percentile values.

q25 = np.quantile(easiness, 0.25)
q75 = np.quantile(easiness, 0.75)
print(q25, q75)
3.0283298724604153 3.34694063174731

We can do this more neatly by asking for both quantiles in one call and unpacking the result:

q25, q75 = np.quantile(easiness, [0.25, 0.75])
print(q25, q75)
3.0283298724604153 3.34694063174731
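To see why the unpacking works, here is a minimal sketch on a small made-up array. The names values, both, low and high are hypothetical and are not part of the ratings analysis. Passing a list of quantiles to np.quantile gives back an array with one value per quantile, and unpacking assigns those values to separate names.

```python
import numpy as np

# Hypothetical values, chosen so the quantiles are easy to check by hand.
values = np.array([1, 2, 3, 4, 5])

# A list of quantiles gives back an array with one result per quantile.
both = np.quantile(values, [0.25, 0.75])
print(both)

# Unpacking assigns the two array elements to two separate names.
low, high = both
print(low, high)
```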

Now we want to select the rows where the Easiness score is between these values. We can do this the long way round, by selecting twice:

# Select values above the 25th percentile.
above_q25 = ratings[easiness > q25]
# above_q25 has fewer rows than ratings, so get its remaining Easiness values.
q25_easiness = above_q25['Easiness']
# Select values below the 75th percentile.
between_25_75 = above_q25[q25_easiness < q75]
between_25_75.head()
    Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
0      English                 23343  3.756147     3.821866         3.791364  3.162754
1  Mathematics                 22394  3.487379     3.641526         3.566867  3.063322
3   Psychology                 11179  3.909520     3.887536         3.900949  3.316210
4      History                 11145  3.788818     3.753642         3.773746  3.053803
7     Business                  6120  3.640327     3.680503         3.663332  3.172033
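The two-step selection can be easier to follow on a table small enough to check by eye. The sketch below uses a made-up data frame called toy and made-up cut-offs low and high. These are assumptions for illustration, not values from the ratings data.

```python
import pandas as pd

# Hypothetical stand-in for the ratings table.
toy = pd.DataFrame({
    'Discipline': ['A', 'B', 'C', 'D', 'E'],
    'Easiness': [2.7, 3.1, 3.2, 3.4, 2.9],
})
# Hypothetical cut-offs standing in for q25 and q75.
low, high = 3.0, 3.3

# First selection: keep rows above the lower cut-off.
above_low = toy[toy['Easiness'] > low]
# The table is now shorter, so we take the Easiness column again
# before making the second comparison.
two_step = above_low[above_low['Easiness'] < high]

print(len(toy), len(above_low), len(two_step))
print(two_step)
# Check every remaining value really is between the cut-offs.
print(all(low < v < high for v in two_step['Easiness']))
```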

Another, neater way of doing this is to make a single Boolean Series that has True only if the Easiness value is both above the 25th percentile and below the 75th percentile.

This is called a logical and.

To do this we can make a Boolean Series for each of these two criteria:

# True if Easiness is above 25th percentile.
is_gt_q25 = easiness > q25
# Show the first 10 values
is_gt_q25.head(10)
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: Easiness, dtype: bool
# True if Easiness is below 75th percentile.
is_lt_q75 = easiness < q75
# Show the first 10 values
is_lt_q75.head(10)
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7     True
8     True
9     True
Name: Easiness, dtype: bool

We can combine these two with NumPy functions. The function we need in this case is np.logical_and.

np.logical_and can work on Pandas Series, or on NumPy arrays. We will use the term sequence for something that can be a Pandas Series or a NumPy array.

np.logical_and combines the two input sequences into a new sequence that only has True in positions where both of the input sequences have a True in the corresponding position:

is_between_25_75 = np.logical_and(is_gt_q25, is_lt_q75)
is_between_25_75.head(10)
0     True
1     True
2    False
3     True
4     True
5    False
6    False
7     True
8     True
9    False
Name: Easiness, dtype: bool
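As an aside that this page does not otherwise use, Pandas Series and NumPy arrays also accept the & operator for the same element-by-element and. The sketch below compares the two spellings on made-up values. The names toy_easiness, low and high are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Easiness values and cut-offs, for illustration only.
toy_easiness = pd.Series([2.7, 3.1, 3.2, 3.4, 2.9])
low, high = 3.0, 3.3

# The function form used in this page.
with_function = np.logical_and(toy_easiness > low, toy_easiness < high)
# The operator form. The parentheses are needed, because & binds
# more tightly than > and <.
with_operator = (toy_easiness > low) & (toy_easiness < high)

print(with_operator)
# True if the two results match element for element.
print(with_function.equals(with_operator))
```

The function form avoids the need to remember the parentheses, which is one reason to prefer it when starting out.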

It might be easier to see what is going on if we make some small test arrays:

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

We can show these conveniently as a DataFrame:

ab = pd.DataFrame()
ab['first input'] = a
ab['second input'] = b
ab
   first input  second input
0         True          True
1         True         False
2        False          True
3        False         False

Before you look, try to work out what you would get from np.logical_and(a, b).

Remember, the rule is that the result has True where the corresponding elements of a and b are both True, and False otherwise.

The result:

np.logical_and(a, b)
array([ True, False, False, False])

Here are the two input columns and the result, displayed as a data frame, to show them nicely:

ab['and result'] = np.logical_and(a, b)
ab
   first input  second input  and result
0         True          True        True
1         True         False       False
2        False          True       False
3        False         False       False
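One way to check the rule is to apply Python's own and to each pair of elements in turn, then compare with what np.logical_and gives. This is a sketch for checking understanding, not how NumPy does the calculation internally. The name pairwise is hypothetical.

```python
import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

# Apply Python's `and` to each pair of corresponding elements.
pairwise = [bool(x and y) for x, y in zip(a, b)]
print(pairwise)

# Compare the pair-by-pair result with np.logical_and.
print(np.array_equal(pairwise, np.logical_and(a, b)))
```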

Check that you agree with Python’s results for combining is_gt_q25 and is_lt_q75 in the same way. Here’s a display showing is_gt_q25, is_lt_q75 and the result of logical_and:

qbools = pd.DataFrame()
qbools['is_gt_q25'] = is_gt_q25
qbools['is_lt_q75'] = is_lt_q75
qbools['and_result'] = np.logical_and(is_gt_q25, is_lt_q75)
qbools.head(10)
   is_gt_q25  is_lt_q75  and_result
0       True       True        True
1       True       True        True
2      False       True       False
3       True       True        True
4       True       True        True
5      False       True       False
6       True      False       False
7       True       True        True
8       True       True        True
9      False       True       False

We can use the combined Boolean series from logical_and to select the rows that we want:

betweeners = ratings[np.logical_and(is_gt_q25, is_lt_q75)]
betweeners.head()
    Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
0      English                 23343  3.756147     3.821866         3.791364  3.162754
1  Mathematics                 22394  3.487379     3.641526         3.566867  3.063322
3   Psychology                 11179  3.909520     3.887536         3.900949  3.316210
4      History                 11145  3.788818     3.753642         3.773746  3.053803
7     Business                  6120  3.640327     3.680503         3.663332  3.172033

Notice that we only have rows where the result of logical_and is True. These are the rows with an Easiness value that is above the 25th percentile and below the 75th percentile.

You may not be surprised to know there is an equivalent function to logical_and called logical_or. Like logical_and this returns a Boolean sequence of the same length as the input sequences. There is a True in the output sequence where one or both of the input sequences have True in the corresponding positions.

a
array([ True,  True, False, False])
b
array([ True, False,  True, False])
np.logical_or(a, b)
array([ True,  True,  True, False])
ab['or result'] = np.logical_or(a, b)
ab
   first input  second input  and result  or result
0         True          True        True       True
1         True         False       False       True
2        False          True       False       True
3        False         False       False      False
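The phrase "one or both" matters. As a contrast, NumPy also has np.logical_xor, which this page does not otherwise use. It gives True only where exactly one of the inputs is True. The sketch below puts the two side by side. The names either_or_both and exactly_one are hypothetical.

```python
import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

# True where one or both inputs are True.
either_or_both = np.logical_or(a, b)
# True where exactly one input is True (shown for contrast only).
exactly_one = np.logical_xor(a, b)

print(either_or_both)
print(exactly_one)
```

The two results differ only in the first position, where both inputs are True.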

We can use this function to find all the rows that have Easiness ratings above the 75th percentile or below the 25th percentile:

easy_or_hard = ratings[np.logical_or(easiness < q25, easiness > q75)]
easy_or_hard.head()
        Discipline  Number of Professors   Clarity  Helpfulness  Overall Quality  Easiness
 2         Biology                 11774  3.608331     3.701530         3.657641  2.710459
 5       Chemistry                  7346  3.387174     3.538980         3.465485  2.652054
 6  Communications                  6940  3.867349     3.878602         3.875019  3.379829
 9       Economics                  5540  3.382735     3.483617         3.435038  2.910078
11       Sociology                  4839  3.740980     3.748169         3.746962  3.395819
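It is tempting to think of these rows as everything that the logical_and selection left out. That is nearly right, but both selections use strict comparisons (< and >), so a value exactly equal to one of the cut-offs falls in neither group. The sketch below shows this on made-up values, using np.logical_not to flip True and False. The names toy_easiness, low, high, outside, inside and on_boundary are hypothetical, and the cut-offs were chosen to match two of the values exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical values; two of them sit exactly on the cut-offs.
toy_easiness = pd.Series([2.7, 3.0, 3.2, 3.3, 3.4])
low, high = 3.0, 3.3

# Below the lower cut-off or above the upper one.
outside = np.logical_or(toy_easiness < low, toy_easiness > high)
# Strictly between the two cut-offs.
inside = np.logical_and(toy_easiness > low, toy_easiness < high)
# Neither outside nor inside: the values equal to a cut-off.
on_boundary = np.logical_not(np.logical_or(outside, inside))

print(toy_easiness[on_boundary])
```

With real-valued data such as the ratings, exact ties with a percentile value may be rare, but it is worth knowing which comparison each selection uses.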