Combining boolean arrays
Sometimes we want to be able to combine several different criteria to select elements from arrays or tables.
So far we have used boolean Series and arrays to select rows. This works fine when we have some simple criterion, such as whether the value in the column or array is greater than 10.
For example, consider the student ratings dataset. Download the data file via rate_my_course.csv.
import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# Read the data file
ratings = pd.read_csv('rate_my_course.csv')
ratings.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
We can select the rows from this table where the Easiness rating was above the median, using a boolean series:
easiness = ratings['Easiness']
is_gt_median = easiness > np.median(easiness)
is_gt_median.head()
0 False
1 False
2 False
3 True
4 False
Name: Easiness, dtype: bool
above_median = ratings[is_gt_median]
above_median.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
11 | Sociology | 4839 | 3.740980 | 3.748169 | 3.746962 | 3.395819 |
14 | Languages | 3867 | 3.772780 | 3.917949 | 3.846951 | 3.277406 |
17 | Anthropology | 2598 | 3.693222 | 3.704761 | 3.701674 | 3.248045 |
What if we wanted to select the rows where the Easiness rating is between the 25th and 75th percentiles? Here’s how to get the percentile values.
q25 = np.quantile(easiness, 0.25)
q75 = np.quantile(easiness, 0.75)
print(q25, q75)
3.0283298724604153 3.34694063174731
We can do this more neatly with unpacking:
q25, q75 = np.quantile(easiness, [0.25, 0.75])
print(q25, q75)
3.0283298724604153 3.34694063174731
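To see what the unpacking is doing, here is a minimal sketch on a small made-up array (the values are hypothetical, not taken from the ratings data). Passing a list of quantiles to np.quantile gives back an array with one value per quantile, and that array can be unpacked into two names.

```python
import numpy as np

# Hypothetical values, chosen so the quantiles are easy to check by eye.
values = np.array([1, 2, 3, 4, 5])

# Asking for two quantiles at once returns an array of length two.
both = np.quantile(values, [0.25, 0.75])
print(both)

# Unpacking assigns the first element to low and the second to high.
low, high = both
print(low, high)
```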
Now we want to select the rows where the Easiness score is between these values. We can do this the long way round, by selecting twice:
# Select values above the 25th percentile.
above_q25 = ratings[easiness > q25]
# above_q25 has fewer rows than ratings, so we need the Easiness values for the remaining rows.
q25_easiness = above_q25['Easiness']
# Select values below the 75th percentile.
between_25_75 = above_q25[q25_easiness < q75]
between_25_75.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
Another, neater way of doing this is to make a single Boolean Series that has True only if the Easiness value is both above the 25th percentile and below the 75th percentile.
This is called a logical and.
To do this we can make a Boolean Series for each of these two criteria:
# True if Easiness is above 25th percentile.
is_gt_q25 = easiness > q25
# Show the first 10 values
is_gt_q25.head(10)
0 True
1 True
2 False
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: Easiness, dtype: bool
# True if Easiness is below 75th percentile.
is_lt_q75 = easiness < q75
# Show the first 10 values
is_lt_q75.head(10)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
9 True
Name: Easiness, dtype: bool
We can combine these two with Numpy functions. The function we need in this case is np.logical_and.
np.logical_and can work on Pandas Series, or on Numpy arrays. We will use the term sequence for something that can be a Pandas Series or a Numpy array.
np.logical_and combines the two input sequences into a new sequence that only has True in positions where both of the input sequences have a True in the corresponding position:
is_between_25_75 = np.logical_and(is_gt_q25, is_lt_q75)
is_between_25_75.head(10)
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 False
Name: Easiness, dtype: bool
It might be easier to see what is going on if we make some small test arrays:
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
We can show these conveniently as a DataFrame:
ab = pd.DataFrame()
ab['first input'] = a
ab['second input'] = b
ab
 | first input | second input |
---|---|---|
0 | True | True |
1 | True | False |
2 | False | True |
3 | False | False |
Before you look, try to work out what you would get from np.logical_and(a, b).
Remember, the rule is: the result will have True where the corresponding elements of both a and b are True, and False otherwise.
The result:
np.logical_and(a, b)
array([ True, False, False, False])
Here are the two input columns and the result, displayed as a data frame, to show them nicely:
ab['and result'] = np.logical_and(a, b)
ab
 | first input | second input | and result |
---|---|---|---|
0 | True | True | True |
1 | True | False | False |
2 | False | True | False |
3 | False | False | False |
Check that you agree with Python’s results for combining is_gt_q25 and is_lt_q75 in the same way. Here’s a display showing is_gt_q25, is_lt_q75 and the result of logical_and:
qbools = pd.DataFrame()
qbools['is_gt_q25'] = is_gt_q25
qbools['is_lt_q75'] = is_lt_q75
qbools['and_result'] = np.logical_and(is_gt_q25, is_lt_q75)
qbools.head(10)
 | is_gt_q25 | is_lt_q75 | and_result |
---|---|---|---|
0 | True | True | True |
1 | True | True | True |
2 | False | True | False |
3 | True | True | True |
4 | True | True | True |
5 | False | True | False |
6 | True | False | False |
7 | True | True | True |
8 | True | True | True |
9 | False | True | False |
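As noted earlier, np.logical_and accepts either Pandas Series or Numpy arrays. The sketch below, using made-up scores rather than the ratings data, suggests that the type of the result follows the type of the inputs: two Series give a Series, and two arrays give an array. The names scores, is_gt_3 and is_lt_3_3 are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not taken from the ratings data.
scores = pd.Series([2.5, 3.1, 3.4, 2.9, 3.2])

# Two Boolean Series, one per criterion.
is_gt_3 = scores > 3
is_lt_3_3 = scores < 3.3

# Series in, Series out.
both_series = np.logical_and(is_gt_3, is_lt_3_3)
print(type(both_series))
print(both_series)

# Arrays in, array out.
both_array = np.logical_and(np.array(is_gt_3), np.array(is_lt_3_3))
print(type(both_array))
print(both_array)
```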
We can use the combined Boolean series from logical_and to select the rows that we want:
betweeners = ratings[np.logical_and(is_gt_q25, is_lt_q75)]
betweeners.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
Notice that we only have rows where there is a corresponding True value in the result of logical_and, and therefore that we only have rows where the Easiness rating is above the 25th percentile and below the 75th percentile.
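One way to convince yourself that the combined Boolean Series does the same job as selecting twice is to try both routes on a small table and compare the results. This is a minimal sketch under stated assumptions: the toy table and its values are made up, and the names toy, lo_q and hi_q are hypothetical stand-ins for ratings, q25 and q75.

```python
import numpy as np
import pandas as pd

# Hypothetical table standing in for the ratings data.
toy = pd.DataFrame({
    'Discipline': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Easiness': [3.2, 2.7, 3.4, 3.0, 3.1, 2.6, 3.3, 2.9],
})
toy_easiness = toy['Easiness']
lo_q, hi_q = np.quantile(toy_easiness, [0.25, 0.75])

# The long way round: select twice.
above_lo = toy[toy_easiness > lo_q]
two_step = above_lo[above_lo['Easiness'] < hi_q]

# The shorter way: one combined Boolean Series.
one_step = toy[np.logical_and(toy_easiness > lo_q, toy_easiness < hi_q)]

# Compare the two selections, row for row.
print(two_step.equals(one_step))
```

The comparison uses the DataFrame equals method, which checks that the two tables have the same shape and the same values.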
You may not be surprised to know there is an equivalent function to logical_and called logical_or. Like logical_and, this returns a Boolean sequence of the same length as the input sequences. There is a True in the output sequence where one or both of the input sequences have True in the corresponding positions.
a
array([ True, True, False, False])
b
array([ True, False, True, False])
np.logical_or(a, b)
array([ True, True, True, False])
ab['or result'] = np.logical_or(a, b)
ab
 | first input | second input | and result | or result |
---|---|---|---|---|
0 | True | True | True | True |
1 | True | False | False | True |
2 | False | True | False | True |
3 | False | False | False | False |
We can use this function to find all the rows that have Easiness ratings above the 75th percentile or below the 25th percentile:
easy_or_hard = ratings[np.logical_or(easiness < q25, easiness > q75)]
easy_or_hard.head()
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
9 | Economics | 5540 | 3.382735 | 3.483617 | 3.435038 | 2.910078 |
11 | Sociology | 4839 | 3.740980 | 3.748169 | 3.746962 | 3.395819 |
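The "outside the quartiles" selection made with logical_or and the "between the quartiles" selection made with logical_and should never pick the same row. The sketch below checks this on a made-up table; the values and the names toy, lo_q, hi_q, is_middle and is_outer are hypothetical. In this toy data no value sits exactly on a quantile, so every row ends up in exactly one of the two groups; a value exactly equal to a quantile would be in neither.

```python
import numpy as np
import pandas as pd

# Hypothetical table standing in for the ratings data.
toy = pd.DataFrame({
    'Discipline': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Easiness': [3.2, 2.7, 3.4, 3.0, 3.1, 2.6, 3.3, 2.9],
})
toy_easiness = toy['Easiness']
lo_q, hi_q = np.quantile(toy_easiness, [0.25, 0.75])

# True for rows strictly between the two quantiles.
is_middle = np.logical_and(toy_easiness > lo_q, toy_easiness < hi_q)
# True for rows below the lower quantile or above the upper one.
is_outer = np.logical_or(toy_easiness < lo_q, toy_easiness > hi_q)

# Summing a Boolean Series counts its True values.
n_in_both = np.logical_and(is_middle, is_outer).sum()
print(n_in_both)
print(is_middle.sum(), is_outer.sum(), len(toy))
```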