Handling Pandas safely#
A lot of Pandas’ design is for speed and efficiency.
Unfortunately, this sometimes means that it is easy to use Pandas incorrectly, and so get results that you do not expect.
If you have Pandas version 1.5 or later, you can skip this page#
This page discusses the problems that can come up when Pandas keeps links between different DataFrames and Series. As you will see below, this is the issue of Pandas copies and views.
Luckily, as of Pandas version 1.5, there is an option you can enable that lets you avoid this rather complicated distinction. If you have Pandas version 1.5 or later, we strongly suggest you enable that option, like this:
import pandas as pd
pd.set_option('mode.copy_on_write', True)
You will see that option in all the notebooks from this course, and, if you can, we suggest you set that option whenever you import and use Pandas.
You will see more details about what the option means further down this page, so read on if you are interested.
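If the same notebook may run on more than one Pandas version, one way to follow this advice is to try the option and fall back gracefully. The following is a minimal sketch, not part of the course notebooks. The helper name enable_copy_on_write is our own invention, and the sketch assumes that a Pandas version without the option raises an error that is a subclass of KeyError.

```python
import pandas as pd


def enable_copy_on_write():
    """Try to enable copy-on-write; return True if Pandas accepted the option.

    `enable_copy_on_write` is a hypothetical helper name, not part of Pandas.
    """
    try:
        pd.set_option('mode.copy_on_write', True)
    except KeyError:
        # Assumption: a Pandas version without this option raises an
        # OptionError, which is a subclass of KeyError.
        return False
    return True


if enable_copy_on_write():
    print('Copy-on-write enabled.')
else:
    print('Option not available; follow the three rules on this page.')
```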
Avoiding trouble#
The rest of this page has some background on the issue of Pandas copies and
views, and an explanation of the problems that can come up for older Pandas, or
when you do not enable the mode.copy_on_write option. We explain the
mode.copy_on_write option, and give some rules to help you stay out of
trouble, if you cannot use mode.copy_on_write.
Background: copies and views#
Consider this DataFrame, which should be familiar. It is a table where the rows are course subjects and the columns include average ratings for all University professors / lecturers teaching that subject. See the dataset page for more detail.
import pandas as pd
Notice that we have not yet enabled the mode.copy_on_write option.
We get the ratings:
all_ratings = pd.read_csv('rate_my_course.csv')
To ease some later exposition, we select the first 10 rows, and set the row labels (index) to be letters rather than numbers:
ratings = all_ratings.iloc[:10]
ratings.index = list('ABCDEFGHIJ')
ratings
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
F | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
G | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
H | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
I | Political Science | 5824 | 3.759018 | 3.748676 | 3.756197 | 3.057758 |
J | Economics | 5540 | 3.382735 | 3.483617 | 3.435038 | 2.910078 |
Now imagine that we have discovered that the rating for ‘Clarity’ in the first row is incorrect; it should be 4.0.
We get ready to make a new, fixed copy of the DataFrame, to store the modified values. We put the ‘Discipline’ column into the DataFrame to start with.
fixed_ratings = pd.DataFrame()
fixed_ratings['Discipline'] = ratings['Discipline']
Our next obvious step is to get the ‘Clarity’ column as a Pandas Series, for us to work on.
clarity = ratings['Clarity']
clarity.head()
A 3.756147
B 3.487379
C 3.608331
D 3.909520
E 3.788818
Name: Clarity, dtype: float64
We set the corrected first value:
clarity.loc['A'] = 4
clarity.head()
/tmp/ipykernel_6218/111022653.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
clarity.loc['A'] = 4
A 4.000000
B 3.487379
C 3.608331
D 3.909520
E 3.788818
Name: Clarity, dtype: float64
Notice the warning. We will come back to that soon.
Notice too that we have changed the value in the clarity Series.
Consider — what happens to the matching value in the original DataFrame?
To answer that question, we need to know what kind of thing our clarity
Series was. If you have not enabled mode.copy_on_write, clarity could be a
copy or a view.
If the clarity Series is a view, then it still refers directly to the
‘Clarity’ column in the original DataFrame ratings. A view is something
that points to the same memory. When we have a view, the view is another way
of looking at the same data. If we modify the data in the view, that means
we also modify the original DataFrame, because the data is the same.
clarity could also be a copy of the ‘Clarity’ column. A copy duplicates the
values from the original data. Therefore a copy has its own values, and its
own memory. Changing the data in the copy will have no effect on the original
DataFrame, because the data is different.
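The same distinction exists for NumPy arrays, where the behavior is easier to predict than it is in Pandas. As an analogy only (this sketch is plain NumPy with made-up numbers, not the ratings data), slicing an array gives a view, and calling .copy() on the result gives a copy:

```python
import numpy as np

original = np.array([3.76, 3.49, 3.61])

# Basic slicing of a NumPy array gives a view onto the same memory.
a_view = original[:2]
# Calling .copy() gives a new array with its own memory.
a_copy = original[:2].copy()

a_view[0] = 4.0   # This changes `original` as well.
a_copy[1] = 99.0  # This changes only the copy.

print(original)
print(np.shares_memory(original, a_view))
print(np.shares_memory(original, a_copy))
```

Here the first value of original changes to 4.0, because a_view looks at the same memory, but the second value stays at 3.49, because a_copy has its own memory.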
Note: if you have enabled mode.copy_on_write, clarity will always
(effectively) be a copy, and you will not see the behavior below.
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
We have found that the clarity Series was a view, because the change we
made to clarity also changed the value in the original DataFrame.
This may not be what you expected; you probably did not mean to change the original data.
There are two basic strategies for dealing with this problem.
New Strategy (Pandas >= 1.5): automatic copies when needed#
This strategy uses a feature that is new in Pandas version 1.5.
The summary is — always put the following line after you import Pandas, and before you execute any code using Pandas:
# Ask Pandas to make a copy under the hood, when needed.
pd.set_option('mode.copy_on_write', True)
After you apply this option, Pandas uses an algorithm to work out when to make a copy. You can think of the option as being equivalent to making everything a copy. For example, consider the problem we had above.
# The current values of the `ratings` DataFrame.
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
# A column from the DataFrame.
clarity = ratings['Clarity']
clarity.head()
A 4.000000
B 3.487379
C 3.608331
D 3.909520
E 3.788818
Name: Clarity, dtype: float64
As before, we set another corrected first value:
clarity.loc['A'] = 99
clarity.head()
A 99.000000
B 3.487379
C 3.608331
D 3.909520
E 3.788818
Name: Clarity, dtype: float64
We set clarity as we expected. But this time, with the mode.copy_on_write
option, we did not change the ratings DataFrame from which we selected the
clarity values.
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 4.000000 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
Notice that the first Clarity value in ratings did not change; it is still
4 and not 99.
The value in ratings did not change because you can think of the
ratings['Clarity'] expression as always taking a copy, not a view [1].
If you have Pandas >= 1.5, we strongly suggest you apply this strategy. And in
fact, you will see that all the notebooks in this course that import pandas
also have the magic line:
pd.set_option('mode.copy_on_write', True)
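If you want to try the option without changing the setting for the whole notebook, Pandas also has a pd.option_context context manager that applies an option inside a with block only. The following is a minimal sketch, assuming Pandas 1.5 or later. The toy table and its values are made up for illustration; they are not the ratings data.

```python
import pandas as pd

# The option applies only inside the `with` block.
with pd.option_context('mode.copy_on_write', True):
    # A made-up table, not the real ratings data.
    toy = pd.DataFrame({'Discipline': ['English', 'Mathematics'],
                        'Clarity': [3.75, 3.49]},
                       index=['A', 'B'])
    toy_clarity = toy['Clarity']
    # Modify the Series we selected from the DataFrame.
    toy_clarity.loc['A'] = 99
    # Fetch the matching value from the DataFrame.
    value_in_frame = toy.loc['A', 'Clarity']

print(toy_clarity.loc['A'])
print(value_in_frame)
```

With the option enabled, the Series holds the new value of 99, and the value in the toy DataFrame is still 3.75.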
“Chained assignment” and copy-on-write#
Remember, we have mode.copy_on_write enabled here.
Consider the following code.
row_A = ratings.loc['A'] # Effectively, a *copy* of row labeled A.
row_A.loc['Clarity'] = 199
Sure enough, you have set the row_A Clarity value:
row_A
Discipline English
Number of Professors 23343
Clarity 199
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
At this stage, with mode.copy_on_write enabled, you would expect the first
row of ratings to stay the same, because Pandas effectively copies the first
row, before doing the assignment into the copy. And you’d be right to expect
that.
ratings.loc['A']
Discipline English
Number of Professors 23343
Clarity 4.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
But — you may sometimes fail to think of this copy, and be surprised at the result. For example, consider the following code:
# "Chained assignment".
# Assigning a value to a chain of fetched values.
ratings.loc['A'].loc['Clarity'] = 199
ratings.loc['A']
/tmp/ipykernel_6218/4222313501.py:3: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.
Try using '.loc[row_indexer, col_indexer] = value' instead, to perform the assignment in a single step.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
ratings.loc['A'].loc['Clarity'] = 199
Discipline English
Number of Professors 23343
Clarity 4.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
Notice that here, ratings.loc['A'] does not change.
This kind of code is sometimes called chained assignment because you are
chaining the fetch of the values on the left hand side. First you are
fetching ratings.loc['A'] and then, from the result, you are fetching the
Clarity value. Then you are assigning to this chain of fetched values.
Chained assignment can be confusing, because the first line in the cell above
looks as if it is setting the Clarity value for the row labeled ‘A’. But
in fact, the code is exactly equivalent to the code cells just above that, and
has the same effect. That is, ratings.loc['A'] effectively results in a copy,
so ratings.loc['A'].loc['Clarity'] = 199 is setting the Clarity value to 199
in the copy, which Python will then immediately discard, because you are not
storing the copy anywhere. So, if you are not careful, you may think you are
modifying the underlying ratings DataFrame, but you are not, because of the
internal copying implied by mode.copy_on_write.
If you do want to set the Clarity value of the row labeled ‘A’, you need to
have a left hand side that does not use the chaining that you see above. To do
this, specify the row and column in a single left-hand-side expression, like
this:
# "Unchained assignment".
# Assigning a value directly to a fetched value, no chain.
ratings.loc['A', 'Clarity'] = 199 # No chain on the left-hand side.
ratings.loc['A']
Discipline English
Number of Professors 23343
Clarity 199.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
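To see the chained and unchained forms side by side, here is a minimal sketch on a small made-up table (the toy DataFrame and its values are our own, not the ratings data). It assumes Pandas 1.5 or later, enables mode.copy_on_write for this block only, and silences the chained assignment warning to keep the output short.

```python
import warnings

import pandas as pd

# Enable copy-on-write for this block only, and silence the
# chained-assignment warning so the output stays short.
with pd.option_context('mode.copy_on_write', True), warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # A made-up table, not the real ratings data.
    toy = pd.DataFrame({'Discipline': ['English', 'Mathematics'],
                        'Clarity': [3.75, 3.49]},
                       index=['A', 'B'])
    # Chained: the value goes into a temporary copy of the row.
    toy.loc['A'].loc['Clarity'] = 199
    after_chained = toy.loc['A', 'Clarity']
    # Unchained: one indexing step on the left-hand side.
    toy.loc['A', 'Clarity'] = 199
    after_unchained = toy.loc['A', 'Clarity']

print(after_chained)
print(after_unchained)
```

The value in toy is still 3.75 after the chained assignment, and only becomes 199 after the unchained assignment.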
Old Strategy (for Pandas < 1.5): three simple rules#
But now we return to the older, darker world of Pandas < 1.5, where you cannot
enable mode.copy_on_write. What should you do then? In the rest of the
page, we suggest and explain three simple rules to stay out of trouble.
As your understanding increases, you may find that you can relax some of these rules, but the problems in this page can trip up experts, so please, be very careful, and only relax these rules when you are very confident you understand the underlying problems. See Gory Pandas for a short walk through some of the complexities.
To make the rest of the notebook be more like older Pandas, we turn off the
mode.copy_on_write feature:
# To make Pandas in the rest of this notebook look more like Pandas < 1.5.
pd.set_option('mode.copy_on_write', False)
Old strategy rule 1: copy right.#
We strongly suggest that when you get stuff out of a Pandas DataFrame or Series by indexing, to use as a right-hand-side value, you always force Pandas to take a copy.
We call this rule copy right.
As a reminder, indexing is where we fetch data from something using square
brackets. Indexing can be: direct, with the square brackets directly
following the DataFrame or Series; or indirect, where the square brackets
follow the .loc or .iloc attributes of the DataFrame or Series.
For example, we have just used direct indexing (square brackets) to fetch the
‘Clarity’ data out of the ratings DataFrame.
# Indexing to fetch a Series from a DataFrame.
clarity = ratings['Clarity']
We earlier found that, without mode.copy_on_write, clarity is a view
onto the ‘Clarity’ data in ratings. This is rarely what we want.
Here we apply the copy right rule:
# Applying the "copy right" rule.
clearer_clarity = ratings['Clarity'].copy()
Notice we apply the .copy() method to the ‘Clarity’ Series, so forcing Pandas
to make and return a copy of the data.
Now we have done that, we can modify the result without affecting the original DataFrame, because we are changing the copy, not the original.
# Modify the copy with some crazy value.
clearer_clarity.loc['A'] = 99
clearer_clarity.head()
A 99.000000
B 3.487379
C 3.608331
D 3.909520
E 3.788818
Name: Clarity, dtype: float64
This does not affect the original DataFrame:
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 199.000000 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
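The copy right rule applies in the same way when you select rows to make a smaller DataFrame, as we did at the top of this page with all_ratings.iloc[:10]. The following is a minimal sketch using a made-up table; the names all_toy and first_two and the values are ours, for illustration only.

```python
import pandas as pd

# A made-up table, not the real ratings data.
all_toy = pd.DataFrame({'Discipline': ['English', 'Mathematics', 'Biology'],
                        'Clarity': [3.75, 3.49, 3.61]})

# "Copy right" for a DataFrame: select the rows, then force a copy.
first_two = all_toy.iloc[:2].copy()
first_two.index = ['A', 'B']
# Modify the copy.
first_two.loc['A', 'Clarity'] = 4.0

# The original table keeps its first value.
print(all_toy['Clarity'].iloc[0])
# The copy has the new value.
print(first_two.loc['A', 'Clarity'])
```

Because first_two is an explicit copy, all_toy keeps its original first value of 3.75, and Pandas has no reason to warn about the assignment.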
A digression: copies, views, confusing, warnings#
It can be very difficult to predict when Pandas indexing will give a copy or a view.
For example, here we use indirect indexing (square brackets following .loc)
to select the row of ratings with index label ‘A’. Remember .loc indexing
uses the index labels.
row_A = ratings.loc['A']
row_A
Discipline English
Number of Professors 23343
Clarity 199.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
We saw earlier that direct indexing to select a column ‘Clarity’ gave us a
view, meaning that we could change the values in the DataFrame by changing the
Series clarity we got from indexing. In fact this is also true if we use
indirect indexing with .loc or .iloc. Check this by trying
clarity = ratings.loc[:, 'Clarity'] in the code above.
We have just fetched the row labeled ‘A’ using .loc. Given what we know
about fetching a column, it would be reasonable to predict this would give us a
view.
Does it give a view? Or a copy?
# Changing the 'Clarity' value of the first row.
row_A.loc['Clarity'] = 5
row_A
/tmp/ipykernel_6218/3087147381.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
row_A.loc['Clarity'] = 5
Discipline English
Number of Professors 23343
Clarity 5
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
Notice the warning, again.
But - this time - did we change the original DataFrame?
ratings.head()
| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
A | English | 23343 | 199.000000 | 3.821866 | 3.791364 | 3.162754 |
B | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
C | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
D | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
E | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
No, we didn’t change the original DataFrame, and so we conclude that row_A is
a copy.
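Another way to probe a result, without modifying anything, is to ask NumPy whether the result shares memory with a column of the DataFrame. The following is a rough sketch on a made-up table. The helper shares_memory_with_column is a hypothetical name of our own, and the answer for the column can differ between Pandas versions and settings, so we print it rather than predict it.

```python
import numpy as np
import pandas as pd

# A made-up table, not the real ratings data.
toy = pd.DataFrame({'Discipline': ['English', 'Mathematics'],
                    'Clarity': [3.75, 3.49]},
                   index=['A', 'B'])


def shares_memory_with_column(frame, column, candidate):
    """Return True if `candidate` shares memory with `frame[column]`.

    A hypothetical helper for illustration; it is not part of Pandas.
    """
    return bool(np.shares_memory(frame[column].to_numpy(),
                                 candidate.to_numpy()))


# A column fetched with .loc, and a row fetched with .loc.
column_result = shares_memory_with_column(toy, 'Clarity',
                                          toy.loc[:, 'Clarity'])
row_result = shares_memory_with_column(toy, 'Clarity', toy.loc['A'])
print(column_result)
print(row_result)
```

Treat the answer as a hint only. With mode.copy_on_write enabled, two objects can share memory until one of them is modified, so shared memory does not by itself mean that a change will show up in the original DataFrame.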
Our first, correct, response is to follow the copy right rule, and make this copy explicit, so we know exactly what we have:
# The "copy right" rule again.
copied_row_A = ratings.loc['A'].copy()
We no longer have a nasty warning when we modify copied_row_A, because Pandas
knows we made a copy, so it does not need to warn us that we may be making a
mistake:
# We don't get a warning when we change the copied result.
copied_row_A.loc['Clarity'] = 5
copied_row_A
Discipline English
Number of Professors 23343
Clarity 5
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
Please do worry about these warnings. In fact, in the interests of safety, we come to old strategy rule 2.
Old strategy rule 2: make errors for copy/view warnings#
Pandas has a setting that allows you to change the nasty warning about setting with copies into an error.
If you can’t enable mode.copy_on_write as above, we strongly suggest that you
do enable these errors, for all your notebooks, like this:
pd.set_option('mode.chained_assignment', 'raise')
After you have set this option, Pandas will stop if you try to do something like the following:
row_A = ratings.loc['A'] # Copy? Or view? Difficult to guess.
# Now this generates an error.
row_A.loc['Clarity'] = 299
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
/tmp/ipykernel_6218/1769055592.py in ?()
1 row_A = ratings.loc['A'] # Copy? Or view? Difficult to guess.
2 # Now this generates an error.
----> 3 row_A.loc['Clarity'] = 299
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, key, value)
907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
909
910 iloc = self if self.name == "iloc" else self.obj.iloc
--> 911 iloc._setitem_with_indexer(indexer, value, self.name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
1940 if take_split_path:
1941 # We have to operate column-wise
1942 self._setitem_with_indexer_split_path(indexer, value, name)
1943 else:
-> 1944 self._setitem_single_block(indexer, value, name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
2211 if isinstance(value, ABCDataFrame) and name != "iloc":
2212 value = self._align_frame(indexer, value)._values
2213
2214 # check for chained assignment
-> 2215 self.obj._check_is_chained_assignment_possible()
2216
2217 # actually do the set
2218 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py in ?(self)
1489 ref = self._get_cacher()
1490 if ref is not None and ref._is_mixed_type:
1491 self._check_setitem_copy(t="referent", force=True)
1492 return True
-> 1493 return super()._check_is_chained_assignment_possible()
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
4395 single-dtype meaning that the cacher should be updated following
4396 setting.
4397 """
4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
4400 return False
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, t, force)
4469 "indexing.html#returning-a-view-versus-a-copy"
4470 )
4471
4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
4474 if value == "warn":
4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
At first you will find this advice annoying. Your code will generate confusing errors, and you will be tempted to remove this error option to make the errors go away. Please be patient. You will find that, if you follow the copy right rule carefully, most of these errors go away.
Another digression: copy, views, on the left#
There is more discussion of this subject in the Gory Pandas page.
If you are reading this page from start to finish, you will have already seen our discussion of chained assignment above. Here we repeat ourselves a little for the sake of our less linear readers. Consider this code:
ratings.loc['A'].loc['Clarity'] = 299
/tmp/ipykernel_6218/1657792584.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
ratings.loc['A'].loc['Clarity'] = 299
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
/tmp/ipykernel_6218/1657792584.py in ?()
----> 1 ratings.loc['A'].loc['Clarity'] = 299
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, key, value)
907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
909
910 iloc = self if self.name == "iloc" else self.obj.iloc
--> 911 iloc._setitem_with_indexer(indexer, value, self.name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
1940 if take_split_path:
1941 # We have to operate column-wise
1942 self._setitem_with_indexer_split_path(indexer, value, name)
1943 else:
-> 1944 self._setitem_single_block(indexer, value, name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
2211 if isinstance(value, ABCDataFrame) and name != "iloc":
2212 value = self._align_frame(indexer, value)._values
2213
2214 # check for chained assignment
-> 2215 self.obj._check_is_chained_assignment_possible()
2216
2217 # actually do the set
2218 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py in ?(self)
1489 ref = self._get_cacher()
1490 if ref is not None and ref._is_mixed_type:
1491 self._check_setitem_copy(t="referent", force=True)
1492 return True
-> 1493 return super()._check_is_chained_assignment_possible()
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
4395 single-dtype meaning that the cacher should be updated following
4396 setting.
4397 """
4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
4400 return False
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, t, force)
4469 "indexing.html#returning-a-view-versus-a-copy"
4470 )
4471
4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
4474 if value == "warn":
4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Because we have set the mode.chained_assignment option to 'raise' above, this
generates an error. But why?
The reason is the same as the reason for the previous error. The code in the cell directly above is just a short-cut for this exact equivalent.
tmp = ratings.loc['A']
tmp.loc['Clarity'] = 299
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
/tmp/ipykernel_6218/2016657039.py in ?()
1 tmp = ratings.loc['A']
----> 2 tmp.loc['Clarity'] = 299
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, key, value)
907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
909
910 iloc = self if self.name == "iloc" else self.obj.iloc
--> 911 iloc._setitem_with_indexer(indexer, value, self.name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
1940 if take_split_path:
1941 # We have to operate column-wise
1942 self._setitem_with_indexer_split_path(indexer, value, name)
1943 else:
-> 1944 self._setitem_single_block(indexer, value, name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
2211 if isinstance(value, ABCDataFrame) and name != "iloc":
2212 value = self._align_frame(indexer, value)._values
2213
2214 # check for chained assignment
-> 2215 self.obj._check_is_chained_assignment_possible()
2216
2217 # actually do the set
2218 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py in ?(self)
1489 ref = self._get_cacher()
1490 if ref is not None and ref._is_mixed_type:
1491 self._check_setitem_copy(t="referent", force=True)
1492 return True
-> 1493 return super()._check_is_chained_assignment_possible()
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
4395 single-dtype meaning that the cacher should be updated following
4396 setting.
4397 """
4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
4400 return False
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, t, force)
4469 "indexing.html#returning-a-view-versus-a-copy"
4470 )
4471
4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
4474 if value == "warn":
4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Specifically, when Python sees ratings.loc['A'].loc['Clarity'] = 299, it first
evaluates ratings.loc['A'] to generate a temporary copy. In the code above, we
called this temporary copy tmp. It then tries to set the value into the copy
with tmp.loc['Clarity'] = 299. This generates the same error as you saw
before.
As you have probably guessed from the option name above, Pandas calls this
chained assignment, because you are: first, fetching the stuff you want to do
the assignment on (ratings.loc['A']) and then doing the assignment
.loc['Clarity'] = 299. There are two steps on the left hand side, in a chain,
first fetching the data, then assigning.
The problem that Pandas has is that it cannot tell that this chained assignment
has happened, so it can’t tell what you mean. Python will ask Pandas to
generate ratings.loc['A'] first, which it does, to generate the temporary copy
that we can call tmp. Python then asks Pandas to set the value with
tmp.loc['Clarity'] = 299. When Pandas gets this second instruction, it has no
way of knowing that tmp came from the combined instruction
ratings.loc['A'].loc['Clarity'] = 299, and so all it can do is set the value
into the copy, as instructed.
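You can watch Python split a chained assignment into these two instructions without using Pandas at all. The following sketch uses a toy class of our own, called Recorder, which only logs the indexing calls that Python makes. It is a hypothetical stand-in for illustration; it is not how Pandas is implemented, and it uses plain square brackets rather than .loc, although Python treats the square brackets after .loc in the same way.

```python
class Recorder:
    """Toy stand-in for a DataFrame that logs indexing calls.

    A hypothetical class for illustration; it is not part of Pandas.
    """

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        # Python calls this to fetch a value with `obj[key]`, including
        # for every step of a left-hand-side chain except the last.
        self.log.append(f'{self.name}: get {key!r}')
        return Recorder('tmp', self.log)

    def __setitem__(self, key, value):
        # Python calls this for the final `[key] = value` step only.
        self.log.append(f'{self.name}: set {key!r} to {value!r}')


# Chained assignment: two square-bracket steps on the left-hand side.
chained_log = []
Recorder('ratings', chained_log)['A']['Clarity'] = 299

# One-shot assignment: a single square-bracket step.
one_shot_log = []
Recorder('ratings', one_shot_log)['A', 'Clarity'] = 299

print(chained_log)
print(one_shot_log)
```

In the chained case, the log shows a fetch from ratings followed by a set on the temporary object. In the one-shot case, there is a single set on ratings, with both labels arriving together.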
This leads us to the last rule.
Old strategy rule 3: loc left#
When you do want to use indexing on the left hand side, to set some values into
a DataFrame or Series, try to do this all in one shot, using indirect indexing
with .loc or .iloc.
For example, you have just seen that this generates an error, and why:
ratings.loc['A'].loc['Clarity'] = 299
/tmp/ipykernel_6218/1657792584.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
ratings.loc['A'].loc['Clarity'] = 299
---------------------------------------------------------------------------
SettingWithCopyError Traceback (most recent call last)
/tmp/ipykernel_6218/1657792584.py in ?()
----> 1 ratings.loc['A'].loc['Clarity'] = 299
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, key, value)
907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
909
910 iloc = self if self.name == "iloc" else self.obj.iloc
--> 911 iloc._setitem_with_indexer(indexer, value, self.name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
1940 if take_split_path:
1941 # We have to operate column-wise
1942 self._setitem_with_indexer_split_path(indexer, value, name)
1943 else:
-> 1944 self._setitem_single_block(indexer, value, name)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/indexing.py in ?(self, indexer, value, name)
2211 if isinstance(value, ABCDataFrame) and name != "iloc":
2212 value = self._align_frame(indexer, value)._values
2213
2214 # check for chained assignment
-> 2215 self.obj._check_is_chained_assignment_possible()
2216
2217 # actually do the set
2218 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py in ?(self)
1489 ref = self._get_cacher()
1490 if ref is not None and ref._is_mixed_type:
1491 self._check_setitem_copy(t="referent", force=True)
1492 return True
-> 1493 return super()._check_is_chained_assignment_possible()
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
4395 single-dtype meaning that the cacher should be updated following
4396 setting.
4397 """
4398 if self._is_copy:
-> 4399 self._check_setitem_copy(t="referent")
4400 return False
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, t, force)
4469 "indexing.html#returning-a-view-versus-a-copy"
4470 )
4471
4472 if value == "raise":
-> 4473 raise SettingWithCopyError(t)
4474 if value == "warn":
4475 warnings.warn(t, SettingWithCopyWarning, stacklevel=find_stack_level())
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
You can avoid that error by doing all your left-hand-side indexing in one shot, like this:
ratings.loc['A', 'Clarity'] = 299
ratings.loc['A']
Discipline English
Number of Professors 23343
Clarity 299.0
Helpfulness 3.821866
Overall Quality 3.791364
Easiness 3.162754
Name: A, dtype: object
Notice there is no error. This is because, in this second case, Pandas gets
all the instructions in one go. It can see from this combined instruction that
we meant to set the ‘Clarity’ value for the row labeled ‘A’ in the ratings
DataFrame, and does just this.
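The same one-shot pattern works when you select the rows with a Boolean Series, and when you use positions with .iloc. Here is a minimal sketch on a made-up table; the toy DataFrame, its values and the threshold of 3.7 are ours, chosen only for illustration.

```python
import pandas as pd

# A made-up table, not the real ratings data.
toy = pd.DataFrame({'Discipline': ['English', 'Mathematics', 'Biology'],
                    'Clarity': [3.75, 3.49, 3.61]},
                   index=['A', 'B', 'C'])

# Label-based, one shot: rows chosen by a Boolean Series, column by label.
toy.loc[toy['Clarity'] < 3.7, 'Clarity'] = 0.0

# Position-based, one shot: the first row, and the position of the column.
clarity_position = toy.columns.get_loc('Clarity')
toy.iloc[0, clarity_position] = 4.0

print(toy['Clarity'].tolist())
```

Both assignments name the rows and the column inside a single pair of square brackets, so there is no intermediate object for Pandas to copy.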
Old strategy summary: keep calm, follow the three rules#
Do not worry if some of this is not immediately clear; it is not easy.
The trick is to remember the three rules:
Copy right.
Make copy warnings into errors.
Use .loc and .iloc for your left-hand-side indexing.