Numbers and strings#
This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.
An example#
import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)
We return to the example data file that you may have seen in the text encoding page.
You can download the data file from imdblet_latin.csv
.
films = pd.read_csv('imdblet_latin.csv', encoding='latin1')
films.head()
Votes | Rating | Title | Year | Decade | |
---|---|---|---|---|---|
0 | 635139 | 8.6 | Léon | 1994 | 1990 |
1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
2 | 43090 | N/K | Paris, Texas (1984) | 1984 | 1980 |
3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |
Now imagine we are interested in the average rating across these films:
ratings = films['Rating']
ratings.mean()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 2
1 ratings = films['Rating']
----> 2 ratings.mean()
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:6549, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
6541 @doc(make_doc("mean", ndim=1))
6542 def mean(
6543 self,
(...)
6547 **kwargs,
6548 ):
-> 6549 return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
12413 def mean(
12414 self,
12415 axis: Axis | None = 0,
(...)
12418 **kwargs,
12419 ) -> Series | float:
> 12420 return self._stat_function(
12421 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
12422 )
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
12373 nv.validate_func(name, (), kwargs)
12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12377 return self._reduce(
12378 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
12379 )
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:6457, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
6452 # GH#47500 - change to TypeError to match other methods
6453 raise TypeError(
6454 f"Series.{name} does not allow {kwd_name}={numeric_only} "
6455 "with non-numeric dtypes."
6456 )
-> 6457 return op(delegate, skipna=skipna, **kwds)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
723 count = cast(np.ndarray, count)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
1699 if isinstance(x, str):
1700 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701 raise TypeError(f"Could not convert string '{x}' to numeric")
1702 try:
1703 x = float(x)
TypeError: Could not convert string '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1' to numeric
The problem#
The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.
We can see what type of thing Pandas has stored by looking at the dtype
attribute of a Series, or the dtypes
attribute of a data frame.
films.dtypes
Votes object
Rating object
Title object
Year int64
Decade int64
dtype: object
ratings.dtype
dtype('O')
In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.
Why does Pandas use text for the ‘Rating’ column?
A quick look at the first rows gives the answer:
ratings.head()
0 8.6
1 8.1
2 N/K
3 8.3
4 8.0
Name: Rating, dtype: object
The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.
If that wasn’t obvious, another way of checking where the problem value is, to apply
the function float
to the column values.
When we apply
a function to a Series, it does this:
For each value in the Series it:
Calls the function, with the value as the single argument.
Collects the new value returned from the function, and appends it to a new Series.
Returns the new Series.
The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.
Recall that the float
function converts the thing you pass into a floating
point value:
v = float('3.14')
v
3.14
type(v)
float
Now we try applying float
to the problematic column:
ratings.apply(float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ratings.apply(float)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:4924, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
4789 def apply(
4790 self,
4791 func: AggFuncType,
(...)
4796 **kwargs,
4797 ) -> DataFrame | Series:
4798 """
4799 Invoke function on values of Series.
4800
(...)
4915 dtype: float64
4916 """
4917 return SeriesApply(
4918 self,
4919 func,
4920 convert_dtype=convert_dtype,
4921 by_row=by_row,
4922 args=args,
4923 kwargs=kwargs,
-> 4924 ).apply()
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/apply.py:1427, in SeriesApply.apply(self)
1424 return self.apply_compat()
1426 # self.func is Callable
-> 1427 return self.apply_standard()
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/apply.py:1507, in SeriesApply.apply_standard(self)
1501 # row-wise access
1502 # apply doesn't have a `na_action` keyword and for backward compat reasons
1503 # we need to give `na_action="ignore"` for categorical data.
1504 # TODO: remove the `na_action="ignore"` when that default has been changed in
1505 # Categorical (GH51645).
1506 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1507 mapped = obj._map_values(
1508 mapper=curried, na_action=action, convert=self.convert_dtype
1509 )
1511 if len(mapped) and isinstance(mapped[0], ABCSeries):
1512 # GH#43986 Need to do list(mapped) in order to get treated as nested
1513 # See also GH#25959 regarding EA support
1514 return obj._constructor_expanddim(list(mapped), index=obj.index)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
918 if isinstance(arr, ExtensionArray):
919 return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
1741 values = arr.astype(object, copy=False)
1742 if na_action is None:
-> 1743 return lib.map_infer(values, mapper, convert=convert)
1744 else:
1745 return lib.map_infer_mask(
1746 values, mapper, mask=isna(values).view(np.uint8), convert=convert
1747 )
File lib.pyx:2972, in pandas._libs.lib.map_infer()
ValueError: could not convert string to float: 'N/K'
One way of dealing with this problem is to make a recoding function.
A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.
def recode_ratings(v):
if v == 'N/K': # Return missing value for 'N/K'
return np.nan
# Otherwise make text value into a float
return float(v)
We test our function:
recode_ratings('8.3')
8.3
recode_ratings('N/K')
nan
We make a new Series by calling the recode function:
new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
0 8.6
1 8.1
2 NaN
3 8.3
4 8.0
Name: Rating, dtype: float64
We can insert this back into a copy of the original data frame:
films_fixed = films
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()
Votes | Rating | Title | Year | Decade | |
---|---|---|---|---|---|
0 | 635139 | 8.6 | Léon | 1994 | 1990 |
1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
2 | 43090 | NaN | Paris, Texas (1984) | 1984 | 1980 |
3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |