Numbers and strings

Numbers and strings#

This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.

An example#

import numpy as np
import pandas as pd
pd.set_option('mode.copy_on_write', True)

We return to the example data file that you may have seen in the text encoding page.

You can download the data file from imdblet_latin.csv.

films = pd.read_csv('imdblet_latin.csv', encoding='latin1')
films.head()
Votes Rating Title Year Decade
0 635139 8.6 Léon 1994 1990
1 264285 8.1 The Princess Bride 1987 1980
2 43090 N/K Paris, Texas (1984) 1984 1980
3 90434 8.3 Rashômon 1950 1950
4 427099 8.0 X-Men: Days of Future Past 2014 2010

Now imagine we are interested in the average rating across these films:

ratings = films['Rating']
ratings.mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 2
      1 ratings = films['Rating']
----> 2 ratings.mean()

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:6549, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6541 @doc(make_doc("mean", ndim=1))
   6542 def mean(
   6543     self,
   (...)
   6547     **kwargs,
   6548 ):
-> 6549     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12413 def mean(
  12414     self,
  12415     axis: Axis | None = 0,
   (...)
  12418     **kwargs,
  12419 ) -> Series | float:
> 12420     return self._stat_function(
  12421         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12422     )

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12373 nv.validate_func(name, (), kwargs)
  12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12377 return self._reduce(
  12378     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12379 )

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:6457, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6452     # GH#47500 - change to TypeError to match other methods
   6453     raise TypeError(
   6454         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6455         "with non-numeric dtypes."
   6456     )
-> 6457 return op(delegate, skipna=skipna, **kwds)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):
    723     count = cast(np.ndarray, count)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
   1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
   1699     if isinstance(x, str):
   1700         # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701         raise TypeError(f"Could not convert string '{x}' to numeric")
   1702     try:
   1703         x = float(x)

TypeError: Could not convert string '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1' to numeric

The problem#

The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.

We can see what type of thing Pandas has stored by looking at the dtype attribute of a Series, or the dtypes attribute of a data frame.

films.dtypes
Votes     object
Rating    object
Title     object
Year       int64
Decade     int64
dtype: object
ratings.dtype
dtype('O')

In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.

Why does Pandas use text for the ‘Rating’ column?

A quick look at the first rows gives the answer:

ratings.head()
0    8.6
1    8.1
2    N/K
3    8.3
4    8.0
Name: Rating, dtype: object

The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.

If that wasn’t obvious, another way of checking where the problem value is, to apply the function float to the column values.

When we apply a function to a Series, it does this:

  • For each value in the Series it:

    • Calls the function, with the value as the single argument.

    • Collects the new value returned from the function, and appends it to a new Series.

  • Returns the new Series.

The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.

Recall that the float function converts the thing you pass into a floating point value:

v = float('3.14')
v
3.14
type(v)
float

Now we try applying float to the problematic column:

ratings.apply(float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ratings.apply(float)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/series.py:4924, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4789 def apply(
   4790     self,
   4791     func: AggFuncType,
   (...)
   4796     **kwargs,
   4797 ) -> DataFrame | Series:
   4798     """
   4799     Invoke function on values of Series.
   4800 
   (...)
   4915     dtype: float64
   4916     """
   4917     return SeriesApply(
   4918         self,
   4919         func,
   4920         convert_dtype=convert_dtype,
   4921         by_row=by_row,
   4922         args=args,
   4923         kwargs=kwargs,
-> 4924     ).apply()

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/apply.py:1427, in SeriesApply.apply(self)
   1424     return self.apply_compat()
   1426 # self.func is Callable
-> 1427 return self.apply_standard()

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/apply.py:1507, in SeriesApply.apply_standard(self)
   1501 # row-wise access
   1502 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1503 # we need to give `na_action="ignore"` for categorical data.
   1504 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1505 #  Categorical (GH51645).
   1506 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1507 mapped = obj._map_values(
   1508     mapper=curried, na_action=action, convert=self.convert_dtype
   1509 )
   1511 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1512     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1513     #  See also GH#25959 regarding EA support
   1514     return obj._constructor_expanddim(list(mapped), index=obj.index)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    918 if isinstance(arr, ExtensionArray):
    919     return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:
   1745     return lib.map_infer_mask(
   1746         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1747     )

File lib.pyx:2972, in pandas._libs.lib.map_infer()

ValueError: could not convert string to float: 'N/K'

One way of dealing with this problem is to make a recoding function.

A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.

def recode_ratings(v):
    if v == 'N/K':  # Return missing value for 'N/K'
        return np.nan
    # Otherwise make text value into a float
    return float(v)

We test our function:

recode_ratings('8.3')
8.3
recode_ratings('N/K')
nan

We make a new Series by calling the recode function:

new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
0    8.6
1    8.1
2    NaN
3    8.3
4    8.0
Name: Rating, dtype: float64

We can insert this back into a copy of the original data frame:

films_fixed = films
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()
Votes Rating Title Year Decade
0 635139 8.6 Léon 1994 1990
1 264285 8.1 The Princess Bride 1987 1980
2 43090 NaN Paris, Texas (1984) 1984 1980
3 90434 8.3 Rashômon 1950 1950
4 427099 8.0 X-Men: Days of Future Past 2014 2010