Value counts#

Pandas Series have a useful method to count how many times each unique value occurs in the Series.

# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

Here we load a dataset with information on the passengers and crew on the RMS Titanic.

titanic = pd.read_csv('titanic_stlearn.csv')
titanic.head()
                             name  gender   age  class     embarked        country  ticketno   fare  sibsp  parch  survived
0             Abbing, Mr. Anthony    male  42.0    3rd  Southampton  United States    5547.0   7.11    0.0    0.0        no
1       Abbott, Mr. Eugene Joseph    male  13.0    3rd  Southampton  United States    2673.0  20.05    0.0    2.0        no
2     Abbott, Mr. Rossmore Edward    male  16.0    3rd  Southampton  United States    2673.0  20.05    1.0    1.0        no
3  Abbott, Mrs. Rhoda Mary 'Rosa'  female  39.0    3rd  Southampton        England    2673.0  20.05    1.0    1.0       yes
4     Abelseth, Miss. Karen Marie  female  16.0    3rd  Southampton         Norway  348125.0   7.13    0.0    0.0       yes

Each row represents one person who was on the Titanic when she sank on 15th April 1912. Each column has a particular piece of data for those people. For example, the embarked column has the port at which each passenger or crew member joined the ship. The Titanic was built and launched in Belfast. She sailed from Belfast to Southampton on 2nd April. At Southampton she picked up many passengers and the rest of her crew. On 10th April, she started her maiden and only voyage, sailing from Southampton to Cherbourg, and then to Queenstown, in the south of Ireland, to pick up more passengers. She set off from Queenstown for New York on 11th April.

Here is the Series representing the embarked column of the DataFrame:

embarked = titanic['embarked']
embarked
0       Southampton
1       Southampton
2       Southampton
3       Southampton
4       Southampton
           ...     
2202        Belfast
2203    Southampton
2204    Southampton
2205    Southampton
2206    Southampton
Name: embarked, Length: 2207, dtype: object

Indexing Series with Boolean Series#

We might want to know how many people boarded at each of these ports. For example, we might want to know how many people joined the Titanic at Belfast. We could do this with a Boolean Series, like this:

  • Make a Boolean Series with True where embarked == 'Belfast'.

  • Use the .sum method of the Boolean Series to count how many True values there were.

# Make a Boolean Series with True for Belfast.
is_belfast = embarked == 'Belfast'
is_belfast
0       False
1       False
2       False
3       False
4       False
        ...  
2202     True
2203    False
2204    False
2205    False
2206    False
Name: embarked, Length: 2207, dtype: bool

Series have a sum method that sums up the values in the Series. As you may remember from np.sum and arrays, True counts as 1 and False counts as 0, so the sum of the Series values is the same as the number of True values in the Series:

# How many Belfast values are there?
is_belfast.sum()
197

Value counts#

The approach above works for Belfast, and of course we could apply the same procedure for each port — Southampton, Cherbourg, Queenstown — but that starts to get verbose.
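
To see why this gets verbose, here is a minimal sketch of the same Boolean-Series procedure repeated for every port. It uses a small made-up Series (`toy_embarked`) as a stand-in for the real embarked column, so the counts are illustrative only; the names `toy_embarked` and `manual_counts` are ours, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the embarked column (made-up values, not the real data).
toy_embarked = pd.Series(['Southampton', 'Belfast', 'Southampton',
                          'Cherbourg', 'Southampton', 'Queenstown'])

# Repeat the Boolean-Series-and-sum procedure once per port.
manual_counts = {}
for port in ['Southampton', 'Cherbourg', 'Belfast', 'Queenstown']:
    # True where the value matches this port.
    is_port = toy_embarked == port
    # True counts as 1, False as 0, so the sum is the number of matches.
    manual_counts[port] = int(is_port.sum())

manual_counts
```

Notice that we had to know the list of ports in advance, and write a loop to go through them.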

Luckily, the value_counts method of a Series does this job for us. It:

  • Identifies all the unique values in the Series, and then

  • Counts how many of each of these values exist in the Series.

emb_counts = embarked.value_counts()
emb_counts
embarked
Southampton    1616
Cherbourg       271
Belfast         197
Queenstown      123
Name: count, dtype: int64

In our case, value_counts has:

  • Found all the unique ports in embarked and

  • Counted how many values there are for each port.

Notice that it sorts the unique values by the counts, highest count first.
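
If you want a different ordering, the sketch below shows two options on a small made-up Series (`toy_embarked` is an illustrative stand-in, not the real data). The `ascending` argument of value_counts reverses the order of the counts, and the `sort_index` method of the resulting Series orders by label instead.

```python
import pandas as pd

# Toy stand-in for the embarked column (made-up values, not the real data).
toy_embarked = pd.Series(['Southampton', 'Belfast', 'Southampton',
                          'Cherbourg', 'Southampton', 'Belfast'])

# Default: sorted by count, highest count first.
highest_first = toy_embarked.value_counts()

# ascending=True: sorted by count, lowest count first.
lowest_first = toy_embarked.value_counts(ascending=True)

# Sort the result by its labels (here, alphabetically) instead of by count.
by_label = toy_embarked.value_counts().sort_index()

highest_first
```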

The result of value_counts is a new Series#

Notice that the result that comes back from value_counts is itself a Series:

type(emb_counts)
pandas.core.series.Series

As for any Series, it has values and labels (the “index”):

# The values
emb_counts.values
array([1616,  271,  197,  123])
# The labels
emb_counts.index
Index(['Southampton', 'Cherbourg', 'Belfast', 'Queenstown'], dtype='object', name='embarked')
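
Because the labels are the unique values, we can use them to fetch the count for one particular value. The following is a minimal sketch on a made-up Series (`toy_embarked` and `toy_counts` are illustrative names, not the real data); it also checks that the result matches the Boolean Series approach.

```python
import pandas as pd

# Toy stand-in for the embarked column (made-up values, not the real data).
toy_embarked = pd.Series(['Southampton', 'Belfast', 'Southampton',
                          'Cherbourg', 'Southampton', 'Belfast'])
toy_counts = toy_embarked.value_counts()

# Fetch the count for one port by its label.
n_belfast = toy_counts.loc['Belfast']

# The same count, from summing a Boolean Series.
n_belfast_bool = (toy_embarked == 'Belfast').sum()

n_belfast == n_belfast_bool
```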

Like all Series, the result of value_counts has a sum method that gives the sum of the values:

emb_counts.sum()
2207

In our case, because each person does have a recorded (not-missing) embarkation port, the sum of the counts is the same as the number of people (values) in the original embarked Series:

len(embarked)
2207

However, this is not always true. By default, value_counts leaves missing values out of its counts, so when a Series has missing values, the sum of the value counts will be less than the number of values in the Series.

For example, consider the column giving the nationality of each person:

countries = titanic['country']
countries
0       United States
1       United States
2       United States
3             England
4              Norway
            ...      
2202          England
2203          England
2204          England
2205          England
2206          England
Name: country, Length: 2207, dtype: object

Here we count how many people there were of each nationality:

c_counts = countries.value_counts()
c_counts
country
England                  1125
United States             264
Ireland                   137
Sweden                    105
Lebanon                    71
Finland                    54
Scotland                   36
Canada                     34
Norway                     26
France                     26
Belgium                    22
Northern Ireland           21
Wales                      20
Bulgaria                   19
Switzerland                18
Channel Islands            17
Croatia (Modern)           12
Croatia                    11
Italy                      11
Spain                       9
India                       8
Hungary                     7
Denmark                     7
Argentina                   7
South Africa                6
Germany                     6
Turkey                      6
Australia                   5
Slovenia                    4
Bosnia                      4
Poland                      3
Austria                     3
Netherlands                 2
Greece                      2
Uruguay                     2
Russia                      2
Peru                        2
Siam                        2
Syria                       1
China/Hong Kong             1
Japan                       1
Latvia                      1
Yugoslavia                  1
Slovakia (Modern day)       1
Egypt                       1
Cuba                        1
Mexico                      1
Guyana                      1
Name: count, dtype: int64

In this case, we do not know the nationality of some of the people, so they have missing values for country. For that reason, the sum of the value counts is less than the number of values:

c_count_sum = c_counts.sum()
c_count_sum
2126

That means that there were a few missing values:

n_missing = len(countries) - c_count_sum
n_missing
81
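
If you would like the missing values to appear in the counts, value_counts has a `dropna` argument. Here is a minimal sketch using a small made-up Series with two missing values (`toy_countries` is an illustrative stand-in, not the real country column).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the country column, with two missing values (made-up data).
toy_countries = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# Default (dropna=True): missing values are left out of the counts.
default_counts = toy_countries.value_counts()

# dropna=False: missing values get their own row (labeled NaN) in the counts.
counts_with_missing = toy_countries.value_counts(dropna=False)

# With the missing values included, the counts add up to the number of values.
counts_with_missing.sum() == len(toy_countries)
```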

count counts the number of not-missing values#

The count method of a Series does this count directly. That is, it counts the number of values that are not missing:

countries.count()
2126
countries.count() == c_count_sum
True
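
As a cross-check, we can count the missing values directly. The sketch below uses the `isna` method of a Series, which this page has not covered; it returns a Boolean Series with True for each missing value. The Series `toy_countries` is made-up data for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the country column, with two missing values (made-up data).
toy_countries = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# Number of values that are not missing.
n_not_missing = toy_countries.count()

# Number of missing values: sum the Boolean Series from isna.
n_missing = toy_countries.isna().sum()

# Together they account for every value in the Series.
n_not_missing + n_missing == len(toy_countries)
```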