Value counts#

Pandas Series have a useful method that counts how many times each unique value occurs in the Series.

# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

Here we load a dataset with information on the passengers and crew on the RMS Titanic.

titanic = pd.read_csv('titanic_stlearn.csv')
titanic.head()

	name	gender	age	class	embarked	country	ticketno	fare	sibsp	parch	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	Southampton	United States	5547.0	7.11	0.0	0.0	no
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	Southampton	United States	2673.0	20.05	0.0	2.0	no
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	Southampton	United States	2673.0	20.05	1.0	1.0	no
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	Southampton	England	2673.0	20.05	1.0	1.0	yes
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	Southampton	Norway	348125.0	7.13	0.0	0.0	yes

Each row represents one person who was on the Titanic when she sank on 15th April 1912. Each column has a particular piece of data for those people. For example, the embarked column has the port at which each passenger or crew member joined the ship. The Titanic was built and launched in Belfast. She sailed from Belfast to Southampton on 2nd April, where she picked up many passengers and the rest of her crew. On 10th April, she started her maiden and only voyage, from Southampton to Cherbourg, and then Queenstown, in the south of Ireland, to pick up more passengers. She set off from Queenstown for New York on 11th April.

Here is the Series representing the embarked column of the DataFrame:

embarked = titanic['embarked']
embarked

0       Southampton
1       Southampton
2       Southampton
3       Southampton
4       Southampton
           ...     
2202        Belfast
2203    Southampton
2204    Southampton
2205    Southampton
2206    Southampton
Name: embarked, Length: 2207, dtype: object

Indexing Series with Boolean Series#

We might want to know how many people boarded at each of these ports. For example, we might want to know how many people joined the Titanic at Belfast. We could do this with a Boolean Series, like this:

Make a Boolean Series with True where embarked == 'Belfast'.
Use the sum method of the Boolean Series to count how many True values there are.

# Make a Boolean Series with True for Belfast.
is_belfast = embarked == 'Belfast'
is_belfast

0       False
1       False
2       False
3       False
4       False
        ...  
2202     True
2203    False
2204    False
2205    False
2206    False
Name: embarked, Length: 2207, dtype: bool

Series have a sum method that sums up the values in the Series. As you may remember from np.sum and arrays, True counts as 1 and False counts as 0, so the sum of the Series values is the same as the number of True values in the Series:

# How many Belfast values are there?
is_belfast.sum()
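
As a small self-contained check of this idea, the sketch below uses a made-up Series of ports and confirms that summing the Boolean Series gives the number of matches. The names toy_ports and toy_is_belfast are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# A made-up Series standing in for the 'embarked' column.
toy_ports = pd.Series(['Belfast', 'Southampton', 'Belfast', 'Cherbourg'])

# True where the value is 'Belfast', False otherwise.
toy_is_belfast = toy_ports == 'Belfast'

# True counts as 1 and False as 0, so the sum is the number of True values.
n_belfast = toy_is_belfast.sum()
print(n_belfast)
```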

Value counts#

The approach above works for Belfast, and of course we could apply the same procedure for each port — Southampton, Cherbourg, Queenstown — but that starts to get verbose.
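
To see why this gets verbose, here is a rough sketch of repeating the Boolean Series procedure for every port by hand. It uses a small made-up Series (toy_ports) rather than the real data, and assumes we already know the full list of port names.

```python
import pandas as pd

# A made-up Series standing in for the 'embarked' column.
toy_ports = pd.Series(['Southampton', 'Belfast', 'Southampton',
                       'Cherbourg', 'Queenstown', 'Southampton'])

# Repeat the "Boolean Series, then sum" procedure once per port.
manual_counts = {}
for port in ['Southampton', 'Cherbourg', 'Belfast', 'Queenstown']:
    is_port = toy_ports == port
    manual_counts[port] = int(is_port.sum())

print(manual_counts)
```

Notice that we had to type out the port names ourselves; if we missed one, we would silently miss its count.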

Luckily, the value_counts method of a Series does this job for us. It:

Identifies all the unique values in the Series, and then
Counts how many of each of these values exist in the Series.

emb_counts = embarked.value_counts()
emb_counts

embarked
Southampton    1616
Cherbourg       271
Belfast         197
Queenstown      123
Name: count, dtype: int64

In our case, value_counts has:

Found all the unique ports in embarked and
Counted how many values there are for each port.

Notice that it sorts the unique values by the counts, highest count first.
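
If we would rather see the counts ordered by label than by count, one option is to call the sort_index method on the result. The sketch below shows both orderings on a small made-up Series (toy_ports is an invented name, not part of the Titanic data).

```python
import pandas as pd

# A made-up Series standing in for the 'embarked' column.
toy_ports = pd.Series(['Cherbourg', 'Southampton', 'Belfast',
                       'Southampton', 'Cherbourg', 'Southampton'])

# Default ordering: by count, highest count first.
by_count = toy_ports.value_counts()
print(by_count)

# Re-order the result alphabetically by its labels instead.
by_label = toy_ports.value_counts().sort_index()
print(by_label)
```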

The result of `value_counts` is a new Series#

Notice that the result that comes back from value_counts is itself a Series:

type(emb_counts)

pandas.core.series.Series

As for any Series, it has values and labels (the “index”):

# The values
emb_counts.values

array([1616,  271,  197,  123])

# The labels
emb_counts.index

Index(['Southampton', 'Cherbourg', 'Belfast', 'Queenstown'], dtype='object', name='embarked')
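
Because the labels are the unique values from the original Series, we can use a label to fetch the count for that value. The following is a minimal sketch on a made-up Series; toy_ports and toy_counts are invented names, and the use of loc, max and idxmax here is our illustration rather than something the text above relies on.

```python
import pandas as pd

# A made-up Series standing in for the 'embarked' column.
toy_ports = pd.Series(['Southampton', 'Belfast', 'Southampton', 'Cherbourg'])
toy_counts = toy_ports.value_counts()

# Fetch the count for one value by its label.
n_southampton = toy_counts.loc['Southampton']
print(n_southampton)

# The largest count, and the label that goes with it.
largest = toy_counts.max()
most_common = toy_counts.idxmax()
print(most_common, largest)
```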

Like all Series, the result of value_counts has a sum method that gives the sum of the values:

emb_counts.sum()

In our case, because each person has a recorded (not-missing) embarkation port, the sum of the counts is the same as the number of people (values) in the original embarked Series:

len(embarked)

However, this is not always true. By default, value_counts leaves missing values out of its counts, so when a Series has missing values, the value counts will sum to less than the number of values in the Series.

For example, consider the column giving the nationality of each person:

countries = titanic['country']
countries

0       United States
1       United States
2       United States
3             England
4              Norway
            ...      
2202          England
2203          England
2204          England
2205          England
2206          England
Name: country, Length: 2207, dtype: object

Here we count how many people there were of each nationality:

c_counts = countries.value_counts()
c_counts

country
England                  1125
United States             264
Ireland                   137
Sweden                    105
Lebanon                    71
Finland                    54
Scotland                   36
Canada                     34
Norway                     26
France                     26
Belgium                    22
Northern Ireland           21
Wales                      20
Bulgaria                   19
Switzerland                18
Channel Islands            17
Croatia (Modern)           12
Croatia                    11
Italy                      11
Spain                       9
India                       8
Hungary                     7
Denmark                     7
Argentina                   7
South Africa                6
Germany                     6
Turkey                      6
Australia                   5
Slovenia                    4
Bosnia                      4
Poland                      3
Austria                     3
Netherlands                 2
Greece                      2
Uruguay                     2
Russia                      2
Peru                        2
Siam                        2
Syria                       1
China/Hong Kong             1
Japan                       1
Latvia                      1
Yugoslavia                  1
Slovakia (Modern day)       1
Egypt                       1
Cuba                        1
Mexico                      1
Guyana                      1
Name: count, dtype: int64

In this case, there were some people whose nationality we do not know; they have missing values for country. For that reason, the sum of the value counts is less than the number of values.

c_count_sum = c_counts.sum()
c_count_sum

That means that there were a few missing values:

n_missing = len(countries) - c_count_sum
n_missing
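
As a cross-check on this subtraction, the sketch below counts the missing values in two other ways on a small made-up Series: with the isna method, and by asking value_counts to keep missing values using dropna=False. The Series toy_countries and the other names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# A made-up Series with two missing values.
toy_countries = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# By default, value_counts leaves missing values out.
counts = toy_countries.value_counts()

# Number missing, by subtraction, as in the text above.
n_missing_by_subtraction = len(toy_countries) - counts.sum()

# Number missing, counted directly from a Boolean Series.
n_missing_direct = toy_countries.isna().sum()

# dropna=False asks value_counts to count the missing values too.
counts_with_missing = toy_countries.value_counts(dropna=False)

print(n_missing_by_subtraction, n_missing_direct)
print(counts_with_missing)
```

With dropna=False, the counts add up to the full length of the Series again.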

`count` counts the number of not-missing values#

The count method of a Series does this count directly. That is, it counts the number of values that are not missing.

countries.count()

countries.count() == c_count_sum

True
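
To make the difference between count and len concrete, here is a short sketch on a made-up Series with two missing values. The names are invented for illustration; notna is a Series method that gives True for each value that is not missing.

```python
import numpy as np
import pandas as pd

# A made-up Series with five values, two of them missing.
toy_countries = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# len counts every value, missing or not.
n_values = len(toy_countries)  # 5

# count only counts the values that are not missing.
n_not_missing = toy_countries.count()  # 3

# The same number, from summing a Boolean Series.
n_via_notna = toy_countries.notna().sum()

print(n_values, n_not_missing, n_via_notna)
```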
