Value counts#
Pandas Series have a useful method to count the number of each unique value in that Series.
# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)
Here we load a dataset with information on the passengers and crew on the RMS Titanic.
titanic = pd.read_csv('titanic_stlearn.csv')
titanic.head()
 | name | gender | age | class | embarked | country | ticketno | fare | sibsp | parch | survived |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | Southampton | United States | 5547.0 | 7.11 | 0.0 | 0.0 | no |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | Southampton | United States | 2673.0 | 20.05 | 0.0 | 2.0 | no |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | Southampton | United States | 2673.0 | 20.05 | 1.0 | 1.0 | no |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | Southampton | England | 2673.0 | 20.05 | 1.0 | 1.0 | yes |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | Southampton | Norway | 348125.0 | 7.13 | 0.0 | 0.0 | yes |
Each row represents one person who was on the Titanic when she sank on 15th
April 1912. Each column has a particular piece of data for those people. For
example, the embarked column has the port at which each passenger or crew
member joined the ship. The Titanic was built and launched in Belfast. She
sailed from Belfast to Southampton on the 2nd April, where she picked up many
passengers, and the rest of her crew. On 10th April, she started her maiden
and only voyage, from Southampton to Cherbourg, and then Queenstown, in the
south of Ireland, to pick up more passengers. She set off from Queenstown for
New York on the 11th April.
Here is the Series representing the embarked column of the DataFrame:
embarked = titanic['embarked']
embarked
0 Southampton
1 Southampton
2 Southampton
3 Southampton
4 Southampton
...
2202 Belfast
2203 Southampton
2204 Southampton
2205 Southampton
2206 Southampton
Name: embarked, Length: 2207, dtype: object
Indexing Series with Boolean Series#
We might want to know how many people boarded at each of these ports. For example, we might want to know how many people joined the Titanic at Belfast. We could do this with a Boolean Series, like this:
Make a Boolean Series with True where embarked == 'Belfast'.
Use the .sum method of the Boolean Series to count how many True values there were.
# Make a Boolean Series with True for Belfast.
is_belfast = embarked == 'Belfast'
is_belfast
0 False
1 False
2 False
3 False
4 False
...
2202 True
2203 False
2204 False
2205 False
2206 False
Name: embarked, Length: 2207, dtype: bool
Series have a sum method that sums up the values in the Series. As you may remember from np.sum and arrays, True counts as 1 and False counts as 0, so the sum of the Series values is the same as the number of True values in the Series:
# How many Belfast values are there?
is_belfast.sum()
197
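To see the True-as-1 rule on something small enough to check by eye, here is a minimal sketch. The `ports` Series is a made-up, six-value stand-in for the embarked column, not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the embarked column.
ports = pd.Series(['Southampton', 'Belfast', 'Southampton',
                   'Cherbourg', 'Belfast', 'Southampton'])

# Boolean Series: True where the port is Belfast.
is_belfast_small = ports == 'Belfast'

# True counts as 1 and False as 0, so the sum is the number of True values.
n_belfast = is_belfast_small.sum()
print(n_belfast)
```

There are two 'Belfast' values in the made-up Series, so the sum should be 2.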
Value counts#
The approach above works for Belfast, and of course we could apply the same procedure for each port — Southampton, Cherbourg, Queenstown — but that starts to get verbose.
Luckily, the value_counts method of a Series does this job for us. It:
Identifies all the unique values in the Series, and then
Counts how many of each of these values exist in the Series.
emb_counts = embarked.value_counts()
emb_counts
embarked
Southampton 1616
Cherbourg 271
Belfast 197
Queenstown 123
Name: count, dtype: int64
In our case, value_counts has:
Found all the unique ports in embarked, and
Counted how many values there are for each port.
Notice that it sorts the unique values by the counts, highest count first.
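As a rough check that value_counts matches the Boolean Series procedure, the sketch below does the counting both ways. It uses a made-up `ports` Series rather than the Titanic data, and it assumes the `unique` method of a Series, which this page has not otherwise introduced, to list the distinct values.

```python
import pandas as pd

# Hypothetical miniature stand-in for the embarked column.
ports = pd.Series(['Southampton', 'Belfast', 'Southampton',
                   'Cherbourg', 'Belfast', 'Southampton'])

# The verbose way: one Boolean Series and one sum for each unique port.
manual_counts = {}
for port in ports.unique():
    manual_counts[port] = int((ports == port).sum())

# The short way: value_counts does the same work in one call.
small_counts = ports.value_counts()

print(manual_counts)
print(small_counts)
```

Both approaches should give the same count for each port, with the value_counts result ordered from the highest count to the lowest.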
The result of value_counts is a new Series#
Notice that the result that comes back from value_counts is itself a Series:
type(emb_counts)
pandas.core.series.Series
As for any Series, it has values and labels (the “index”):
# The values
emb_counts.values
array([1616, 271, 197, 123])
# The labels
emb_counts.index
Index(['Southampton', 'Cherbourg', 'Belfast', 'Queenstown'], dtype='object', name='embarked')
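Because the labels are the unique values themselves, one practical consequence is that we can look up a single count by its label. The sketch below shows this on a made-up `ports` Series. It assumes label-based indexing with `.loc`, which this page has not yet covered.

```python
import pandas as pd

# Hypothetical miniature stand-in for the embarked column.
ports = pd.Series(['Southampton', 'Belfast', 'Southampton',
                   'Cherbourg', 'Belfast', 'Southampton'])
small_counts = ports.value_counts()

# The labels (index) are the unique ports, so we can ask for the
# count that goes with one particular label.
n_southampton = small_counts.loc['Southampton']
n_cherbourg = small_counts.loc['Cherbourg']
print(n_southampton, n_cherbourg)
```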
Like all Series, the result of value_counts has a sum method that gives the sum of the values:
emb_counts.sum()
2207
In our case, because each person has a recorded (not-missing) embarkation port, the sum of the counts is the same as the number of people (values) in the original embarked Series:
len(embarked)
2207
However, this is not always true. Sometimes there are missing values, and in that case the sum of the value counts will be less than the number of values in the Series.
For example, consider the column giving the nationality of each person:
countries = titanic['country']
countries
0 United States
1 United States
2 United States
3 England
4 Norway
...
2202 England
2203 England
2204 England
2205 England
2206 England
Name: country, Length: 2207, dtype: object
Here we count how many people there were of each nationality:
c_counts = countries.value_counts()
c_counts
country
England 1125
United States 264
Ireland 137
Sweden 105
Lebanon 71
Finland 54
Scotland 36
Canada 34
Norway 26
France 26
Belgium 22
Northern Ireland 21
Wales 20
Bulgaria 19
Switzerland 18
Channel Islands 17
Croatia (Modern) 12
Croatia 11
Italy 11
Spain 9
India 8
Hungary 7
Denmark 7
Argentina 7
South Africa 6
Germany 6
Turkey 6
Australia 5
Slovenia 4
Bosnia 4
Poland 3
Austria 3
Netherlands 2
Greece 2
Uruguay 2
Russia 2
Peru 2
Siam 2
Syria 1
China/Hong Kong 1
Japan 1
Latvia 1
Yugoslavia 1
Slovakia (Modern day) 1
Egypt 1
Cuba 1
Mexico 1
Guyana 1
Name: count, dtype: int64
In this case, there were some people whose nationality we do not know; they have missing values for country. For that reason, the sum of the value counts is less than the number of values.
c_count_sum = c_counts.sum()
c_count_sum
2126
That means that there were a few missing values:
n_missing = len(countries) - c_count_sum
n_missing
81
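The same arithmetic can be checked on a tiny example. The sketch below uses a made-up `nations` Series with two missing values, not the real country column. It also shows the `dropna=False` argument to value_counts, which asks Pandas to give the missing values their own row in the counts; treat that as an optional extra rather than something the rest of this page relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing (NaN) values.
nations = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# By default, value_counts leaves missing values out of the counts.
nation_counts = nations.value_counts()
n_counted = nation_counts.sum()

# The shortfall is the number of missing values.
n_missing_small = len(nations) - n_counted
print(n_counted, n_missing_small)

# With dropna=False, the missing values get their own row,
# so the counts add up to the full length again.
all_counts = nations.value_counts(dropna=False)
print(all_counts.sum())
```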
count counts the number of not-missing values#
The count method of a Series does this count directly. That is, it counts the number of values that are not missing:
countries.count()
2126
countries.count() == c_count_sum
True
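As a small cross-check, the not-missing count and the missing count should together account for every value. The sketch below uses a made-up `nations` Series, and assumes the `isna` method of a Series, which this page has not introduced, to mark the missing values with True.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing (NaN) values.
nations = pd.Series(['England', np.nan, 'Norway', 'England', np.nan])

# Number of values that are not missing.
n_present = nations.count()

# isna gives a Boolean Series with True for missing values;
# summing it counts the True values.
n_absent = nations.isna().sum()

print(n_present, n_absent, len(nations))
```

Here the three not-missing values and the two missing values add up to the length of the Series.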