Series are like arrays#
On this page, we look at Pandas’ Series. A Series is the Pandas type that represents a column of data.
# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)
# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')
We return to our original data frame, with the missing values dropped, and the rows labeled with the country codes:
# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data_no_na = gender_data.dropna()
labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata.head()
country_code | country_name | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|
AFG | Afghanistan | 19.961015 | 444.00 | 32.715838 |
AGO | Angola | 111.936542 | 501.25 | 26.937545 |
ALB | Albania | 12.327586 | 29.25 | 2.888280 |
ARE | United Arab Emirates | 375.027082 | 6.00 | 9.080299 |
ARG | Argentina | 550.980968 | 53.75 | 42.976675 |
We found that there was a rather unconvincing relationship between the GDP values and the Maternal Mortality Ratio (MMR) values.
First we fetch those values from their corresponding DataFrame columns, using direct indexing with column labels:
gdp = labeled_gdata['gdp_us_billion']
gdp
country_code
AFG 19.961015
AGO 111.936542
ALB 12.327586
ARE 375.027082
ARG 550.980968
...
WSM 0.799887
YEM 36.819337
ZAF 345.209888
ZMB 24.280990
ZWE 15.495514
Name: gdp_us_billion, Length: 179, dtype: float64
mmr = labeled_gdata['mat_mort_ratio']
mmr
country_code
AFG 444.00
AGO 501.25
ALB 29.25
ARE 6.00
ARG 53.75
...
WSM 54.75
YEM 399.75
ZAF 143.75
ZMB 233.75
ZWE 398.00
Name: mat_mort_ratio, Length: 179, dtype: float64
We plot the two Series against each other to remind ourselves of the relationship.
plt.scatter(gdp, mmr)
plt.title('Maternal mortality ratio as a function of GDP')
Text(0.5, 1.0, 'Maternal mortality ratio as a function of GDP')
Our question was whether the GDP might be a misleading measure, because it will depend, in part, on the population. More people can earn more money. We wanted to calculate a GDP value adjusted for the population.
But first, let us investigate Series a little more.
Series have some of the same methods as DataFrames#
gdp is a Series:
type(gdp)
pandas.core.series.Series
Just as the DataFrame has .head and .tail methods to show the first 5 and last 5 rows (by default), so the Series has .head and .tail:
gdp.head()
country_code
AFG 19.961015
AGO 111.936542
ALB 12.327586
ARE 375.027082
ARG 550.980968
Name: gdp_us_billion, dtype: float64
gdp.head(10)
country_code
AFG 19.961015
AGO 111.936542
ALB 12.327586
ARE 375.027082
ARG 550.980968
ARM 10.885362
AUS 1422.994116
AUT 407.494276
AZE 62.003001
BDI 2.876978
Name: gdp_us_billion, dtype: float64
gdp.tail()
country_code
WSM 0.799887
YEM 36.819337
ZAF 345.209888
ZMB 24.280990
ZWE 15.495514
Name: gdp_us_billion, dtype: float64
As you remember, we can sort a DataFrame using the .sort_values method:
labeled_gdata.sort_values('gdp_us_billion')
country_code | country_name | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|
KIR | Kiribati | 0.177431 | 95.00 | 0.110482 |
STP | Sao Tome and Principe | 0.314540 | 159.50 | 0.191333 |
FSM | Micronesia, Fed. Sts. | 0.319321 | 103.25 | 0.104118 |
TON | Tonga | 0.439179 | 129.25 | 0.105909 |
COM | Comoros | 0.603919 | 349.50 | 0.759556 |
... | ... | ... | ... | ... |
GBR | United Kingdom | 2768.864417 | 9.25 | 64.641557 |
DEU | Germany | 3601.226158 | 6.25 | 81.281645 |
JPN | Japan | 5106.024760 | 5.75 | 127.297102 |
CHN | China | 10182.790479 | 28.75 | 1364.446000 |
USA | United States | 17369.124600 | 14.00 | 318.558175 |
179 rows × 4 columns
This is also true of a Series:
gdp.sort_values()
country_code
KIR 0.177431
STP 0.314540
FSM 0.319321
TON 0.439179
COM 0.603919
...
GBR 2768.864417
DEU 3601.226158
JPN 5106.024760
CHN 10182.790479
USA 17369.124600
Name: gdp_us_billion, Length: 179, dtype: float64
Notice that, for the Series, we don’t have to give .sort_values the column name, because the Series is already the column we want to sort.
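As a minimal sketch of the same idea, here is a small, hypothetical Series (the name toy_gdp and its three rounded values are made up for illustration). It sorts without a column name, and the ascending keyword argument of .sort_values reverses the order:

```python
import pandas as pd

# Hypothetical toy Series, loosely modeled on the GDP column (US billion).
toy_gdp = pd.Series([19.96, 111.94, 12.33],
                    index=['AFG', 'AGO', 'ALB'],
                    name='gdp_us_billion')

# No column name needed: the Series is already the column to sort.
smallest_first = toy_gdp.sort_values()

# Largest values first, using the ascending keyword argument.
largest_first = toy_gdp.sort_values(ascending=False)

print(smallest_first)
print(largest_first)
```

Notice that the labels travel with their values when the Series is sorted.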
A Series has values and labels#
A Series is like an array, in that it contains a sequence of values. In fact, the Series holds that sequence of values in an array. You can get the sequence of values from the Series with the np.array function:
# The values from a Series as an array
np.array(gdp)
array([1.99610151e+01, 1.11936542e+02, 1.23275859e+01, 3.75027082e+02,
5.50980968e+02, 1.08853625e+01, 1.42299412e+03, 4.07494276e+02,
6.20030013e+01, 2.87697831e+00, 4.94221836e+02, 8.77815063e+00,
1.17530544e+01, 1.74545099e+02, 5.37976122e+01, 3.20040106e+01,
8.68800000e+00, 1.73233271e+01, 6.47829419e+01, 1.68032497e+00,
3.15093236e+01, 2.19876561e+03, 4.41308000e+00, 1.57192226e+01,
1.97514532e+00, 1.51133948e+01, 1.74910987e+00, 1.70847363e+03,
6.76642359e+02, 2.59208554e+02, 1.01827905e+04, 3.25358753e+01,
2.81421556e+01, 3.25386626e+01, 1.16655768e+01, 3.40405888e+02,
6.03918965e-01, 1.73054354e+00, 5.18299661e+01, 7.95194750e+01,
2.23473982e+01, 2.00535631e+02, 3.60122616e+03, 1.53090824e+00,
3.26096204e+02, 6.54993582e+01, 1.90734615e+02, 9.66650964e+01,
3.08496722e+02, 1.29972426e+03, 2.39872406e+01, 5.66819675e+01,
2.53688521e+02, 4.33093138e+00, 2.64764973e+03, 3.19320780e-01,
1.62834944e+01, 2.76886442e+03, 1.53644509e+01, 4.17188959e+01,
6.30423092e+00, 9.13773107e-01, 1.06283063e+00, 1.76270598e+01,
2.22206258e+02, 9.10843446e-01, 5.90985382e+01, 3.10872347e+00,
1.98292083e+01, 5.40874424e+01, 8.37327626e+00, 1.29470864e+02,
9.02944866e+02, 2.01900541e+03, 2.59826259e+02, 4.79398094e+02,
2.07685388e+02, 1.67415845e+01, 2.95577073e+02, 2.00598398e+03,
1.42531394e+01, 3.53060370e+01, 5.10602476e+03, 1.96818842e+02,
6.02503997e+01, 6.92754607e+00, 1.68665072e+01, 1.77430636e-01,
1.34675116e+03, 1.56226123e+02, 1.31391168e+01, 4.55819922e+01,
1.96600000e+00, 1.36256387e+00, 7.68085057e+01, 2.45329837e+00,
4.44015393e+01, 6.05560448e+01, 2.88860491e+01, 1.03402329e+02,
7.30314452e+00, 1.01848586e+01, 3.08680306e+00, 1.18880278e+03,
1.05752957e+01, 1.32103703e+01, 1.03510583e+01, 6.30938400e+01,
4.26661171e+00, 1.20006207e+01, 1.46655026e+01, 5.16400962e+00,
1.20895485e+01, 5.88343532e+00, 3.13686193e+02, 1.20684535e+01,
7.50148231e+00, 4.86113579e+02, 1.18747997e+01, 8.19285000e+02,
4.57585186e+02, 2.01166147e+01, 1.85598413e+02, 7.45575405e+01,
2.50934589e+02, 4.82593427e+01, 1.95244387e+02, 2.80838449e+02,
1.59111580e+01, 5.03311262e+02, 1.02107758e+02, 2.15143697e+02,
2.78331214e+01, 1.25082200e+01, 1.81779231e+02, 1.85384091e+02,
1.82269170e+03, 7.91832002e+00, 7.07936120e+02, 8.30167318e+01,
1.45395548e+01, 2.98724394e+02, 1.11453462e+00, 4.33160391e+00,
2.52137140e+01, 5.78525000e+00, 4.10756437e+01, 1.14809386e+01,
3.14539986e-01, 4.77315902e+00, 9.38944735e+01, 4.60488627e+01,
5.40626904e+02, 4.34681654e+00, 1.19459416e+01, 4.18361027e+00,
4.06136904e+02, 8.03622825e+00, 3.79730958e+01, 1.36142965e+00,
4.39178883e-01, 2.45709470e+01, 4.48243740e+01, 8.95175577e+02,
4.49355418e+01, 2.59414607e+01, 1.35379275e+02, 5.43451323e+01,
1.73691246e+04, 6.13406487e+01, 7.30106763e-01, 3.76146268e+02,
1.81820736e+02, 7.82875953e-01, 7.99887347e-01, 3.68193365e+01,
3.45209888e+02, 2.42809899e+01, 1.54955139e+01])
Notice that, by making the Series into an array, we have thrown away the row labels.
The Series also has labels. These labels correspond to the row labels for the DataFrame, and, like them, you can find the Series labels in the Series .index attribute:
gdp.index
Index(['AFG', 'AGO', 'ALB', 'ARE', 'ARG', 'ARM', 'AUS', 'AUT', 'AZE', 'BDI',
...
'UZB', 'VCT', 'VEN', 'VNM', 'VUT', 'WSM', 'YEM', 'ZAF', 'ZMB', 'ZWE'],
dtype='object', name='country_code', length=179)
Think of the Series as the association of the values (np.array(gdp)) and the corresponding labels (gdp.index).
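One way to see this association is to build a small Series by hand from an array of values and a list of labels. The names toy_values, toy_labels and toy_series below are hypothetical, and the three values are copied from the first rows of the GDP column; this is a sketch, not how Pandas builds gdp internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three GDP values and their country codes.
toy_values = np.array([19.961015, 111.936542, 12.327586])
toy_labels = ['AFG', 'AGO', 'ALB']

# A Series pairs each value with its label.
toy_series = pd.Series(toy_values, index=toy_labels, name='gdp_us_billion')

# The values come back as a plain array, without the labels.
values_back = np.array(toy_series)

# The labels live in the .index attribute.
labels_back = list(toy_series.index)

print(values_back)
print(labels_back)
```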
Calculations on Series work like calculations on arrays#
As you remember, calculations on arrays work elementwise. For example, if you multiply an array by a number, that has the effect of making a new array, where the result is each element of the original array multiplied by the number.
The same is true of calculations on Series. For example, we might want to calculate the GDP in millions of US dollars instead of its current values in billions:
# GDP in US million
gdp * 1000
country_code
AFG 19961.015094
AGO 111936.542134
ALB 12327.585927
ARE 375027.082337
ARG 550980.967906
...
WSM 799.887347
YEM 36819.336505
ZAF 345209.888495
ZMB 24280.989920
ZWE 15495.513860
Name: gdp_us_billion, Length: 179, dtype: float64
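To compare the two cases side by side, the sketch below applies the same multiplication to a small array and to a Series built from it. The names toy_arr and toy_ser are hypothetical, and the values are the first three GDP values from above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first three GDP values, in US billion.
toy_arr = np.array([19.961015, 111.936542, 12.327586])
toy_ser = pd.Series(toy_arr, index=['AFG', 'AGO', 'ALB'])

# Elementwise multiplication on the array.
arr_millions = toy_arr * 1000

# Elementwise multiplication on the Series.
ser_millions = toy_ser * 1000

print(arr_millions)
print(ser_millions)
```

The Series result holds the same numbers as the array result, but keeps the labels.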
The elementwise calculations also apply to operations on two Series. In fact, that is the key to solving our problem of getting the GDP values divided by the population. We make the population DataFrame column into a Series.
# Population is in millions.
pop = labeled_gdata['population']
pop
country_code
AFG 32.715838
AGO 26.937545
ALB 2.888280
ARE 9.080299
ARG 42.976675
...
WSM 0.192225
YEM 26.246608
ZAF 54.177209
ZMB 15.633220
ZWE 15.420964
Name: population, Length: 179, dtype: float64
Then we can divide the values in the two Series, elementwise, like this:
# GDP per million people.
gdp_per_mcap = gdp / pop
gdp_per_mcap
country_code
AFG 0.610133
AGO 4.155410
ALB 4.268141
ARE 41.301180
ARG 12.820465
...
WSM 4.161204
YEM 1.402823
ZAF 6.371865
ZMB 1.553166
ZWE 1.004834
Length: 179, dtype: float64
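As a check on the division, the sketch below repeats it for the first three countries, using the values shown above. The names toy_gdp, toy_pop and usd_per_person are hypothetical. Because GDP is in billions of dollars and population is in millions of people, the ratio is in thousands of dollars per person, so multiplying by 1000 gives dollars per person.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, copied from the first three rows above.
codes = ['AFG', 'AGO', 'ALB']
toy_gdp = pd.Series([19.961015, 111.936542, 12.327586], index=codes)  # US billion
toy_pop = pd.Series([32.715838, 26.937545, 2.888280], index=codes)    # millions

# Elementwise division of one Series by another.
toy_ratio = toy_gdp / toy_pop

# Billions of dollars per million people is thousands of dollars per person.
usd_per_person = toy_ratio * 1000

print(toy_ratio)
print(usd_per_person)
```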
This is what we wanted: the GDP divided by the population. Let’s see if there is a more convincing relationship between the GDP per million and the MMR:
plt.scatter(gdp_per_mcap, mmr)
plt.title('MMR as a function of GDP per million people')
Text(0.5, 1.0, 'MMR as a function of GDP per million people')
You can insert Series as columns into DataFrames#
Just as you can make a Series by indexing into a DataFrame, you can insert a Series into a DataFrame as a column, by using indexing.
# Insert new column into DataFrame
labeled_gdata['gdp_per_mcap'] = gdp_per_mcap
labeled_gdata.head()
country_code | country_name | gdp_us_billion | mat_mort_ratio | population | gdp_per_mcap |
---|---|---|---|---|---|
AFG | Afghanistan | 19.961015 | 444.00 | 32.715838 | 0.610133 |
AGO | Angola | 111.936542 | 501.25 | 26.937545 | 4.155410 |
ALB | Albania | 12.327586 | 29.25 | 2.888280 | 4.268141 |
ARE | United Arab Emirates | 375.027082 | 6.00 | 9.080299 | 41.301180 |
ARG | Argentina | 550.980968 | 53.75 | 42.976675 | 12.820465 |
Scroll across the DataFrame display to see the new column at the end.
Here we inserted the Series into the labeled_gdata DataFrame as a new column, by using direct indexing with a column label on the Left Hand Side of the assignment. Read the assignment above as “make a column called ‘gdp_per_mcap’ in labeled_gdata and fill it with the values from the gdp_per_mcap Series”.
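The same pattern can be tried on a small, hypothetical DataFrame. The name toy_df and its two rows are made up for illustration, with values copied from the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row DataFrame with country codes as row labels.
toy_df = pd.DataFrame(
    {'gdp_us_billion': [19.961015, 111.936542],
     'population': [32.715838, 26.937545]},
    index=['AFG', 'AGO'])

# Make a Series by dividing two columns, elementwise.
toy_per_mcap = toy_df['gdp_us_billion'] / toy_df['population']

# Insert the Series as a new column: the indexing is on the left of the =.
toy_df['gdp_per_mcap'] = toy_per_mcap

print(toy_df)
```

The new column appears last, after the existing columns.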
With the Series data in the DataFrame, we can sort the DataFrame by the new GDP per million values:
gdata_by_gdp_mcap = labeled_gdata.sort_values('gdp_per_mcap')
gdata_by_gdp_mcap.head()
country_code | country_name | gdp_us_billion | mat_mort_ratio | population | gdp_per_mcap |
---|---|---|---|---|---|
BDI | Burundi | 2.876978 | 747.25 | 9.907015 | 0.290398 |
MWI | Malawi | 5.883435 | 633.00 | 17.081694 | 0.344429 |
CAF | Central African Republic | 1.749110 | 875.75 | 4.529236 | 0.386182 |
NER | Niger | 7.501482 | 585.50 | 19.175235 | 0.391207 |
SOM | Somalia | 5.785250 | 762.75 | 13.527075 | 0.427679 |
Let us look to see if sorting this way gives a clearer picture of the relationship of income to MMR. Get the richest 25 countries in terms of GDP per million:
richest_per_mcap_25 = gdata_by_gdp_mcap.tail(25)
richest_per_mcap_25
country_code | country_name | gdp_us_billion | mat_mort_ratio | population | gdp_per_mcap |
---|---|---|---|---|---|
ISR | Israel | 295.577073 | 5.00 | 8.222580 | 35.946999 |
BRN | Brunei Darussalam | 15.719223 | 23.75 | 0.411581 | 38.192275 |
FRA | France | 2647.649725 | 8.75 | 66.302099 | 39.933121 |
JPN | Japan | 5106.024760 | 5.75 | 127.297102 | 40.111084 |
NZL | New Zealand | 185.598413 | 11.50 | 4.529660 | 40.974027 |
ARE | United Arab Emirates | 375.027082 | 6.00 | 9.080299 | 41.301180 |
KWT | Kuwait | 156.226123 | 4.00 | 3.752954 | 41.627510 |
GBR | United Kingdom | 2768.864417 | 9.25 | 64.641557 | 42.834123 |
BEL | Belgium | 494.221836 | 7.00 | 11.228495 | 44.014967 |
DEU | Germany | 3601.226158 | 6.25 | 81.281645 | 44.305528 |
FIN | Finland | 253.688521 | 3.00 | 5.457816 | 46.481688 |
AUT | Austria | 407.494276 | 4.00 | 8.566294 | 47.569497 |
CAN | Canada | 1708.473627 | 7.25 | 35.517119 | 48.102821 |
NLD | Netherlands | 819.285000 | 7.00 | 16.876547 | 48.545773 |
ISL | Iceland | 16.741585 | 3.50 | 0.327387 | 51.137049 |
USA | United States | 17369.124600 | 14.00 | 318.558175 | 54.524184 |
SGP | Singapore | 298.724394 | 10.75 | 5.464722 | 54.664156 |
SWE | Sweden | 540.626904 | 4.00 | 9.703634 | 55.713859 |
IRL | Ireland | 259.826259 | 8.00 | 4.650469 | 55.870977 |
DNK | Denmark | 326.096204 | 6.75 | 5.652916 | 57.686370 |
AUS | Australia | 1422.994116 | 6.00 | 23.444560 | 60.696133 |
QAT | Qatar | 181.779231 | 13.25 | 2.357161 | 77.117881 |
CHE | Switzerland | 676.642359 | 5.25 | 8.185870 | 82.659798 |
NOR | Norway | 457.585186 | 5.00 | 5.131393 | 89.173681 |
LUX | Luxembourg | 60.556045 | 10.25 | 0.556640 | 108.788486 |
Plot the relationship of GDP per million and MMR:
plt.scatter(richest_per_mcap_25['gdp_per_mcap'], richest_per_mcap_25['mat_mort_ratio'])
plt.title('MMR as function of GDP per million, richest 25')
Text(0.5, 1.0, 'MMR as function of GDP per million, richest 25')
We might be interested in looking at the richest countries in terms of the MMR, by sorting. The countries with the lowest MMR come first, and those with the highest come last.
richest_per_mcap_25.sort_values('mat_mort_ratio')
country_code | country_name | gdp_us_billion | mat_mort_ratio | population | gdp_per_mcap |
---|---|---|---|---|---|
FIN | Finland | 253.688521 | 3.00 | 5.457816 | 46.481688 |
ISL | Iceland | 16.741585 | 3.50 | 0.327387 | 51.137049 |
KWT | Kuwait | 156.226123 | 4.00 | 3.752954 | 41.627510 |
SWE | Sweden | 540.626904 | 4.00 | 9.703634 | 55.713859 |
AUT | Austria | 407.494276 | 4.00 | 8.566294 | 47.569497 |
ISR | Israel | 295.577073 | 5.00 | 8.222580 | 35.946999 |
NOR | Norway | 457.585186 | 5.00 | 5.131393 | 89.173681 |
CHE | Switzerland | 676.642359 | 5.25 | 8.185870 | 82.659798 |
JPN | Japan | 5106.024760 | 5.75 | 127.297102 | 40.111084 |
AUS | Australia | 1422.994116 | 6.00 | 23.444560 | 60.696133 |
ARE | United Arab Emirates | 375.027082 | 6.00 | 9.080299 | 41.301180 |
DEU | Germany | 3601.226158 | 6.25 | 81.281645 | 44.305528 |
DNK | Denmark | 326.096204 | 6.75 | 5.652916 | 57.686370 |
BEL | Belgium | 494.221836 | 7.00 | 11.228495 | 44.014967 |
NLD | Netherlands | 819.285000 | 7.00 | 16.876547 | 48.545773 |
CAN | Canada | 1708.473627 | 7.25 | 35.517119 | 48.102821 |
IRL | Ireland | 259.826259 | 8.00 | 4.650469 | 55.870977 |
FRA | France | 2647.649725 | 8.75 | 66.302099 | 39.933121 |
GBR | United Kingdom | 2768.864417 | 9.25 | 64.641557 | 42.834123 |
LUX | Luxembourg | 60.556045 | 10.25 | 0.556640 | 108.788486 |
SGP | Singapore | 298.724394 | 10.75 | 5.464722 | 54.664156 |
NZL | New Zealand | 185.598413 | 11.50 | 4.529660 | 40.974027 |
QAT | Qatar | 181.779231 | 13.25 | 2.357161 | 77.117881 |
USA | United States | 17369.124600 | 14.00 | 318.558175 | 54.524184 |
BRN | Brunei Darussalam | 15.719223 | 23.75 | 0.411581 | 38.192275 |
Conversely, we might want to take the poorest 75 by GDP per million, and look at the best and worst by MMR:
poorest_by_mcap_75 = gdata_by_gdp_mcap.head(75)
poorest_by_mcap_75.sort_values('mat_mort_ratio')
country_code | country_name | gdp_us_billion | mat_mort_ratio | population | gdp_per_mcap |
---|---|---|---|---|---|
MDA | Moldova | 7.303145 | 24.25 | 3.556118 | 2.053685 |
UKR | Ukraine | 135.379275 | 24.25 | 45.302704 | 2.988327 |
ARM | Armenia | 10.885362 | 27.25 | 2.904683 | 3.747521 |
LKA | Sri Lanka | 76.808506 | 31.25 | 20.790000 | 3.694493 |
TJK | Tajikistan | 8.036228 | 33.25 | 8.363844 | 0.960830 |
... | ... | ... | ... | ... | ... |
NGA | Nigeria | 486.113579 | 818.50 | 176.551695 | 2.753378 |
SSD | South Sudan | 11.480939 | 827.50 | 11.527917 | 0.995925 |
CAF | Central African Republic | 1.749110 | 875.75 | 4.529236 | 0.386182 |
TCD | Chad | 11.945942 | 892.25 | 13.574024 | 0.880059 |
SLE | Sierra Leone | 4.331604 | 1435.00 | 7.080112 | 0.611799 |
75 rows × 5 columns