Series are like arrays#

On this page, we look at the Pandas Series. A Series is the Pandas type that represents a column of data.

# Load the Numpy array library, call it 'np'
import numpy as np
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')

We return to our original data frame, with the missing values dropped, and the rows labeled with the country codes:

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
gender_data_no_na = gender_data.dropna()
labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata.head()

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
AFG	Afghanistan	19.961015	444.00	32.715838
AGO	Angola	111.936542	501.25	26.937545
ALB	Albania	12.327586	29.25	2.888280
ARE	United Arab Emirates	375.027082	6.00	9.080299
ARG	Argentina	550.980968	53.75	42.976675

We found that there was a rather unconvincing relationship between the GDP values and the Maternal Mortality Ratio (MMR) values.

First we fetch those values from their corresponding DataFrame columns, using direct indexing with column labels:

gdp = labeled_gdata['gdp_us_billion']
gdp

country_code
AFG     19.961015
AGO    111.936542
ALB     12.327586
ARE    375.027082
ARG    550.980968
          ...    
WSM      0.799887
YEM     36.819337
ZAF    345.209888
ZMB     24.280990
ZWE     15.495514
Name: gdp_us_billion, Length: 179, dtype: float64

mmr = labeled_gdata['mat_mort_ratio']
mmr

country_code
AFG    444.00
AGO    501.25
ALB     29.25
ARE      6.00
ARG     53.75
        ...  
WSM     54.75
YEM    399.75
ZAF    143.75
ZMB    233.75
ZWE    398.00
Name: mat_mort_ratio, Length: 179, dtype: float64

We plot the two Series against each other to remind ourselves of the relationship.

plt.scatter(gdp, mmr)
plt.title('Maternal mortality ratio as a function of GDP')

Text(0.5, 1.0, 'Maternal mortality ratio as a function of GDP')

[Figure: scatter plot of maternal mortality ratio against GDP]

Our question was whether the GDP might be a misleading measure, because it will depend, in part, on the population. More people can earn more money. We were interested in calculating a GDP value adjusted for the population.

But first, let us investigate Series a little more.

Series have some of the same methods as DataFrames#

gdp is a Series:

type(gdp)

pandas.core.series.Series

Just as the DataFrame has .head and .tail methods to show the first 5 and last 5 rows (by default), so the Series has .head and .tail methods to show its first 5 and last 5 elements:

gdp.head()

country_code
AFG     19.961015
AGO    111.936542
ALB     12.327586
ARE    375.027082
ARG    550.980968
Name: gdp_us_billion, dtype: float64

gdp.head(10)

country_code
AFG      19.961015
AGO     111.936542
ALB      12.327586
ARE     375.027082
ARG     550.980968
ARM      10.885362
AUS    1422.994116
AUT     407.494276
AZE      62.003001
BDI       2.876978
Name: gdp_us_billion, dtype: float64

gdp.tail()

country_code
WSM      0.799887
YEM     36.819337
ZAF    345.209888
ZMB     24.280990
ZWE     15.495514
Name: gdp_us_billion, dtype: float64
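As a small sketch of how the number passed to .head and .tail changes the result, the block below rebuilds the first ten GDP values shown above as a stand-alone Series. The name gdp_small is an assumption made for this illustration; it is not part of the page's data loading code.

```python
import pandas as pd

# First ten GDP values (US billion), copied from the gdp.head(10) output above.
# 'gdp_small' is a hypothetical name used only for this sketch.
gdp_small = pd.Series(
    [19.961015, 111.936542, 12.327586, 375.027082, 550.980968,
     10.885362, 1422.994116, 407.494276, 62.003001, 2.876978],
    index=['AFG', 'AGO', 'ALB', 'ARE', 'ARG',
           'ARM', 'AUS', 'AUT', 'AZE', 'BDI'],
    name='gdp_us_billion')

# Ask for the last three elements instead of the default five.
last_three = gdp_small.tail(3)
print(last_three)

# A negative number means "all except": here, everything except the last 8.
first_two = gdp_small.head(-8)
print(first_two)
```

The same arguments work for the DataFrame versions of these methods.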

As you remember, we can sort a DataFrame using the .sort_values method:

labeled_gdata.sort_values('gdp_us_billion')

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
KIR	Kiribati	0.177431	95.00	0.110482
STP	Sao Tome and Principe	0.314540	159.50	0.191333
FSM	Micronesia, Fed. Sts.	0.319321	103.25	0.104118
TON	Tonga	0.439179	129.25	0.105909
COM	Comoros	0.603919	349.50	0.759556
...	...	...	...	...
GBR	United Kingdom	2768.864417	9.25	64.641557
DEU	Germany	3601.226158	6.25	81.281645
JPN	Japan	5106.024760	5.75	127.297102
CHN	China	10182.790479	28.75	1364.446000
USA	United States	17369.124600	14.00	318.558175

179 rows × 4 columns

This is also true of a Series:

gdp.sort_values()

country_code
KIR        0.177431
STP        0.314540
FSM        0.319321
TON        0.439179
COM        0.603919
           ...     
GBR     2768.864417
DEU     3601.226158
JPN     5106.024760
CHN    10182.790479
USA    17369.124600
Name: gdp_us_billion, Length: 179, dtype: float64

Notice that, for the Series, we don’t have to give .sort_values the column name, because the Series is already the column we want to sort.
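The sketch below shows the same idea on a five-element Series built from the values above, this time sorting from largest to smallest with the ascending argument. The name gdp_small is an assumption for this illustration only.

```python
import pandas as pd

# Five GDP values (US billion) copied from the output of gdp.head() above.
# 'gdp_small' is a hypothetical name used only for this sketch.
gdp_small = pd.Series(
    [19.961015, 111.936542, 12.327586, 375.027082, 550.980968],
    index=['AFG', 'AGO', 'ALB', 'ARE', 'ARG'],
    name='gdp_us_billion')

# No column name needed: the Series is the column.
# ascending=False puts the largest values first.
largest_first = gdp_small.sort_values(ascending=False)
print(largest_first)

# sort_values returns a new Series; the original order is unchanged.
print(gdp_small)
```

Notice that the labels travel with their values when the Series is sorted.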

A Series has values and labels#

A Series is like an array, in that it contains a sequence of values. In fact, the Series holds that sequence of values in an array. You can get the sequence of values from the Series with the np.array function:

# The values from a Series as an array
np.array(gdp)

array([1.99610151e+01, 1.11936542e+02, 1.23275859e+01, 3.75027082e+02,
       5.50980968e+02, 1.08853625e+01, 1.42299412e+03, 4.07494276e+02,
       6.20030013e+01, 2.87697831e+00, 4.94221836e+02, 8.77815063e+00,
       1.17530544e+01, 1.74545099e+02, 5.37976122e+01, 3.20040106e+01,
       8.68800000e+00, 1.73233271e+01, 6.47829419e+01, 1.68032497e+00,
       3.15093236e+01, 2.19876561e+03, 4.41308000e+00, 1.57192226e+01,
       1.97514532e+00, 1.51133948e+01, 1.74910987e+00, 1.70847363e+03,
       6.76642359e+02, 2.59208554e+02, 1.01827905e+04, 3.25358753e+01,
       2.81421556e+01, 3.25386626e+01, 1.16655768e+01, 3.40405888e+02,
       6.03918965e-01, 1.73054354e+00, 5.18299661e+01, 7.95194750e+01,
       2.23473982e+01, 2.00535631e+02, 3.60122616e+03, 1.53090824e+00,
       3.26096204e+02, 6.54993582e+01, 1.90734615e+02, 9.66650964e+01,
       3.08496722e+02, 1.29972426e+03, 2.39872406e+01, 5.66819675e+01,
       2.53688521e+02, 4.33093138e+00, 2.64764973e+03, 3.19320780e-01,
       1.62834944e+01, 2.76886442e+03, 1.53644509e+01, 4.17188959e+01,
       6.30423092e+00, 9.13773107e-01, 1.06283063e+00, 1.76270598e+01,
       2.22206258e+02, 9.10843446e-01, 5.90985382e+01, 3.10872347e+00,
       1.98292083e+01, 5.40874424e+01, 8.37327626e+00, 1.29470864e+02,
       9.02944866e+02, 2.01900541e+03, 2.59826259e+02, 4.79398094e+02,
       2.07685388e+02, 1.67415845e+01, 2.95577073e+02, 2.00598398e+03,
       1.42531394e+01, 3.53060370e+01, 5.10602476e+03, 1.96818842e+02,
       6.02503997e+01, 6.92754607e+00, 1.68665072e+01, 1.77430636e-01,
       1.34675116e+03, 1.56226123e+02, 1.31391168e+01, 4.55819922e+01,
       1.96600000e+00, 1.36256387e+00, 7.68085057e+01, 2.45329837e+00,
       4.44015393e+01, 6.05560448e+01, 2.88860491e+01, 1.03402329e+02,
       7.30314452e+00, 1.01848586e+01, 3.08680306e+00, 1.18880278e+03,
       1.05752957e+01, 1.32103703e+01, 1.03510583e+01, 6.30938400e+01,
       4.26661171e+00, 1.20006207e+01, 1.46655026e+01, 5.16400962e+00,
       1.20895485e+01, 5.88343532e+00, 3.13686193e+02, 1.20684535e+01,
       7.50148231e+00, 4.86113579e+02, 1.18747997e+01, 8.19285000e+02,
       4.57585186e+02, 2.01166147e+01, 1.85598413e+02, 7.45575405e+01,
       2.50934589e+02, 4.82593427e+01, 1.95244387e+02, 2.80838449e+02,
       1.59111580e+01, 5.03311262e+02, 1.02107758e+02, 2.15143697e+02,
       2.78331214e+01, 1.25082200e+01, 1.81779231e+02, 1.85384091e+02,
       1.82269170e+03, 7.91832002e+00, 7.07936120e+02, 8.30167318e+01,
       1.45395548e+01, 2.98724394e+02, 1.11453462e+00, 4.33160391e+00,
       2.52137140e+01, 5.78525000e+00, 4.10756437e+01, 1.14809386e+01,
       3.14539986e-01, 4.77315902e+00, 9.38944735e+01, 4.60488627e+01,
       5.40626904e+02, 4.34681654e+00, 1.19459416e+01, 4.18361027e+00,
       4.06136904e+02, 8.03622825e+00, 3.79730958e+01, 1.36142965e+00,
       4.39178883e-01, 2.45709470e+01, 4.48243740e+01, 8.95175577e+02,
       4.49355418e+01, 2.59414607e+01, 1.35379275e+02, 5.43451323e+01,
       1.73691246e+04, 6.13406487e+01, 7.30106763e-01, 3.76146268e+02,
       1.81820736e+02, 7.82875953e-01, 7.99887347e-01, 3.68193365e+01,
       3.45209888e+02, 2.42809899e+01, 1.54955139e+01])

Notice that, by making the Series into an array, we have thrown away the row labels.

The Series also has labels. These labels correspond to the row labels for the DataFrame, and, like them, you can find the Series labels in the Series .index attribute:

gdp.index

Index(['AFG', 'AGO', 'ALB', 'ARE', 'ARG', 'ARM', 'AUS', 'AUT', 'AZE', 'BDI',
       ...
       'UZB', 'VCT', 'VEN', 'VNM', 'VUT', 'WSM', 'YEM', 'ZAF', 'ZMB', 'ZWE'],
      dtype='object', name='country_code', length=179)

Think of the Series as the association of the values (np.array(gdp)) and the corresponding labels (gdp.index).
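To make that association concrete, here is a minimal sketch that builds a Series from an array of values and a list of labels, then takes the two parts out again. The names values, labels and gdp_small are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Three GDP values and their country codes, copied from the output above.
# 'values', 'labels' and 'gdp_small' are hypothetical names for this sketch.
values = np.array([19.961015, 111.936542, 12.327586])
labels = ['AFG', 'AGO', 'ALB']

# Put the values and the labels together to make a Series.
gdp_small = pd.Series(values, index=labels, name='gdp_us_billion')
print(gdp_small)

# The two parts come back out again.
print(np.array(gdp_small))
print(gdp_small.index)

# The labels let us fetch a value by its label rather than its position.
print(gdp_small['AGO'])
```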

Calculations on Series work like calculations on arrays#

As you remember, calculations on arrays work elementwise. For example, if you multiply an array by a number, the result is a new array in which each element is the corresponding element of the original array multiplied by that number.

The same is true of calculations on Series. For example, we might want to calculate the GDP in millions of US dollars instead of its current values in billions:

# GDP in US million
gdp * 1000

country_code
AFG     19961.015094
AGO    111936.542134
ALB     12327.585927
ARE    375027.082337
ARG    550980.967906
           ...      
WSM       799.887347
YEM     36819.336505
ZAF    345209.888495
ZMB     24280.989920
ZWE     15495.513860
Name: gdp_us_billion, Length: 179, dtype: float64
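As a quick check on the comparison with arrays, this sketch does the same multiplication on an array and on a Series holding the same values. The names gdp_small and gdp_array are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Three GDP values (US billion) copied from the output above.
# 'gdp_small' and 'gdp_array' are hypothetical names for this sketch.
gdp_small = pd.Series(
    [19.961015, 111.936542, 12.327586],
    index=['AFG', 'AGO', 'ALB'],
    name='gdp_us_billion')
gdp_array = np.array(gdp_small)

# Elementwise multiplication on the array.
million_array = gdp_array * 1000
print(million_array)

# Elementwise multiplication on the Series.
million_series = gdp_small * 1000
print(million_series)
```

The values are the same in both results, but the Series result also keeps its labels and its name.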

The elementwise calculations also apply to operations on two Series. In fact, that is the key to solving our problem of getting the GDP values divided by the population. We make the population DataFrame column into a Series.

# Population is in millions.
pop = labeled_gdata['population']
pop

country_code
AFG    32.715838
AGO    26.937545
ALB     2.888280
ARE     9.080299
ARG    42.976675
         ...    
WSM     0.192225
YEM    26.246608
ZAF    54.177209
ZMB    15.633220
ZWE    15.420964
Name: population, Length: 179, dtype: float64

Then we can use elementwise calculation to divide the values in one Series by the corresponding values in the other, like this:

# GDP per million people.
gdp_per_mcap = gdp / pop
gdp_per_mcap

country_code
AFG     0.610133
AGO     4.155410
ALB     4.268141
ARE    41.301180
ARG    12.820465
         ...    
WSM     4.161204
YEM     1.402823
ZAF     6.371865
ZMB     1.553166
ZWE     1.004834
Length: 179, dtype: float64

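This is what we wanted, the GDP divided by the population. Because the GDP is in billions of US dollars and the population is in millions, the result is in billions of dollars per million people, which is the same as thousands of dollars per person. Let’s see if there is a more convincing relationship between the GDP per million and the MMR: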

plt.scatter(gdp_per_mcap, mmr)
plt.title('MMR as a function of GDP per million people')

Text(0.5, 1.0, 'MMR as a function of GDP per million people')

[Figure: scatter plot of MMR against GDP per million people]
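One detail worth knowing, offered here as a side note: when Pandas divides one Series by another, it matches the elements by their labels, not by their positions. In the division above the two Series came from the same DataFrame, so the labels were already in the same order. The sketch below deliberately shuffles the population values to show the matching at work. The names gdp_small and pop_shuffled are assumptions for this illustration.

```python
import pandas as pd

# Three GDP values (US billion) copied from the output above.
# 'gdp_small' and 'pop_shuffled' are hypothetical names for this sketch.
gdp_small = pd.Series(
    [19.961015, 111.936542, 12.327586],
    index=['AFG', 'AGO', 'ALB'])

# The matching populations (millions), deliberately in a different order.
pop_shuffled = pd.Series(
    [2.888280, 32.715838, 26.937545],
    index=['ALB', 'AFG', 'AGO'])

# Pandas pairs up the values by label before dividing.
ratio = gdp_small / pop_shuffled
print(ratio)
```

Each country's GDP is divided by its own population, giving the same values as the first three rows of gdp_per_mcap, despite the different order of the labels.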

You can insert Series as columns into DataFrames#

Just as you can make a Series by indexing into a DataFrame, you can insert a Series into a DataFrame as a column, by using indexing.

# Insert new column into DataFrame
labeled_gdata['gdp_per_mcap'] = gdp_per_mcap
labeled_gdata.head()

	country_name	gdp_us_billion	mat_mort_ratio	population	gdp_per_mcap
country_code
AFG	Afghanistan	19.961015	444.00	32.715838	0.610133
AGO	Angola	111.936542	501.25	26.937545	4.155410
ALB	Albania	12.327586	29.25	2.888280	4.268141
ARE	United Arab Emirates	375.027082	6.00	9.080299	41.301180
ARG	Argentina	550.980968	53.75	42.976675	12.820465

Scroll across the DataFrame display to see the new column at the end.

Here we inserted the Series into the labeled_gdata DataFrame as a new column, by using direct indexing with the column label on the left-hand side of the assignment. Read the assignment above as “make a column called ‘gdp_per_mcap’ in labeled_gdata and fill it with the values from the gdp_per_mcap Series”.
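Here is the same pattern as a stand-alone sketch, using a three-row DataFrame built from values shown above. The name small is an assumption for this illustration; it stands in for labeled_gdata.

```python
import pandas as pd

# A three-row stand-in for labeled_gdata, with values copied from above.
# 'small' is a hypothetical name used only for this sketch.
small = pd.DataFrame(
    {'gdp_us_billion': [19.961015, 111.936542, 12.327586],
     'population': [32.715838, 26.937545, 2.888280]},
    index=pd.Index(['AFG', 'AGO', 'ALB'], name='country_code'))

# Fetch two columns as Series and divide them elementwise.
gdp_per_mcap = small['gdp_us_billion'] / small['population']

# Direct indexing with a new column label, on the left of the assignment,
# inserts the Series as a new column.
small['gdp_per_mcap'] = gdp_per_mcap
print(small)
```

The new column appears as the last column of the DataFrame.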

With the Series data in the DataFrame, we can sort the DataFrame by the new GDP per million values:

gdata_by_gdp_mcap = labeled_gdata.sort_values('gdp_per_mcap')
gdata_by_gdp_mcap.head()

	country_name	gdp_us_billion	mat_mort_ratio	population	gdp_per_mcap
country_code
BDI	Burundi	2.876978	747.25	9.907015	0.290398
MWI	Malawi	5.883435	633.00	17.081694	0.344429
CAF	Central African Republic	1.749110	875.75	4.529236	0.386182
NER	Niger	7.501482	585.50	19.175235	0.391207
SOM	Somalia	5.785250	762.75	13.527075	0.427679

Let us look to see if sorting this way gives a clearer picture of the relationship of income to MMR. Get the richest 25 countries in terms of GDP per million:

richest_per_mcap_25 = gdata_by_gdp_mcap.tail(25)
richest_per_mcap_25

	country_name	gdp_us_billion	mat_mort_ratio	population	gdp_per_mcap
country_code
ISR	Israel	295.577073	5.00	8.222580	35.946999
BRN	Brunei Darussalam	15.719223	23.75	0.411581	38.192275
FRA	France	2647.649725	8.75	66.302099	39.933121
JPN	Japan	5106.024760	5.75	127.297102	40.111084
NZL	New Zealand	185.598413	11.50	4.529660	40.974027
ARE	United Arab Emirates	375.027082	6.00	9.080299	41.301180
KWT	Kuwait	156.226123	4.00	3.752954	41.627510
GBR	United Kingdom	2768.864417	9.25	64.641557	42.834123
BEL	Belgium	494.221836	7.00	11.228495	44.014967
DEU	Germany	3601.226158	6.25	81.281645	44.305528
FIN	Finland	253.688521	3.00	5.457816	46.481688
AUT	Austria	407.494276	4.00	8.566294	47.569497
CAN	Canada	1708.473627	7.25	35.517119	48.102821
NLD	Netherlands	819.285000	7.00	16.876547	48.545773
ISL	Iceland	16.741585	3.50	0.327387	51.137049
USA	United States	17369.124600	14.00	318.558175	54.524184
SGP	Singapore	298.724394	10.75	5.464722	54.664156
SWE	Sweden	540.626904	4.00	9.703634	55.713859
IRL	Ireland	259.826259	8.00	4.650469	55.870977
DNK	Denmark	326.096204	6.75	5.652916	57.686370
AUS	Australia	1422.994116	6.00	23.444560	60.696133
QAT	Qatar	181.779231	13.25	2.357161	77.117881
CHE	Switzerland	676.642359	5.25	8.185870	82.659798
NOR	Norway	457.585186	5.00	5.131393	89.173681
LUX	Luxembourg	60.556045	10.25	0.556640	108.788486
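As an aside, Pandas DataFrames also have an .nlargest method that can select the rows with the largest values in a column without a separate sort. This page does not use it; the sketch below only compares it with the sort-then-tail approach, on five rows copied from the first display of gdp_per_mcap. The name small is an assumption for this illustration.

```python
import pandas as pd

# Five rows copied from the display of labeled_gdata above.
# 'small' is a hypothetical name used only for this sketch.
small = pd.DataFrame(
    {'country_name': ['Afghanistan', 'Angola', 'Albania',
                      'United Arab Emirates', 'Argentina'],
     'gdp_per_mcap': [0.610133, 4.155410, 4.268141, 41.301180, 12.820465]},
    index=pd.Index(['AFG', 'AGO', 'ALB', 'ARE', 'ARG'], name='country_code'))

# The approach used on this page: sort ascending, take the last rows.
top2_tail = small.sort_values('gdp_per_mcap').tail(2)
print(top2_tail)

# An alternative: ask directly for the rows with the 2 largest values.
top2_nlargest = small.nlargest(2, 'gdp_per_mcap')
print(top2_nlargest)
```

Both approaches select the same rows, but .nlargest puts the largest value first, whereas sorting and using .tail leaves the largest value last.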

Plot the relationship of GDP per million and MMR:

plt.scatter(richest_per_mcap_25['gdp_per_mcap'], richest_per_mcap_25['mat_mort_ratio'])
plt.title('MMR as function of GDP per million, richest 25')

Text(0.5, 1.0, 'MMR as function of GDP per million, richest 25')

[Figure: scatter plot of MMR against GDP per million people, richest 25 countries]

We might be interested in sorting the richest countries by their MMR. The countries with the lowest MMR come first, and those with the highest MMR come last.

richest_per_mcap_25.sort_values('mat_mort_ratio')

	country_name	gdp_us_billion	mat_mort_ratio	population	gdp_per_mcap
country_code
FIN	Finland	253.688521	3.00	5.457816	46.481688
ISL	Iceland	16.741585	3.50	0.327387	51.137049
KWT	Kuwait	156.226123	4.00	3.752954	41.627510
SWE	Sweden	540.626904	4.00	9.703634	55.713859
AUT	Austria	407.494276	4.00	8.566294	47.569497
ISR	Israel	295.577073	5.00	8.222580	35.946999
NOR	Norway	457.585186	5.00	5.131393	89.173681
CHE	Switzerland	676.642359	5.25	8.185870	82.659798
JPN	Japan	5106.024760	5.75	127.297102	40.111084
AUS	Australia	1422.994116	6.00	23.444560	60.696133
ARE	United Arab Emirates	375.027082	6.00	9.080299	41.301180
DEU	Germany	3601.226158	6.25	81.281645	44.305528
DNK	Denmark	326.096204	6.75	5.652916	57.686370
BEL	Belgium	494.221836	7.00	11.228495	44.014967
NLD	Netherlands	819.285000	7.00	16.876547	48.545773
CAN	Canada	1708.473627	7.25	35.517119	48.102821
IRL	Ireland	259.826259	8.00	4.650469	55.870977
FRA	France	2647.649725	8.75	66.302099	39.933121
GBR	United Kingdom	2768.864417	9.25	64.641557	42.834123
LUX	Luxembourg	60.556045	10.25	0.556640	108.788486
SGP	Singapore	298.724394	10.75	5.464722	54.664156
NZL	New Zealand	185.598413	11.50	4.529660	40.974027
QAT	Qatar	181.779231	13.25	2.357161	77.117881
USA	United States	17369.124600	14.00	318.558175	54.524184
BRN	Brunei Darussalam	15.719223	23.75	0.411581	38.192275

Conversely, we might want to take the poorest 75 by GDP per million, and look at the best and worst by MMR:

poorest_by_mcap_75 = gdata_by_gdp_mcap.head(75)
poorest_by_mcap_75.sort_values('mat_mort_ratio')

	country_name	gdp_us_billion	mat_mort_ratio	population	gdp_per_mcap
country_code
MDA	Moldova	7.303145	24.25	3.556118	2.053685
UKR	Ukraine	135.379275	24.25	45.302704	2.988327
ARM	Armenia	10.885362	27.25	2.904683	3.747521
LKA	Sri Lanka	76.808506	31.25	20.790000	3.694493
TJK	Tajikistan	8.036228	33.25	8.363844	0.960830
...	...	...	...	...	...
NGA	Nigeria	486.113579	818.50	176.551695	2.753378
SSD	South Sudan	11.480939	827.50	11.527917	0.995925
CAF	Central African Republic	1.749110	875.75	4.529236	0.386182
TCD	Chad	11.945942	892.25	13.574024	0.880059
SLE	Sierra Leone	4.331604	1435.00	7.080112	0.611799

75 rows × 5 columns
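The two-step pattern above (select by one column, then sort by another) can be sketched on a handful of rows. The seven rows below are copied from displays earlier on this page; the names small, by_wealth and poorest_3 are assumptions for this illustration.

```python
import pandas as pd

# Seven rows with values copied from the displays above.
# 'small', 'by_wealth' and 'poorest_3' are hypothetical names for this sketch.
small = pd.DataFrame(
    {'mat_mort_ratio': [747.25, 633.00, 875.75, 585.50, 762.75, 6.00, 53.75],
     'gdp_per_mcap': [0.290398, 0.344429, 0.386182, 0.391207, 0.427679,
                      41.301180, 12.820465]},
    index=pd.Index(['BDI', 'MWI', 'CAF', 'NER', 'SOM', 'ARE', 'ARG'],
                   name='country_code'))

# Step 1: sort by GDP per million and keep the three poorest.
by_wealth = small.sort_values('gdp_per_mcap')
poorest_3 = by_wealth.head(3)

# Step 2: sort those three by MMR, lowest first.
poorest_3_by_mmr = poorest_3.sort_values('mat_mort_ratio')
print(poorest_3_by_mmr)
```

The second sort only reorders the rows that survived the first selection, so the result still contains just the three poorest countries in this small table.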