Sorting, heads and tails

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

# Load the library for plotting, name it 'plt'
import matplotlib.pyplot as plt
# Make plots look a little more fancy
plt.style.use('fivethirtyeight')

In basic column indexing we found the slightly odd result that there is little relationship between GDP and maternal mortality.

Here we investigate further, by looking at rich and poor countries separately.

In order to do that, we are going to sort our DataFrame using the .sort_values method, and then select the first and last groups of rows, using the .head and .tail methods.

First let us return to the slightly processed DataFrame we were working on before:

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
# Drop the rows that have any missing values.
gender_data_no_na = gender_data.dropna()
# Use the country codes as the row labels.
labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
AFG	Afghanistan	19.961015	444.00	32.715838
AGO	Angola	111.936542	501.25	26.937545
ALB	Albania	12.327586	29.25	2.888280
ARE	United Arab Emirates	375.027082	6.00	9.080299
ARG	Argentina	550.980968	53.75	42.976675
...	...	...	...	...
WSM	Samoa	0.799887	54.75	0.192225
YEM	Yemen, Rep.	36.819337	399.75	26.246608
ZAF	South Africa	345.209888	143.75	54.177209
ZMB	Zambia	24.280990	233.75	15.633220
ZWE	Zimbabwe	15.495514	398.00	15.420964

179 rows × 4 columns

Here is the plot we saw before, with the unconvincing relationship of GDP to Maternal Mortality Rate (MMR).

plt.scatter(labeled_gdata['gdp_us_billion'],
            labeled_gdata['mat_mort_ratio'])
plt.title('MMR as a function of GDP')

Text(0.5, 1.0, 'MMR as a function of GDP')

../_images/c3271836f0db8d749f61ef0752da991cec6991af1f6cfdbb542a59472e3e1c04.png

We wondered whether the relationship of GDP and MMR might be different for rich and poor countries.

To look at that, we can sort the DataFrame by the GDP values.

In order to do that, we use the .sort_values method, passing the column name containing the values we want to sort by:

gdata_by_gdp = labeled_gdata.sort_values('gdp_us_billion')
gdata_by_gdp

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
KIR	Kiribati	0.177431	95.00	0.110482
STP	Sao Tome and Principe	0.314540	159.50	0.191333
FSM	Micronesia, Fed. Sts.	0.319321	103.25	0.104118
TON	Tonga	0.439179	129.25	0.105909
COM	Comoros	0.603919	349.50	0.759556
...	...	...	...	...
GBR	United Kingdom	2768.864417	9.25	64.641557
DEU	Germany	3601.226158	6.25	81.281645
JPN	Japan	5106.024760	5.75	127.297102
CHN	China	10182.790479	28.75	1364.446000
USA	United States	17369.124600	14.00	318.558175

179 rows × 4 columns

Notice that the .sort_values method returned a new DataFrame with the rows in ascending order of the values in the given column (here gdp_us_billion). We therefore have a DataFrame where the poorest countries are first and the richest last.
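As a small check that .sort_values gives back a new DataFrame and leaves the original alone, here is a minimal sketch. It assumes a hand-typed sample of three rows, with GDP values rounded from the table above; the names `sample` and `sorted_sample` are ours, for illustration only.

```python
import pandas as pd

# Hand-typed sample of three countries (GDP values rounded for illustration).
sample = pd.DataFrame(
    {'country_name': ['Afghanistan', 'Angola', 'Albania'],
     'gdp_us_billion': [19.96, 111.94, 12.33]},
    index=['AFG', 'AGO', 'ALB'])

# Sorting returns a new DataFrame, in ascending order of GDP.
sorted_sample = sample.sort_values('gdp_us_billion')

# The sorted copy runs from lowest to highest GDP.
print(list(sorted_sample.index))
# The original DataFrame keeps its row order.
print(list(sample.index))
```

The row labels travel with their rows, so after sorting we can still look up a country by its code.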

Ascending order is the default sort order, but you can ask for descending order by giving the ascending keyword argument a value of False, like this:

gdata_by_desc_gdp = labeled_gdata.sort_values('gdp_us_billion',
                                               ascending=False)
gdata_by_desc_gdp

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
USA	United States	17369.124600	14.00	318.558175
CHN	China	10182.790479	28.75	1364.446000
JPN	Japan	5106.024760	5.75	127.297102
DEU	Germany	3601.226158	6.25	81.281645
GBR	United Kingdom	2768.864417	9.25	64.641557
...	...	...	...	...
COM	Comoros	0.603919	349.50	0.759556
TON	Tonga	0.439179	129.25	0.105909
FSM	Micronesia, Fed. Sts.	0.319321	103.25	0.104118
STP	Sao Tome and Principe	0.314540	159.50	0.191333
KIR	Kiribati	0.177431	95.00	0.110482

179 rows × 4 columns

Notice that now the richest countries are first and the poorest last.
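To see the effect of the ascending argument in isolation, here is a minimal sketch on a four-row sample. The values are rounded from the tables above, and the variable names are ours, for illustration only.

```python
import pandas as pd

# Four countries in no particular order (GDP values rounded for illustration).
sample = pd.DataFrame(
    {'gdp_us_billion': [2768.86, 0.18, 17369.12, 0.60]},
    index=['GBR', 'KIR', 'USA', 'COM'])

# Default: ascending order, lowest GDP first.
lowest_first = sample.sort_values('gdp_us_billion')
# ascending=False: descending order, highest GDP first.
highest_first = sample.sort_values('gdp_us_billion', ascending=False)

print(list(lowest_first.index))
print(list(highest_first.index))
```

When the values are all different, as they are here, the descending order is exactly the reverse of the ascending order.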

Let us go back to the poorest to richest sorted DataFrame, gdata_by_gdp. The DataFrame has a .head method that, by default, will select the first 5 rows of the DataFrame:

gdata_by_gdp.head()

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
KIR	Kiribati	0.177431	95.00	0.110482
STP	Sao Tome and Principe	0.314540	159.50	0.191333
FSM	Micronesia, Fed. Sts.	0.319321	103.25	0.104118
TON	Tonga	0.439179	129.25	0.105909
COM	Comoros	0.603919	349.50	0.759556

Notice that the result is a new DataFrame that only has 5 rows.

In fact we often use .head to show a small sample of the DataFrame, and you will see that use throughout the rest of the course.

You can also give .head the number of rows you want. For example, to select the 125 poorest countries (in terms of GDP), you could use:

poorest_125 = gdata_by_gdp.head(125)
poorest_125

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
KIR	Kiribati	0.177431	95.00	0.110482
STP	Sao Tome and Principe	0.314540	159.50	0.191333
FSM	Micronesia, Fed. Sts.	0.319321	103.25	0.104118
TON	Tonga	0.439179	129.25	0.105909
COM	Comoros	0.603919	349.50	0.759556
...	...	...	...	...
AGO	Angola	111.936542	501.25	26.937545
HUN	Hungary	129.470864	16.25	9.868180
UKR	Ukraine	135.379275	24.25	45.302704
KWT	Kuwait	156.226123	4.00	3.752954
BGD	Bangladesh	174.545099	194.75	159.371214

125 rows × 4 columns
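Here is a minimal sketch of how the argument to .head behaves, using a made-up DataFrame of 8 rows rather than the real data. The name `toy` and its values are ours, for illustration only.

```python
import pandas as pd

# A made-up DataFrame with 8 rows, already in ascending order.
toy = pd.DataFrame({'value': [1, 2, 3, 5, 8, 13, 21, 34]})

# With no argument, .head selects the first 5 rows.
first_default = toy.head()
# With an argument, .head selects that many rows.
first_three = toy.head(3)
# Asking for more rows than there are gives all the rows.
first_many = toy.head(100)

print(len(first_default))
print(list(first_three['value']))
print(len(first_many))
```

In each case the result is a new DataFrame; `toy` itself still has all 8 rows.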

Now that we have the rows corresponding to the 125 poorest countries, we can repeat our GDP / MMR plot, restricted to those countries:

plt.scatter(poorest_125['gdp_us_billion'], poorest_125['mat_mort_ratio'])
plt.title('MMR as a function of GDP, for 125 poorest countries')

Text(0.5, 1.0, 'MMR as a function of GDP, for 125 poorest countries')

../_images/ff704b37db9be9588c1d5bf61e6a2b9581ceba46e3f293bf11005119dc38e8fd.png

If we sort the new DataFrame by the MMR values, we can see which of these 125 poorest countries are doing particularly well or badly in terms of MMR:

poorest_125.sort_values('mat_mort_ratio')

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
ISL	Iceland	16.741585	3.50	0.327387
BLR	Belarus	64.782942	4.00	9.480348
KWT	Kuwait	156.226123	4.00	3.752954
SVK	Slovak Republic	93.894473	6.00	5.418425
CYP	Cyprus	22.347398	7.00	1.152475
...	...	...	...	...
SOM	Somalia	5.785250	762.75	13.527075
SSD	South Sudan	11.480939	827.50	11.527917
CAF	Central African Republic	1.749110	875.75	4.529236
TCD	Chad	11.945942	892.25	13.574024
SLE	Sierra Leone	4.331604	1435.00	7.080112

125 rows × 4 columns
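Because the sorted result is itself a DataFrame, we can follow the sort with .head or .tail to pull out just the best or worst rows. The following is a minimal sketch, assuming a hand-typed sample of six of the countries shown above; the variable names are ours, for illustration only.

```python
import pandas as pd

# Six of the countries from the table above, in no particular order.
sample = pd.DataFrame(
    {'mat_mort_ratio': [892.25, 3.50, 1435.00, 7.00, 4.00, 762.75]},
    index=['TCD', 'ISL', 'SLE', 'CYP', 'BLR', 'SOM'])

# Sort from lowest to highest MMR.
by_mmr = sample.sort_values('mat_mort_ratio')

# The first rows have the lowest MMR, the last rows the highest.
lowest_two = by_mmr.head(2)
highest_two = by_mmr.tail(2)

print(list(lowest_two.index))
print(list(highest_two.index))
```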

DataFrames also have a .tail method that, by default, gives the last 5 rows of the DataFrame. For example, these are the 5 richest countries:

gdata_by_gdp.tail()

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
GBR	United Kingdom	2768.864417	9.25	64.641557
DEU	Germany	3601.226158	6.25	81.281645
JPN	Japan	5106.024760	5.75	127.297102
CHN	China	10182.790479	28.75	1364.446000
USA	United States	17369.124600	14.00	318.558175

As with .head, we can give .tail the number of rows we want. Here we are looking at the last 25 rows of the sorted DataFrame and, therefore, the 25 richest countries:

richest_25 = gdata_by_gdp.tail(25)
richest_25

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
NGA	Nigeria	486.113579	818.50	176.551695
BEL	Belgium	494.221836	7.00	11.228495
POL	Poland	503.311262	3.00	38.009905
SWE	Sweden	540.626904	4.00	9.703634
ARG	Argentina	550.980968	53.75	42.976675
CHE	Switzerland	676.642359	5.25	8.185870
SAU	Saudi Arabia	707.936120	12.25	30.728077
NLD	Netherlands	819.285000	7.00	16.876547
TUR	Turkey	895.175577	17.50	77.034345
IDN	Indonesia	902.944866	136.75	255.064836
MEX	Mexico	1188.802780	40.00	124.203450
ESP	Spain	1299.724261	5.00	46.553128
KOR	Korea, Rep.	1346.751162	12.00	50.727212
AUS	Australia	1422.994116	6.00	23.444560
CAN	Canada	1708.473627	7.25	35.517119
RUS	Russian Federation	1822.691700	25.25	143.793504
ITA	Italy	2005.983980	4.00	60.378795
IND	India	2019.005411	185.25	1293.742537
BRA	Brazil	2198.765606	49.50	204.159544
FRA	France	2647.649725	8.75	66.302099
GBR	United Kingdom	2768.864417	9.25	64.641557
DEU	Germany	3601.226158	6.25	81.281645
JPN	Japan	5106.024760	5.75	127.297102
CHN	China	10182.790479	28.75	1364.446000
USA	United States	17369.124600	14.00	318.558175
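To make the link between sort order and .tail concrete, here is a minimal sketch on a six-row sample, with values rounded from the tables above. It compares .tail on the ascending sort with .head on the descending sort; the variable names are ours, for illustration only.

```python
import pandas as pd

# Six countries (GDP values rounded for illustration).
sample = pd.DataFrame(
    {'gdp_us_billion': [0.18, 3601.23, 0.44, 17369.12, 0.31, 2768.86]},
    index=['KIR', 'DEU', 'TON', 'USA', 'STP', 'GBR'])

# Ascending sort, then the last 2 rows: the 2 richest, richest last.
richest_2 = sample.sort_values('gdp_us_billion').tail(2)

# Descending sort, then the first 2 rows: the same 2 countries, richest first.
richest_2_desc = sample.sort_values('gdp_us_billion',
                                    ascending=False).head(2)

print(list(richest_2.index))
print(list(richest_2_desc.index))
```

Both routes select the same countries; they differ only in the order of the rows in the result.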

plt.scatter(richest_25['gdp_us_billion'], richest_25['mat_mort_ratio'])
plt.title('MMR as a function of GDP, for 25 richest countries')

Text(0.5, 1.0, 'MMR as a function of GDP, for 25 richest countries')

../_images/73143fb2d0d1f365bed7eb746cec588f654c096e6bf47493b47b772b2cb06fd6.png

Again, we can sort by the MMR values to show the best and worst countries in terms of maternal health:

richest_25.sort_values('mat_mort_ratio')

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
POL	Poland	503.311262	3.00	38.009905
SWE	Sweden	540.626904	4.00	9.703634
ITA	Italy	2005.983980	4.00	60.378795
ESP	Spain	1299.724261	5.00	46.553128
CHE	Switzerland	676.642359	5.25	8.185870
JPN	Japan	5106.024760	5.75	127.297102
AUS	Australia	1422.994116	6.00	23.444560
DEU	Germany	3601.226158	6.25	81.281645
BEL	Belgium	494.221836	7.00	11.228495
NLD	Netherlands	819.285000	7.00	16.876547
CAN	Canada	1708.473627	7.25	35.517119
FRA	France	2647.649725	8.75	66.302099
GBR	United Kingdom	2768.864417	9.25	64.641557
KOR	Korea, Rep.	1346.751162	12.00	50.727212
SAU	Saudi Arabia	707.936120	12.25	30.728077
USA	United States	17369.124600	14.00	318.558175
TUR	Turkey	895.175577	17.50	77.034345
RUS	Russian Federation	1822.691700	25.25	143.793504
CHN	China	10182.790479	28.75	1364.446000
MEX	Mexico	1188.802780	40.00	124.203450
BRA	Brazil	2198.765606	49.50	204.159544
ARG	Argentina	550.980968	53.75	42.976675
IDN	Indonesia	902.944866	136.75	255.064836
IND	India	2019.005411	185.25	1293.742537
NGA	Nigeria	486.113579	818.50	176.551695

To investigate further, we need to do some calculations to adjust for the population.