The DataFrame and the index

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

In the introduction to data frames page, we introduced data frames and looked at columns and column labels.

Here we load the same data from a file to give a DataFrame:

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
# Show the result
gender_data

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
0	Aruba	ABW	NaN	NaN	0.103744
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
4	Andorra	AND	3.197538	NaN	0.079547
...	...	...	...	...	...
211	Kosovo	XKX	6.804620	NaN	1.813820
212	Yemen, Rep.	YEM	36.819337	399.75	26.246608
213	South Africa	ZAF	345.209888	143.75	54.177209
214	Zambia	ZMB	24.280990	233.75	15.633220
215	Zimbabwe	ZWE	15.495514	398.00	15.420964

216 rows × 5 columns

Again we see the rows and the columns. In the last page we concentrated on the DataFrame columns, the .columns attribute, and the column labels. This time we are going to concentrate on the rows.

Row labels

The DataFrame has rows, and, in fact, the rows also have labels.

The row labels may not have been obvious from the displays you have seen so far. Because the row labels are numbers, it would have been easy to mistake them for row numbers.

You can see the row labels on the left-hand side of the rows in the default display, above. For this default case, the label of the first row is 0, the label of the second row is 1, and so on.

Row labels are not the same as row numbers

By default, when you load a DataFrame, Pandas gives each row a numeric label, counting up from 0 in the order of the rows in the file. That means that, by default, the row label will correspond to the row position. Here’s the DataFrame we got from the default pd.read_csv load of the data file:

gender_data

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
0	Aruba	ABW	NaN	NaN	0.103744
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
4	Andorra	AND	3.197538	NaN	0.079547
...	...	...	...	...	...
211	Kosovo	XKX	6.804620	NaN	1.813820
212	Yemen, Rep.	YEM	36.819337	399.75	26.246608
213	South Africa	ZAF	345.209888	143.75	54.177209
214	Zambia	ZMB	24.280990	233.75	15.633220
215	Zimbabwe	ZWE	15.495514	398.00	15.420964

216 rows × 5 columns

Sure enough, the row at position 0 has label 0, the row at position 1 has label 1, and so on, all the way up to label 215 for the last (216th) row. The row labels happen to correspond to the row positions.

But in general, the row labels have no necessary relationship to the row positions. In fact, the row labels need not even be numbers.

You can see that the row labels need not correspond to position when you drop some rows, as we did in the data frame introduction page.

Let’s drop the missing values again:

gender_data_no_na = gender_data.dropna()
gender_data_no_na

	country_name	country_code	gdp_us_billion	mat_mort_ratio	population
1	Afghanistan	AFG	19.961015	444.00	32.715838
2	Angola	AGO	111.936542	501.25	26.937545
3	Albania	ALB	12.327586	29.25	2.888280
5	United Arab Emirates	ARE	375.027082	6.00	9.080299
6	Argentina	ARG	550.980968	53.75	42.976675
...	...	...	...	...	...
210	Samoa	WSM	0.799887	54.75	0.192225
212	Yemen, Rep.	YEM	36.819337	399.75	26.246608
213	South Africa	ZAF	345.209888	143.75	54.177209
214	Zambia	ZMB	24.280990	233.75	15.633220
215	Zimbabwe	ZWE	15.495514	398.00	15.420964

179 rows × 5 columns

We have now dropped the rows with labels 0, 4 and 8, among others. The first row in the data frame (the row at position 0) now has label 1, the second row (position 1) has label 2, and the fourth (position 3) has label 5. The row labels no longer correspond to the row positions.

Row labels are in the `.index` attribute

In introduction to data frames, we found that Pandas houses the column labels in the .columns attribute of the DataFrame.

You can get the labels for the rows with the .index attribute:

gender_data_no_na.index

Index([  1,   2,   3,   5,   6,   7,  10,  11,  12,  13,
       ...
       203, 204, 205, 208, 209, 210, 212, 213, 214, 215],
      dtype='int64', length=179)

Notice that Pandas stores the row labels in an Index-type object (value), just as it stored the column labels in an Index object (value).
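As a quick check of this point, here is a minimal sketch using a small made-up DataFrame (the names `small`, `code` and `value` are invented for illustration and are not part of the gender statistics data). It asks Python whether the row labels and the column labels are both instances of the Pandas Index type.

```python
import pandas as pd

# Made-up data for illustration; not the gender statistics file.
small = pd.DataFrame({'code': ['AAA', 'BBB', 'CCC'],
                      'value': [1.5, 2.5, 3.5]})

# Check that the row labels and the column labels are both Index objects.
rows_are_index = isinstance(small.index, pd.Index)
cols_are_index = isinstance(small.columns, pd.Index)
print(rows_are_index, cols_are_index)
```

Both checks should give True, because Pandas uses the same family of Index objects for row labels and column labels.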

Just as for the .columns attribute, you can get the row labels by applying list to the .index attribute:

# Make the row labels into a list.
row_labels = list(gender_data_no_na.index)
# Show the first 10 labels
row_labels[:10]

[1, 2, 3, 5, 6, 7, 10, 11, 12, 13]
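To see the same effect on a scale small enough to check by eye, here is a sketch with a made-up four-row DataFrame (the names `small`, `name` and `value` are invented for illustration). Two of its rows have missing values, so dropping them should leave gaps in the row labels.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; not the gender statistics file.
small = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                      'value': [np.nan, 10.0, np.nan, 30.0]})

# Drop the rows with missing values (the rows labeled 0 and 2).
small_no_na = small.dropna()

# The remaining rows keep their original labels.
remaining_labels = list(small_no_na.index)
print(remaining_labels)
```

The rows that remain are at positions 0 and 1, but they keep the labels 1 and 3 from the original DataFrame.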

Changing the row labels

As we said above, by default the row labels are numbers, but they need not be numbers; they could be strings.

Sometimes it is useful to change the default numeric row labels to something more memorable to indicate the nature of the row.

For example, in our case, our rows correspond to countries. We might want the row label to remind us which country the row refers to. There is a column, country_code, with a unique code for each country. We can use those values to replace the default numeric labels, using the .set_index method.

Remember, a method is a function attached to a value. (Technically, it is an attribute whose value is a function.) We use .set_index by passing the name of the column we want to use.

labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata

	country_name	gdp_us_billion	mat_mort_ratio	population
country_code
AFG	Afghanistan	19.961015	444.00	32.715838
AGO	Angola	111.936542	501.25	26.937545
ALB	Albania	12.327586	29.25	2.888280
ARE	United Arab Emirates	375.027082	6.00	9.080299
ARG	Argentina	550.980968	53.75	42.976675
...	...	...	...	...
WSM	Samoa	0.799887	54.75	0.192225
YEM	Yemen, Rep.	36.819337	399.75	26.246608
ZAF	South Africa	345.209888	143.75	54.177209
ZMB	Zambia	24.280990	233.75	15.633220
ZWE	Zimbabwe	15.495514	398.00	15.420964

179 rows × 4 columns

Notice that the new values to the left of each row are the corresponding country codes, instead of the numeric labels we had before. Notice too that Pandas pulled the country_code column out of the DataFrame, to avoid duplication. You can tell Pandas not to pull the column out of the DataFrame by passing drop=False as an argument to set_index.
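The difference between the default behavior and drop=False can be sketched with a small made-up DataFrame (the names `small`, `code` and `value` are invented for illustration, standing in for the country data).

```python
import pandas as pd

# Made-up data for illustration; not the gender statistics file.
small = pd.DataFrame({'code': ['AAA', 'BBB', 'CCC'],
                      'value': [1.5, 2.5, 3.5]})

# Default: the 'code' column moves out of the columns to become the row labels.
moved = small.set_index('code')
# With drop=False, 'code' supplies the row labels and also stays as a column.
kept = small.set_index('code', drop=False)

moved_columns = list(moved.columns)
kept_columns = list(kept.columns)
print(moved_columns)
print(kept_columns)
print(list(kept.index))
```

In both results the row labels are the strings from the `code` column. Only the second result still has `code` among its columns.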