The DataFrame and the index

# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)

In the introduction to data frames, we introduced data frames and looked at columns and column labels.

Here we load the same data from a file to give a DataFrame:

# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
# Show the result
gender_data
country_name country_code gdp_us_billion mat_mort_ratio population
0 Aruba ABW NaN NaN 0.103744
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
4 Andorra AND 3.197538 NaN 0.079547
... ... ... ... ... ...
211 Kosovo XKX 6.804620 NaN 1.813820
212 Yemen, Rep. YEM 36.819337 399.75 26.246608
213 South Africa ZAF 345.209888 143.75 54.177209
214 Zambia ZMB 24.280990 233.75 15.633220
215 Zimbabwe ZWE 15.495514 398.00 15.420964

216 rows × 5 columns

Again we see the rows and the columns. In the last page we concentrated on the DataFrame columns, the .columns attribute, and the column labels. This time we are going to concentrate on the rows.

Row labels

The DataFrame has rows, and, in fact, the rows also have labels.

The row labels may not have been obvious from the displays you have seen so far. Because the labels are numbers, it would have been easy to mistake them for row numbers.

You can see the row labels on the left hand side of the rows in the default display, above. For this default case, the label of the first row is 0, the label of the second row is 1, and so on.

Row labels are not the same as row numbers

By default, when you load a DataFrame, Pandas gives each row a label that is a number, and those numbers run in sequence, following the order of the rows in the file. That means that, by default, the row label corresponds to the row position. Here’s the DataFrame we got from the default pd.read_csv load of the data file:

gender_data
country_name country_code gdp_us_billion mat_mort_ratio population
0 Aruba ABW NaN NaN 0.103744
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
4 Andorra AND 3.197538 NaN 0.079547
... ... ... ... ... ...
211 Kosovo XKX 6.804620 NaN 1.813820
212 Yemen, Rep. YEM 36.819337 399.75 26.246608
213 South Africa ZAF 345.209888 143.75 54.177209
214 Zambia ZMB 24.280990 233.75 15.633220
215 Zimbabwe ZWE 15.495514 398.00 15.420964

216 rows × 5 columns

Sure enough, the row at position 0 has label 0, the row at position 1 has label 1, and so on, all the way up to label 215 for the last (216th) row. The row labels happen to correspond to the row positions.

But in general, the row labels have no necessary relationship to the row positions. In fact, the row labels need not even be numbers.

You can see that the row labels need not correspond to position when you drop some rows, as we did in the data frame introduction page.

Let’s drop the missing values again:

gender_data_no_na = gender_data.dropna()
gender_data_no_na
country_name country_code gdp_us_billion mat_mort_ratio population
1 Afghanistan AFG 19.961015 444.00 32.715838
2 Angola AGO 111.936542 501.25 26.937545
3 Albania ALB 12.327586 29.25 2.888280
5 United Arab Emirates ARE 375.027082 6.00 9.080299
6 Argentina ARG 550.980968 53.75 42.976675
... ... ... ... ... ...
210 Samoa WSM 0.799887 54.75 0.192225
212 Yemen, Rep. YEM 36.819337 399.75 26.246608
213 South Africa ZAF 345.209888 143.75 54.177209
214 Zambia ZMB 24.280990 233.75 15.633220
215 Zimbabwe ZWE 15.495514 398.00 15.420964

179 rows × 5 columns

We have now dropped the rows with labels 0, 4 and 8, among others. The first row in the data frame (the row at position 0) now has label 1, the second row (position 1) has label 2, and the fourth (position 3) has label 5. The row labels no longer correspond to the row positions.
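To make the difference between labels and positions concrete, here is a minimal sketch that does not need the data file. It assumes a small hand-built data frame, here called `small` (a name made up for this example), holding the country codes and `mat_mort_ratio` values of the first five rows shown above.

```python
import pandas as pd

# A small, hand-built stand-in for the first five rows of the data file.
# The name `small` is only for illustration.
small = pd.DataFrame({
    'country_code': ['ABW', 'AFG', 'AGO', 'ALB', 'AND'],
    'mat_mort_ratio': [float('nan'), 444.00, 501.25, 29.25, float('nan')],
})
# By default, the labels match the positions.
default_labels = list(small.index)
# Drop rows with missing values; the surviving rows keep their old labels.
small_no_na = small.dropna()
remaining_labels = list(small_no_na.index)
# The positions of the remaining rows always start again from 0.
positions = list(range(len(small_no_na)))
print('Default labels:  ', default_labels)
print('Remaining labels:', remaining_labels)
print('Positions:       ', positions)
```

The rows for ABW and AND have missing values, so they are dropped. The remaining labels are 1, 2 and 3, while the positions are 0, 1 and 2.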

Row labels are in the .index attribute

In the introduction to data frames, we found that Pandas houses the column labels in the .columns attribute of the DataFrame.

You can get the labels for the rows with the .index attribute:

gender_data_no_na.index
Index([  1,   2,   3,   5,   6,   7,  10,  11,  12,  13,
       ...
       203, 204, 205, 208, 209, 210, 212, 213, 214, 215],
      dtype='int64', length=179)

Notice that Pandas stores the row labels in an Index-type object (value), just as it stored the column labels in an Index object (value).
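If you want to check this for yourself, the following sketch uses `isinstance` to ask whether the row labels and the column labels are both Index objects. The data frame `small` is a hand-built example using a few values from the display above, not the full data file.

```python
import pandas as pd

# A small, hand-built data frame; `small` is a name made up for this sketch.
small = pd.DataFrame({
    'country_code': ['AFG', 'AGO', 'ALB'],
    'population': [32.715838, 26.937545, 2.888280],
})
# Are the row labels stored in an Index-type object?
rows_are_index = isinstance(small.index, pd.Index)
# Are the column labels stored in an Index-type object?
cols_are_index = isinstance(small.columns, pd.Index)
print('Row labels in an Index:   ', rows_are_index)
print('Column labels in an Index:', cols_are_index)
```

Both checks give True. The default sequential row labels live in a specialized kind of Index (a RangeIndex), which still counts as an Index for this check.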

Just as for the .columns attribute, you can get the row labels as a list by applying list to the .index attribute:

# Make the row labels into a list.
row_labels = list(gender_data_no_na.index)
# Show the first 10 labels
row_labels[:10]
[1, 2, 3, 5, 6, 7, 10, 11, 12, 13]

Changing the row labels

As we said above, the row labels are numbers by default, but they need not be numbers; they could be strings.

Sometimes it is useful to change the default numeric row labels to something more memorable to indicate the nature of the row.

For example, in our case, the rows correspond to countries. We might want the row label to remind us which country the row refers to. There is a column, country_code, with a unique code for each country. We can use those values to replace the default numeric labels, using the .set_index method.

Remember, a method is a function attached to a value. (Technically, it is an attribute where the value of the attribute is a function.) We use .set_index by passing the name of the column we want to use for the row labels.

labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata
country_name gdp_us_billion mat_mort_ratio population
country_code
AFG Afghanistan 19.961015 444.00 32.715838
AGO Angola 111.936542 501.25 26.937545
ALB Albania 12.327586 29.25 2.888280
ARE United Arab Emirates 375.027082 6.00 9.080299
ARG Argentina 550.980968 53.75 42.976675
... ... ... ... ...
WSM Samoa 0.799887 54.75 0.192225
YEM Yemen, Rep. 36.819337 399.75 26.246608
ZAF South Africa 345.209888 143.75 54.177209
ZMB Zambia 24.280990 233.75 15.633220
ZWE Zimbabwe 15.495514 398.00 15.420964

179 rows × 4 columns

Notice that the new values to the left of the rows are the corresponding country codes, instead of the numeric labels we had before. Notice too that Pandas pulled the country_code column out of the DataFrame, to avoid duplication. You can tell Pandas not to pull the column out of the DataFrame by passing drop=False as an argument to set_index.
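Here is a minimal sketch comparing the default behavior of set_index with drop=False. It uses a small hand-built data frame, `small` (a made-up name), with three of the rows shown above, so you can run it without the data file.

```python
import pandas as pd

# A small, hand-built data frame; `small` is a name made up for this sketch.
small = pd.DataFrame({
    'country_name': ['Afghanistan', 'Angola', 'Albania'],
    'country_code': ['AFG', 'AGO', 'ALB'],
    'population': [32.715838, 26.937545, 2.888280],
})
# Default: Pandas moves country_code out of the columns to become the row labels.
labeled = small.set_index('country_code')
# With drop=False: the column stays in place, and also provides the row labels.
labeled_keep = small.set_index('country_code', drop=False)
print('Columns, default:    ', list(labeled.columns))
print('Columns, drop=False: ', list(labeled_keep.columns))
print('Row labels:          ', list(labeled_keep.index))
```

In both cases the row labels are the country codes. The difference is only whether country_code also remains as an ordinary column.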