The DataFrame and the index#
# Load the Pandas data science library, call it 'pd'
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.copy_on_write', True)
In the introduction to data frames, we introduced data frames and looked at columns and column labels.
Here we load the same data from a file to give a DataFrame:
# Original data frame before dropping missing values.
gender_data = pd.read_csv('gender_stats_min.csv')
# Show the result
gender_data
 | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|---|
0 | Aruba | ABW | NaN | NaN | 0.103744 |
1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
4 | Andorra | AND | 3.197538 | NaN | 0.079547 |
... | ... | ... | ... | ... | ... |
211 | Kosovo | XKX | 6.804620 | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 36.819337 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 345.209888 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 24.280990 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 15.495514 | 398.00 | 15.420964 |
216 rows × 5 columns
Again we see the rows and the columns. On the last page we concentrated on the
DataFrame columns, the .columns attribute, and the column labels. This time we
are going to concentrate on the rows.
Row labels#
The DataFrame has rows, and, in fact, the rows also have labels.
The row labels may not have been obvious from the display you have seen so far, because the row labels are numbers, so it would have been easy to mistake the row labels for row numbers.
You can see the row labels on the left hand side of the rows in the default display, above. For this default case, the label of the first row is 0, the label of the second row is 1, and so on.
Row labels are not the same as row numbers#
By default, when you load a DataFrame, Pandas will give each row a label that
is a number, and that number is sequential by the order of the row in the file.
That means that, by default, the row label will correspond to the row
position. Here’s the DataFrame we got from the default pd.read_csv load of
the data file:
gender_data
 | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|---|
0 | Aruba | ABW | NaN | NaN | 0.103744 |
1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
4 | Andorra | AND | 3.197538 | NaN | 0.079547 |
... | ... | ... | ... | ... | ... |
211 | Kosovo | XKX | 6.804620 | NaN | 1.813820 |
212 | Yemen, Rep. | YEM | 36.819337 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 345.209888 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 24.280990 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 15.495514 | 398.00 | 15.420964 |
216 rows × 5 columns
Sure enough, the row at position 0 has label 0, the row at position 1 has label 1, and so on, all the way up to label 215 for the last (216th) row. The row labels happen to correspond to the row positions.
But in general, the row labels have no necessary relationship to the row positions. In fact, the row labels need not even be numbers.
You can see that the row labels need not correspond to position when you drop some rows, as we did in the data frame introduction page.
Let’s drop the missing values again:
gender_data_no_na = gender_data.dropna()
gender_data_no_na
 | country_name | country_code | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|---|
1 | Afghanistan | AFG | 19.961015 | 444.00 | 32.715838 |
2 | Angola | AGO | 111.936542 | 501.25 | 26.937545 |
3 | Albania | ALB | 12.327586 | 29.25 | 2.888280 |
5 | United Arab Emirates | ARE | 375.027082 | 6.00 | 9.080299 |
6 | Argentina | ARG | 550.980968 | 53.75 | 42.976675 |
... | ... | ... | ... | ... | ... |
210 | Samoa | WSM | 0.799887 | 54.75 | 0.192225 |
212 | Yemen, Rep. | YEM | 36.819337 | 399.75 | 26.246608 |
213 | South Africa | ZAF | 345.209888 | 143.75 | 54.177209 |
214 | Zambia | ZMB | 24.280990 | 233.75 | 15.633220 |
215 | Zimbabwe | ZWE | 15.495514 | 398.00 | 15.420964 |
179 rows × 5 columns
We have now dropped the rows with labels 0, 4 and 8, among others. The first row in the data frame (the row at position 0) now has label 1, the second row (position 1) has label 2, and the fourth (position 3) has label 5. The row labels no longer correspond to the row positions.
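To see the same effect in miniature, here is a sketch that uses a small hand-built frame rather than the data file. The names `small` and `small_no_na` are invented for this illustration; the values are copied from the first five rows displayed above.

```python
import pandas as pd

# A small frame built by hand from the first five rows shown above.
# (Hypothetical stand-in for the full data file.)
small = pd.DataFrame({
    'country_name': ['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra'],
    'gdp_us_billion': [float('nan'), 19.961015, 111.936542, 12.327586, 3.197538],
    'mat_mort_ratio': [float('nan'), 444.00, 501.25, 29.25, float('nan')],
})

# Drop rows with any missing value; surviving rows keep their original labels.
small_no_na = small.dropna()

# The labels of the remaining rows.
labels = list(small_no_na.index)
# The positions of the remaining rows always count up from 0.
positions = list(range(len(small_no_na)))

print('Labels:   ', labels)
print('Positions:', positions)
```

Aruba (label 0) and Andorra (label 4) have missing values, so the remaining labels are 1, 2 and 3, while the positions are 0, 1 and 2.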
Row labels are in the .index attribute#
In the introduction to data frames, we found that Pandas stores the column
labels in the .columns attribute of the DataFrame.
You can get the labels for the rows with the .index attribute:
gender_data_no_na.index
Index([ 1, 2, 3, 5, 6, 7, 10, 11, 12, 13,
...
203, 204, 205, 208, 209, 210, 212, 213, 214, 215],
dtype='int64', length=179)
Notice that Pandas stores the row labels in an Index-type object (value),
just as it stored the column labels in an Index object (value).
Just as for the .columns attribute, you can get the row labels as a list by
applying list to the .index attribute:
# Make the row labels into a list.
row_labels = list(gender_data_no_na.index)
# Show the first 10 labels
row_labels[:10]
[1, 2, 3, 5, 6, 7, 10, 11, 12, 13]
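As a quick check that row labels and column labels both live in Index objects, the following sketch uses a tiny hand-built frame. The name `small` is invented for this example and is not part of the page's data; the values come from rows shown earlier.

```python
import pandas as pd

# A tiny hand-built frame (hypothetical stand-in for the real data).
small = pd.DataFrame({
    'country_code': ['AFG', 'AGO', 'ALB'],
    'population': [32.715838, 26.937545, 2.888280],
})

# Both the row labels and the column labels are Index-type objects.
rows_are_index = isinstance(small.index, pd.Index)
cols_are_index = isinstance(small.columns, pd.Index)

# Applying list gives ordinary Python lists of the labels.
row_labels = list(small.index)
col_labels = list(small.columns)

print(rows_are_index, cols_are_index)
print(row_labels)
print(col_labels)
```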
Changing the row labels#
As we said above, by default, the row labels are numbers, but they need not be numbers — they could be strings.
Sometimes it is useful to change the default numeric row labels to something more memorable to indicate the nature of the row.
For example, in our case, our rows correspond to countries. We might want the
row label to remind us which country the row refers to. There is a column,
country_code, with a unique code for the country. We can use those values to
replace the default numeric labels, using the .set_index method.
Remember, a method is a function attached to a value. (Technically, it is
an attribute where the value of the attribute is a function.) We use .set_index
by passing the name of the column whose values we want to use as row labels.
labeled_gdata = gender_data_no_na.set_index('country_code')
labeled_gdata
country_code | country_name | gdp_us_billion | mat_mort_ratio | population |
---|---|---|---|---|
AFG | Afghanistan | 19.961015 | 444.00 | 32.715838 |
AGO | Angola | 111.936542 | 501.25 | 26.937545 |
ALB | Albania | 12.327586 | 29.25 | 2.888280 |
ARE | United Arab Emirates | 375.027082 | 6.00 | 9.080299 |
ARG | Argentina | 550.980968 | 53.75 | 42.976675 |
... | ... | ... | ... | ... |
WSM | Samoa | 0.799887 | 54.75 | 0.192225 |
YEM | Yemen, Rep. | 36.819337 | 399.75 | 26.246608 |
ZAF | South Africa | 345.209888 | 143.75 | 54.177209 |
ZMB | Zambia | 24.280990 | 233.75 | 15.633220 |
ZWE | Zimbabwe | 15.495514 | 398.00 | 15.420964 |
179 rows × 4 columns
Notice the new values to the left of the rows are the corresponding country
codes, instead of the numeric labels we had before. Notice too that Pandas
pulled the country_code column out of the DataFrame, to avoid duplication.
You can tell Pandas not to pull the column out of the DataFrame by passing
drop=False as an argument to set_index.
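The difference between the default behavior and drop=False can be sketched on a small hand-built frame. The names `small`, `moved` and `kept` are invented for this illustration; the values are taken from rows shown earlier.

```python
import pandas as pd

# A small hand-built frame (hypothetical stand-in for the real data).
small = pd.DataFrame({
    'country_name': ['Afghanistan', 'Angola', 'Albania'],
    'country_code': ['AFG', 'AGO', 'ALB'],
    'population': [32.715838, 26.937545, 2.888280],
})

# Default: country_code leaves the columns and becomes the row labels.
moved = small.set_index('country_code')

# drop=False: country_code stays as a column and also supplies the row labels.
kept = small.set_index('country_code', drop=False)

print(list(moved.columns))
print(list(kept.columns))
print(list(kept.index))
```

With the default, `moved` has only the country_name and population columns. With drop=False, `kept` still has all three columns, and the country codes appear both as a column and as the row labels.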