Merging#
Show code cell content
# Don't change this cell; just run it.
import numpy as np
import pandas as pd
# Safe setting for Pandas. Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
# Make the plots look more fancy.
plt.style.use('fivethirtyeight')
Note
This page has content from the Joining_Tables_by_Columns notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.
Often, data about the same individuals is maintained in more than one table. For example, one university office might have data about each student’s time to completion of degree, while another has data about the student’s tuition and financial aid.
To understand the students’ experience, it may be helpful to put the two datasets together. If the data are in two tables, each with one row per student, then we would want to put the columns together, making sure to match the rows so that each student’s information remains on a single row.
Let us do this in the context of a simple example, and then use the method with a larger dataset.
Suppose we have a data frame for different flavors of ice cream. Each flavor of ice cream comes with a rating that is in a separate table.
cones = pd.DataFrame()
cones['Flavor'] = ['strawberry', 'vanilla', 'chocolate', 'strawberry',
'chocolate']
cones['Price'] = [3.55, 4.75, 6.55, 5.25, 5.75]
cones
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | vanilla | 4.75 |
2 | chocolate | 6.55 |
3 | strawberry | 5.25 |
4 | chocolate | 5.75 |
ratings = pd.DataFrame()
ratings['Flavor'] = ['strawberry', 'chocolate', 'vanilla']
ratings['Stars'] = [2.5, 3.5, 4]
ratings
Flavor | Stars | |
---|---|---|
0 | strawberry | 2.5 |
1 | chocolate | 3.5 |
2 | vanilla | 4.0 |
Each of the tables has a column that contains ice cream flavors. In both
cases, the column has the name Flavor
. The entries in these columns can be
used to link the two tables.
The method merge
creates a new table in which each cone in the cones
table
is augmented with the Stars information in the ratings
table. For each cone
in cones
, merge
finds a row in ratings
whose Flavor
matches the cone’s
Flavor
. We have to tell merge
to use the Flavor
column for matching,
using the on
keyword argument.
rated = cones.merge(ratings, on='Flavor')
rated
Flavor | Price | Stars | |
---|---|---|---|
0 | strawberry | 3.55 | 2.5 |
1 | vanilla | 4.75 | 4.0 |
2 | chocolate | 6.55 | 3.5 |
3 | strawberry | 5.25 | 2.5 |
4 | chocolate | 5.75 | 3.5 |
Each cone now has not only its price but also the rating of its flavor.
In general, a call to merge
that augments a table (say table1
) with
information from another table (say table2
) looks like this:
table1.merge(table2, on=column_for_merging)
In the case above, the matching columns have the name column name: Flavor
.
This need not be so. For example, let us rename the Flavor
column in ratings
to Kind
:
# Rename the 'Flavor' column to 'Kind'
ratings_renamed = ratings
ratings_renamed.columns = ['Kind', 'Stars']
ratings_renamed
Kind | Stars | |
---|---|---|
0 | strawberry | 2.5 |
1 | chocolate | 3.5 |
2 | vanilla | 4.0 |
Now we have to tell merge
the name of the column to merge on, for each data frame. The first data frame (cones
in our case) is called the left data frame. The second (ratings
in our case) is called the right data frame. Now the columns have different names in the left and right data frame, we have to use the left_on
and right_on
keywords.
rated_again = cones.merge(ratings_renamed, left_on='Flavor', right_on='Kind')
rated_again
Flavor | Price | Kind | Stars | |
---|---|---|---|---|
0 | strawberry | 3.55 | strawberry | 2.5 |
1 | vanilla | 4.75 | vanilla | 4.0 |
2 | chocolate | 6.55 | chocolate | 3.5 |
3 | strawberry | 5.25 | strawberry | 2.5 |
4 | chocolate | 5.75 | chocolate | 3.5 |
Here is a more general skeleton of a merge
between table1
and table2
, where the corresponding columns may have different names:
table1.merge(table2, left_on=table1_column, right_on=table2_column)
Now that we have done the merge, the new table rated
(or rated_again
)
allows us to work out the price per star, which you can think of as an
informal measure of value. Low values are good – they mean that you are paying
less for each rating star.
rated['$/Star'] = rated['Price'] / rated['Stars']
rated.sort_values('$/Star').head(3)
Flavor | Price | Stars | $/Star | |
---|---|---|---|---|
1 | vanilla | 4.75 | 4.0 | 1.187500 |
0 | strawberry | 3.55 | 2.5 | 1.420000 |
4 | chocolate | 5.75 | 3.5 | 1.642857 |
Though strawberry has the lowest rating among the three flavors, the less expensive strawberry cone does well on this measure because it doesn’t cost a lot per star.
Suppose there is a table of professional reviews of some ice cream cones, and we have found the average review for each flavor.
reviews = pd.DataFrame()
reviews['Flavor'] = ['vanilla', 'chocolate', 'vanilla', 'chocolate']
reviews['ProfStars'] = [5, 3, 5, 4]
reviews
Flavor | ProfStars | |
---|---|---|
0 | vanilla | 5 |
1 | chocolate | 3 |
2 | vanilla | 5 |
3 | chocolate | 4 |
Remember group by:
average_review = reviews.groupby('Flavor').mean()
average_review
ProfStars | |
---|---|
Flavor | |
chocolate | 3.5 |
vanilla | 5.0 |
Notice that the column that we grouped by — Flavor
— has become the Index (row labels).
We can merge cones
and average_review
by providing the labels of the
columns by which to merge. As you will see, Pandas treats the data frame Index
as being a column, for this purpose.
First we remind ourselves of the contents of cones
:
cones
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | vanilla | 4.75 |
2 | chocolate | 6.55 |
3 | strawberry | 5.25 |
4 | chocolate | 5.75 |
Here is the result of the merge:
cones.merge(average_review, left_on='Flavor', right_on='Flavor')
Flavor | Price | ProfStars | |
---|---|---|---|
0 | vanilla | 4.75 | 5.0 |
1 | chocolate | 6.55 | 3.5 |
2 | chocolate | 5.75 | 3.5 |
Of course in this case the “columns” have the same name, and we can do the same thing with:
cones.merge(average_review, on='Flavor')
Flavor | Price | ProfStars | |
---|---|---|---|
0 | vanilla | 4.75 | 5.0 |
1 | chocolate | 6.55 | 3.5 |
2 | chocolate | 5.75 | 3.5 |
Notice that Flavor
is the Index (row labels) for average_review
, but Pandas
allows this, because it sees the Index name is Flavor
, and treats it as a
column.
For this reason, merge can also merge with a Series, because the Series can
have a name. At the moment average_review
is a Dataframe with one column:
ProfStars
. We can pull out this column as a Series. You may remember that
the Series also gets the Index, and a name, from the column name:
avg_rev_as_series = average_review['ProfStars']
avg_rev_as_series
Flavor
chocolate 3.5
vanilla 5.0
Name: ProfStars, dtype: float64
Because this Series has a name: Flavor
, and that is the name of its Index, we can do the same merge with this Series as we did with the Dataframe above:
cones.merge(avg_rev_as_series, on='Flavor')
Flavor | Price | ProfStars | |
---|---|---|---|
0 | vanilla | 4.75 | 5.0 |
1 | chocolate | 6.55 | 3.5 |
2 | chocolate | 5.75 | 3.5 |
What remains?#
Notice that, after our merge, the strawberry cones have disappeared. Merge is
pursuing a particular strategy here, and that is to look for labels that match
in the matching columns. None of the reviews are for strawberry cones, so there
is nothing to which the strawberry
rows can be merged. This might be what you
want, or it might not be — that depends on the analysis we are trying to
perform with the merged table. If it is not what you want, you may want to ask
merge to use a different strategy.
Merge strategies#
Let us reflect further on the choice that merge made above, when it dropped the
row for strawberry cones. As you saw above, by default, merge
looks for
labels that are present in both of the matching columns. This is the default
merge strategy, called an inner merge. We could also call this an
intersection merge strategy.
For this default inner merge strategy, merge
first found all the flavor
labels in cones['Flavor']
:
# Different values in cones['Flavor']
cone_flavors = cones['Flavor'].unique()
cone_flavors
array(['strawberry', 'vanilla', 'chocolate'], dtype=object)
Then it found all the flavors in average_review
'Flavor'
“column” (in this case it found the index):
# Different values in average_reviews 'Flavor' - here, the Index
review_flavors = average_review.index.unique()
review_flavors
Index(['chocolate', 'vanilla'], dtype='object', name='Flavor')
Next merge
found all the Flavor
values that are present in both data frames. We can call this the intersection of the two sets of values. Python has a set
type to work out intersections and other set operations.
flavors_in_both = set(cone_flavors).intersection(review_flavors)
flavors_in_both
{'chocolate', 'vanilla'}
Merge then throws away any rows in either table that don’t have one of these
intersection values in the matching columns. This is how we lost the strawberry
row from the cones
table.
This inner or intersection strategy is often useful — that is why it is the default. But we may want to do something different. For example, we may want to keep flavors that don’t have reviews in our merge result, but get a missing value for the review score. One way of doing that is the left merge strategy. Here merge keeps all rows from from the left data frame, but, for each row where there is no corresponding label in the right data frame, it fills the row values from the right data frame with missing values.
# A merge using the "left" strategy
cones.merge(average_review, on='Flavor', how='left')
Flavor | Price | ProfStars | |
---|---|---|---|
0 | strawberry | 3.55 | NaN |
1 | vanilla | 4.75 | 5.0 |
2 | chocolate | 6.55 | 3.5 |
3 | strawberry | 5.25 | NaN |
4 | chocolate | 5.75 | 3.5 |
Merging and column names#
Sometimes we find ourselves merging two data frames that have column names in common.
For example, imagine we had some user reviews from China:
chinese_reviews = pd.DataFrame()
chinese_reviews['Flavor'] = ['vanilla', 'chocolate', 'chocolate']
chinese_reviews['Stars'] = [4.5, 3.5, 4]
chinese_reviews
Flavor | Stars | |
---|---|---|
0 | vanilla | 4.5 |
1 | chocolate | 3.5 |
2 | chocolate | 4.0 |
Now imagine we want to merge this data frame into the rated
data frame. Here’s the rated
data frame:
rated
Flavor | Price | Stars | $/Star | |
---|---|---|---|---|
0 | strawberry | 3.55 | 2.5 | 1.420000 |
1 | vanilla | 4.75 | 4.0 | 1.187500 |
2 | chocolate | 6.55 | 3.5 | 1.871429 |
3 | strawberry | 5.25 | 2.5 | 2.100000 |
4 | chocolate | 5.75 | 3.5 | 1.642857 |
Notice that rated
has a Stars
column, and chinese_reviews
has a Stars
column. Let us see what merge
does in this situation:
china_rated = rated.merge(chinese_reviews, on='Flavor')
china_rated
Flavor | Price | Stars_x | $/Star | Stars_y | |
---|---|---|---|---|---|
0 | vanilla | 4.75 | 4.0 | 1.187500 | 4.5 |
1 | chocolate | 6.55 | 3.5 | 1.871429 | 4.0 |
2 | chocolate | 5.75 | 3.5 | 1.642857 | 3.5 |
3 | chocolate | 6.55 | 3.5 | 1.871429 | 3.5 |
4 | chocolate | 5.75 | 3.5 | 1.642857 | 4.0 |
Pandas detected that both data frames had a column called Stars
and has renamed them accordingly. The column from the left data frame (rated
) has an _x
suffix, to give Stars_x
. The corresponding column from the right data frame has a _y
suffix: Stars_y
.
You can change these suffixes with the suffixes
keyword argument:
rated.merge(chinese_reviews, on='Flavor', suffixes=['_left', '_right'])
Flavor | Price | Stars_left | $/Star | Stars_right | |
---|---|---|---|---|---|
0 | vanilla | 4.75 | 4.0 | 1.187500 | 4.5 |
1 | chocolate | 6.55 | 3.5 | 1.871429 | 4.0 |
2 | chocolate | 5.75 | 3.5 | 1.642857 | 3.5 |
3 | chocolate | 6.55 | 3.5 | 1.871429 | 3.5 |
4 | chocolate | 5.75 | 3.5 | 1.642857 | 4.0 |
And more#
There is much more information about merging in the Python Data Science Handbook merge section.
Note
This page has content from the Joining_Tables_by_Columns notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.