Often you may want to join together two datasets in pandas based on imperfectly matching strings. This is called fuzzy matching.
The easiest way to perform fuzzy matching in pandas is to use the get_close_matches() function from the difflib package.
The following example shows how to use this function in practice.
Example: Fuzzy Matching in Pandas
Suppose we have the following two pandas DataFrames that contain information about various basketball teams:
import pandas as pd
#create two DataFrames
df1 = pd.DataFrame({'team': ['Mavericks', 'Nets', 'Warriors', 'Heat', 'Lakers'],
'points': [99, 90, 104, 117, 100]})
df2 = pd.DataFrame({'team': ['Mavricks', 'Warrors', 'Heat', 'Netts', 'Lakes'],
'assists': [22, 29, 17, 40, 32]})
#view DataFrames
print(df1)
team points
0 Mavericks 99
1 Nets 90
2 Warriors 104
3 Heat 117
4 Lakers 100
print(df2)
team assists
0 Mavricks 22
1 Warrors 29
2 Heat 17
3 Netts 40
4 Lakes 32
Now suppose that we would like to merge the two DataFrames based on the team column.
Since the team names are slightly different between the two DataFrames, we must use fuzzy matching to find which team names most closely match.
We can use the get_close_matches() function from the difflib package to do so:
import difflib
#create duplicate column to retain team name from df2
df2['team_match'] = df2['team']
#convert team name in df2 to team name it most closely matches in df1
df2['team'] = df2['team'].apply(lambda x: difflib.get_close_matches(x, df1['team'])[0])
#merge the DataFrames into one
df3 = df1.merge(df2)
#view final DataFrame
print(df3)
team points assists team_match
0 Mavericks 99 22 Mavricks
1 Nets 90 40 Netts
2 Warriors 104 29 Warrors
3 Heat 117 17 Heat
4 Lakers 100 32 Lakes
The result is one data frame that contains each of the five team names from the first DataFrame along with the team that most closely matches from the second DataFrame.
The team_match column shows the team name from the second DataFrame that most closely matched the team name from the first DataFrame.
Note #1: By default, get_close_matches() returns the three closest matches. However, by using the [0] at the end of the lambda function we were able to only return the closest team name match.
Note #2: You can find the complete documentation for the get_close_matches() function here.
Additional Resources
The following tutorials explain how to perform other common tasks in pandas:
How to Merge Pandas DataFrames on Multiple Columns
How to Merge Two Pandas DataFrames on Index
Pandas Join vs. Merge: What’s the Difference?