Using FuzzyWuzzy for Fuzzy String Matching in Python
This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions such as Ratio, Partial Ratio, Token Sort Ratio and Token Set Ratio, demonstrates how to install it, and provides practical code examples for merging company and province fields with fuzzy matching thresholds.
In everyday data processing, fields often contain slight variations of the same value (e.g., "广西" vs "广西壮族自治区", the short and full names of one region), making exact matching cumbersome.
FuzzyWuzzy is a simple Python package that uses the Levenshtein Distance algorithm to compute similarity scores between strings.
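To make the underlying idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance itself: the smallest number of single-character insertions, deletions, or substitutions needed to turn one string into another. The function name `levenshtein` is chosen for illustration only; this is not FuzzyWuzzy's own code, and the library applies its own normalization to turn a distance into a 0 to 100 score.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute each cost 1)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("河南", "河南省"))       # 1
```

A small distance relative to the string lengths means the strings are similar, which is what the library's scores express on a 0 to 100 scale.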
Install the library in an Anaconda Jupyter environment with:
<code>pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy</code>
The fuzz module provides four main functions:
Ratio: basic similarity of the two full strings (e.g., fuzz.ratio("河南省", "河南省") # 100).
Partial Ratio: similarity against the best-matching substring, so a string fully contained in the other scores 100 (e.g., fuzz.partial_ratio("河南", "河南省") # 100, whereas fuzz.ratio("河南", "河南省") # 80).
Token Sort Ratio: ignores word order and punctuation (e.g., fuzz.token_sort_ratio("西藏 区域", "区域 西藏") # 100).
Token Set Ratio: removes duplicate tokens before comparison (e.g., fuzz.token_set_ratio("西藏 西藏 区域", "区域 西藏") # 100).
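The differences between the four scorers are easier to see in code. The sketch below approximates each one with the standard library's `difflib.SequenceMatcher`; the `*_sketch` function names are hypothetical, and the logic is a simplified stand-in rather than FuzzyWuzzy's actual implementation (which also lowercases, strips punctuation, and aligns substrings differently), so scores on other inputs may not match the library.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter string against any same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        ratio_sketch(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


def token_sort_ratio_sketch(a, b):
    """Sort whitespace-separated tokens first, so word order no longer matters."""
    return ratio_sketch(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio_sketch(a, b):
    """Compare the shared tokens with each side's full token set, ignoring duplicates."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    common = " ".join(sorted(tokens_a & tokens_b))
    full_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        ratio_sketch(common, full_a),
        ratio_sketch(common, full_b),
        ratio_sketch(full_a, full_b),
    )


print(ratio_sketch("河南", "河南省"))                      # 80
print(partial_ratio_sketch("河南", "河南省"))              # 100
print(token_sort_ratio_sketch("西藏 区域", "区域 西藏"))    # 100
print(token_set_ratio_sketch("西藏 西藏 区域", "区域 西藏"))  # 100
```

The pattern to notice is that each later scorer normalizes away one more kind of variation (extra characters, word order, repeated words) before measuring similarity.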
The process module is useful when the set of possible matches is limited; it returns the best‑matching strings and their scores.
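Conceptually, this kind of extraction scores every candidate against the query and keeps the highest-scoring ones. The sketch below shows that ranking step with the standard library only; `score`, `extract_sketch`, and `extract_one_sketch` are hypothetical names, and the plain `SequenceMatcher` scorer used here is an assumption, so its scores will generally differ from those the library reports with its default weighted scorer.

```python
import heapq
from difflib import SequenceMatcher


def score(query, choice):
    """Similarity of query and choice on a 0-100 scale (plain ratio, for illustration)."""
    return round(100 * SequenceMatcher(None, query, choice).ratio())


def extract_sketch(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, highest score first."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None when there are no choices."""
    best = extract_sketch(query, choices, limit=1)
    return best[0] if best else None


cities = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract_one_sketch("郑州", cities))  # ('郑州市', 80)
```

Because every choice is scored on each call, this approach suits a small, fixed list of candidates such as province or city names.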
Example of extracting the top two matches:
<code>from fuzzywuzzy import process

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
# Return the two highest-scoring choices as (choice, score) pairs
process.extract("郑州", choices, limit=2)
# [('郑州市', 90), ('河南省', 0)]
</code>
For a single best match use process.extractOne:
<code>process.extractOne("郑州", choices)
# ('郑州市', 90)
</code>
Two practical applications are demonstrated:
Fuzzy matching of company names between two dataframes.
Fuzzy matching of province/city names.
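The core merging function fuzzy_merge encapsulates the workflow: for each value in the left DataFrame's key column it extracts up to limit candidates (default 2) from the right DataFrame's key column, keeps the best candidate whose score reaches threshold (default 90), and stores it in a 'matches' column added to the left DataFrame, using an empty string when no candidate qualifies.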
<code>from fuzzywuzzy import fuzz, process


# Fuzzy merge function
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left DataFrame
    :param df_2: right DataFrame
    :param key1: column in df_1 to match
    :param key2: column in df_2 to match
    :param threshold: minimum similarity score to accept
    :param limit: number of top matches to consider
    :return: df_1 with a 'matches' column containing the best match ('' if none)
    """
    # Candidate strings to match against
    s = df_2[key2].tolist()
    # For each value in key1, collect up to `limit` (candidate, score) pairs, best first
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # Keep the first candidate whose score reaches the threshold, else ''
    df_1['matches'] = m.apply(
        lambda x: next((name for name, score in x if score >= threshold), '')
    )
    return df_1


# Example usage (data and company are DataFrames; '公司名称' is the company-name column)
# df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
# df
</code>
By wrapping this function in a reusable module, developers can quickly apply fuzzy matching to various data‑cleaning tasks without rewriting code.
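For readers who want to try the threshold-and-limit idea without installing pandas or FuzzyWuzzy, the following sketch applies the same workflow to plain lists using the standard library's `difflib.get_close_matches`. This is a swapped-in technique, not the article's method: `fuzzy_lookup` is a hypothetical helper, and `cutoff` works on a 0 to 1 scale with a plain similarity ratio, so thresholds are not directly comparable to the 0 to 100 scores above.

```python
from difflib import get_close_matches


def fuzzy_lookup(names, candidates, threshold=0.6, limit=2):
    """Map each name to its best candidate, or '' when nothing reaches the threshold."""
    result = {}
    for name in names:
        # Up to `limit` candidates scoring at least `threshold`, best first
        hits = get_close_matches(name, candidates, n=limit, cutoff=threshold)
        result[name] = hits[0] if hits else ""
    return result


provinces = ["河南省", "湖北省", "广西壮族自治区"]
print(fuzzy_lookup(["河南", "湖北", "北京"], provinces))
# {'河南': '河南省', '湖北': '湖北省', '北京': ''}
```

As with fuzzy_merge, the empty string marks rows that need manual review, and lowering the threshold trades fewer unmatched rows for a higher risk of wrong matches.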