Using FuzzyWuzzy for Fuzzy String Matching in Python
This tutorial explains how to use the Python FuzzyWuzzy library, which relies on Levenshtein distance, to perform fuzzy string matching for tasks such as normalizing province or company names, and provides complete code examples and practical applications.
In daily development, matching a field that may contain slight variations (e.g., different ways of writing a province name) often requires many custom rules.
This article introduces the FuzzyWuzzy library, a simple yet powerful tool that uses the Levenshtein (Edit) Distance algorithm to calculate similarity between two strings.
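To make the metric concrete, the following is a minimal pure-Python sketch of Levenshtein distance using the classic dynamic-programming recurrence. It is an illustration only, not FuzzyWuzzy's internal implementation, and the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # prev[j] is the distance between the prefix of `a` processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


print(levenshtein("河南", "河南省"))      # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means more similar strings; similarity scores such as those in FuzzyWuzzy rescale this kind of measure to a 0 to 100 range.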
Installation can be done in an Anaconda Jupyter environment with:
<code>pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy</code>
The fuzz module provides four main functions:
Simple ratio (fuzz.ratio)
Partial ratio (fuzz.partial_ratio)
Token sort ratio (fuzz.token_sort_ratio)
Token set ratio (fuzz.token_set_ratio)
For example, fuzz.ratio("河南省", "河南省") returns 100, while fuzz.ratio("河南", "河南省") returns 80. Partial ratio scores the best-matching substring, so a short string contained in a longer one can still score 100. Token sort ratio ignores word order, and token set ratio additionally ignores duplicate tokens.
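The differences between the four scores can be sketched with the standard library alone. The helpers below are simplified stand-ins built on `difflib.SequenceMatcher`, not the library's own code: the names `simple_ratio`, `partial_ratio`, `token_sort_ratio`, and `token_set_ratio` are defined here for illustration, and the preprocessing that FuzzyWuzzy applies (such as lowercasing) is left out, so scores may differ from the library's on some inputs.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    # Similarity of the two whole strings, scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    # Slide the shorter string over the longer one and keep the best window score.
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    return max(simple_ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))


def token_sort_ratio(a: str, b: str) -> int:
    # Sort whitespace-separated tokens first, so word order no longer matters.
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio(a: str, b: str) -> int:
    # Compare the shared tokens against each side's full token set,
    # so repeated tokens no longer matter.
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(simple_ratio(common, full_a),
               simple_ratio(common, full_b),
               simple_ratio(full_a, full_b))


print(simple_ratio("河南", "河南省"))    # 80
print(partial_ratio("河南", "河南省"))   # 100
print(token_sort_ratio("new york mets", "mets new york"))          # 100
print(token_set_ratio("new york mets", "new york new york mets"))  # 100
```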
The process module is useful when the set of possible matches is limited; it returns the best‑matching strings and their scores. process.extract returns a list of tuples, whereas process.extractOne returns the single best match.
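The shape of those return values can be illustrated without the library. The sketch below mimics the (match, score) tuples with two hypothetical helpers, `extract_sketch` and `extract_one_sketch`, scored with `difflib.SequenceMatcher`. This is an assumption made for the example: the real process.extract uses its own default scorer, so its scores will generally differ from the ones shown here.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    # Stand-in scorer: whole-string similarity scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_sketch(query, choices, limit=5):
    # Score every candidate, then keep the `limit` highest-scoring ones.
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices):
    # Single best (match, score) tuple, or None when there are no candidates.
    matches = extract_sketch(query, choices, limit=1)
    return matches[0] if matches else None


provinces = ["河南省", "河北省", "北京市"]
print(extract_sketch("河南", provinces, limit=2))  # [('河南省', 80), ('河北省', 40)]
print(extract_one_sketch("河南", provinces))       # ('河南省', 80)
```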
Two practical examples are presented:
1. Company name fuzzy matching
A function fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2) is defined to link two DataFrames on fuzzy‑matched company names. For each value in df_1[key1], it retrieves the top candidates from df_2[key2] with process.extract, discards those scoring below the similarity threshold, and writes the best remaining match (or an empty string if none qualifies) to a new matches column in df_1.
<code>def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    # Candidate strings from the right-hand DataFrame
    s = df_2[key2].tolist()
    # For each left-hand value, the top `limit` (match, score) tuples
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # Keep the highest-scoring match that meets the threshold, else ''
    df_1['matches'] = m.apply(
        lambda x: next((i[0] for i in x if i[1] >= threshold), '')
    )
    return df_1</code>
2. Province name fuzzy matching
The same function can be applied to normalize province names that appear as abbreviations (e.g., "北京") or full forms (e.g., "北京市").
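As a rough illustration of the idea without pandas or FuzzyWuzzy, the sketch below maps raw province names to a canonical list using `difflib.SequenceMatcher` as a stand-in scorer. The helper `normalize_province`, the three-entry `CANONICAL` list, and the threshold of 75 are all assumptions made for this example. In particular, the threshold is lower than the 90 used in the article's function because this stand-in scores "北京" against "北京市" at 80.

```python
from difflib import SequenceMatcher

# Illustrative subset only, not a complete list of provinces.
CANONICAL = ["北京市", "河南省", "河北省"]


def normalize_province(name, canonical=CANONICAL, threshold=75):
    """Return the closest canonical name, or '' if nothing scores high enough."""
    best, best_score = "", 0
    for candidate in canonical:
        s = round(100 * SequenceMatcher(None, name, candidate).ratio())
        if s > best_score:
            best, best_score = candidate, s
    return best if best_score >= threshold else ""


raw = ["北京", "河南", "河北省", "上海"]
print([normalize_province(n) for n in raw])  # ['北京市', '河南省', '河北省', '']
```

Returning an empty string for unmatched names mirrors the fallback in fuzzy_merge and makes rows that still need manual review easy to filter.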
The complete function definition, with the required imports, is:
<code># fuzzy matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left DataFrame
    :param df_2: right DataFrame
    :param key1: column in df_1 to match
    :param key2: column in df_2 to match
    :param threshold: minimum similarity score to accept
    :param limit: number of top matches to consider
    :return: df_1 with a new 'matches' column containing the best fuzzy match
    """
    # Candidate strings from the right-hand DataFrame
    s = df_2[key2].tolist()
    # For each left-hand value, the top `limit` (match, score) tuples
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # Keep the highest-scoring match that meets the threshold, else ''
    df_1['matches'] = m.apply(
        lambda x: next((i[0] for i in x if i[1] >= threshold), '')
    )
    return df_1


# example usage ('公司名称' is the company-name column in both DataFrames)
# df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
# df</code>
Encapsulating fuzzy‑matching logic in a reusable function like this simplifies future data‑cleaning tasks.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.