Using FuzzyWuzzy for Fuzzy String Matching in Python
This article introduces the FuzzyWuzzy Python library, explains its underlying Levenshtein distance algorithm, demonstrates how to install it, describes the key functions in the fuzz and process modules, and provides practical examples for matching company names and province fields with complete code snippets.
In daily data processing tasks, fields often contain slight variations (e.g., "Guangxi" vs. "Guangxi Zhuang Autonomous Region"), requiring flexible matching solutions. The FuzzyWuzzy library offers simple fuzzy string matching based on the Levenshtein (Edit) Distance algorithm.
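To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein` is chosen for illustration and is not part of FuzzyWuzzy; the library reports normalized similarity scores from 0 to 100 rather than this raw count of edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("河南", "河南省"))      # 1
```

A small distance relative to the string length means the two strings are close, which is the intuition behind the similarity scores used in the rest of the article.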
Install the library in an Anaconda Jupyter environment with:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy
The fuzz module provides four primary ratio functions:
Simple Ratio: fuzz.ratio("河南省", "河南省") # >>> 100
Partial Ratio: fuzz.partial_ratio("河南", "河南省") # >>> 100
Token Sort Ratio (ignores word order): fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏") # >>> 100
Token Set Ratio (removes duplicate tokens): fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏") # >>> 100
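The effect of token sorting can be illustrated without the library. The sketch below approximates a simple ratio with the standard library's `difflib.SequenceMatcher` and then applies it to whitespace-separated tokens that have been sorted first. The helper names `simple_ratio` and `token_sort_ratio_sketch` are hypothetical, and the scores are only an approximation of what `fuzz` returns, since FuzzyWuzzy also normalizes its input before scoring.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (difflib-based approximation)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort the whitespace-separated tokens of each string, then compare."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return simple_ratio(sorted_a, sorted_b)


left, right = "西藏 自治区", "自治区 西藏"
print(simple_ratio(left, right))             # below 100: word order differs
print(token_sort_ratio_sketch(left, right))  # 100: same tokens once sorted
```

Sorting tokens before comparing is what makes the score insensitive to word order, at the cost of ignoring order even when it matters.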
The process module matches a query string against a finite list of candidate strings, returning the best matches together with their similarity scores:
Extract multiple candidates:
choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2) # >>> [('郑州市', 90), ('河南省', 0)]
Extract the single best match:
process.extractOne("郑州", choices) # >>> ('郑州市', 90)
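Conceptually, this kind of extraction scores every candidate against the query and keeps the highest-scoring ones. The following is a simplified stand-in built on `difflib`, not the library's implementation: `extract_sketch` and `extract_one_sketch` are hypothetical names, and the scores differ from the ones above because `process.extract` uses `fuzz.WRatio` as its default scorer.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """0-100 similarity, approximated with difflib (an assumption, not WRatio)."""
    return round(100 * SequenceMatcher(None, query, choice).ratio())


def extract_sketch(query, choices, limit=2):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    top = extract_sketch(query, choices, limit=1)
    return top[0] if top else None


choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract_sketch("郑州", choices, limit=2))
print(extract_one_sketch("郑州", choices))
```

Because every candidate is scored for every query, the cost grows with the product of the two list sizes, which is worth keeping in mind before applying the approach to large tables.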
Two practical applications are demonstrated:
Fuzzy matching of company name fields, where a custom function fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2) merges two DataFrames based on fuzzy similarity, filters results by a threshold, and returns a DataFrame with a new matches column.
Fuzzy matching of province fields, applying the same function to standardize abbreviated region names to their full forms.
The full implementation of the fuzzy_merge function is:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left table to join
    :param df_2: right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity (based on Levenshtein) to accept a match
    :param limit: number of top matches to consider
    :return: DataFrame with matches column
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()
    # Top `limit` (candidate, score) pairs for each value in the left table
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep the first candidate scoring at or above the threshold, else ''
    m2 = df_1['matches'].apply(lambda x: next((i[0] for i in x if i[1] >= threshold), ''))
    df_1['matches'] = m2
    return df_1  # note: df_1 is modified in place

# data and company are pandas DataFrames that each have a '公司名称' (company name) column
df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
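The threshold step inside fuzzy_merge is the densest part of the function, so it may help to read it as a standalone helper. The sketch below is an equivalent rewrite of that step under the assumption that candidates arrive as (choice, score) pairs sorted best first, as in the process.extract output shown earlier; the name `first_match_above` is hypothetical.

```python
from typing import List, Tuple


def first_match_above(candidates: List[Tuple[str, int]], threshold: int = 90) -> str:
    """Return the first candidate whose score meets the threshold, or '' if none does."""
    for choice, score in candidates:
        if score >= threshold:
            return choice
    return ""


# Candidate list taken from the process.extract example earlier in the article.
candidates = [("郑州市", 90), ("河南省", 0)]
print(first_match_above(candidates, threshold=90))  # 郑州市
print(first_match_above(candidates, threshold=95))  # empty string: nothing qualifies
```

Returning an empty string for unmatched rows keeps the matches column a plain string column, so unmatched rows are easy to filter out afterwards.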
This tutorial provides a concise yet complete guide to applying fuzzy string matching in Python for data cleaning and integration tasks.
Python Programming Learning Circle