Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its underlying Levenshtein distance algorithm, demonstrates how to install it, describes the key functions in the fuzz and process modules, and provides practical examples for matching company names and province fields with complete code snippets.

Python Programming Learning Circle

Mar 22, 2024

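In everyday data processing, the same value is often recorded in slightly different forms (e.g., "Guangxi" vs. "Guangxi Zhuang Autonomous Region"), so exact matching fails and a more tolerant comparison is needed. The FuzzyWuzzy library offers simple fuzzy string matching based on the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.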
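To make the edit-distance idea concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. The function name `levenshtein` is chosen for illustration and is not part of FuzzyWuzzy, whose functions report 0-100 similarity scores rather than this raw distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))    # 3
print(levenshtein("广西", "广西壮族自治区"))  # 5 (five characters must be inserted)
```

A small distance relative to the string lengths indicates near-duplicates, which is the intuition behind the similarity scores used in the rest of this article.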

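Install the library in an Anaconda Jupyter environment with the command below. The -i option selects the Tsinghua PyPI mirror and can be omitted to use the default index: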

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

The fuzz module (imported with from fuzzywuzzy import fuzz) provides four primary ratio functions, each returning a similarity score from 0 to 100:

Simple Ratio:

fuzz.ratio("河南省", "河南省")  # >>> 100

Partial Ratio:

fuzz.partial_ratio("河南", "河南省")  # >>> 100

Token Sort Ratio (ignores word order):

fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏")  # >>> 100

Token Set Ratio (ignores duplicate tokens as well as word order):

fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏")  # >>> 100
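The ideas behind the four ratios can be sketched with the standard library alone. The code below is an approximation, not FuzzyWuzzy's implementation: it assumes `difflib.SequenceMatcher` as a stand-in scorer, and the helper names merely mirror the fuzz functions, so scores may differ from the library's on other inputs.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    windows = (longer[i:i + len(shorter)]
               for i in range(len(longer) - len(shorter) + 1))
    return max(simple_ratio(shorter, w) for w in windows)


def token_sort_ratio(a: str, b: str) -> int:
    """Sort whitespace-separated tokens first, so word order stops mattering."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio(a: str, b: str) -> int:
    """Compare shared tokens with shared-plus-leftover tokens; duplicates collapse in the sets."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    with_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(simple_ratio(common, with_a),
               simple_ratio(common, with_b),
               simple_ratio(with_a, with_b))


print(simple_ratio("河南省", "河南省"))                    # 100
print(partial_ratio("河南", "河南省"))                     # 100
print(token_sort_ratio("西藏 自治区", "自治区 西藏"))        # 100
print(token_set_ratio("西藏 西藏 自治区", "自治区 西藏"))    # 100
```

Calling `simple_ratio("西藏 自治区", "自治区 西藏")` on the unsorted strings gives a score below 100, which shows why the token-based variants are useful when word order is unreliable.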

The process module (imported with from fuzzywuzzy import process) matches a query string against a fixed list of candidate strings and returns the best matches with their similarity scores:

Extract multiple candidates:

choices = ["河南省", "郑州市", "湖北省", "武汉市"]

process.extract("郑州", choices, limit=2)  # >>> [('郑州市', 90), ('河南省', 0)]

Extract the single best match:

process.extractOne("郑州", choices)  # >>> ('郑州市', 90)
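Conceptually, both calls score every candidate and keep the best ones. The following standard-library sketch illustrates that pattern under stated assumptions: `score`, `extract`, and `extract_one` are hypothetical stand-ins written for this example, and the `difflib`-based scorer is simpler than the library's default, so the numbers will not match the 90 shown above.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def score(query: str, choice: str) -> int:
    """Stand-in scorer (0-100); FuzzyWuzzy's default scorer is more elaborate."""
    return round(100 * SequenceMatcher(None, query, choice).ratio())


def extract(query, choices, limit=5):
    """Score every choice and return the `limit` best (choice, score) pairs, best first."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return nlargest(limit, scored, key=lambda pair: pair[1])


def extract_one(query, choices, score_cutoff=0):
    """Return the best (choice, score) pair, or None if it falls below the cutoff."""
    best = extract(query, choices, limit=1)
    if best and best[0][1] >= score_cutoff:
        return best[0]
    return None


choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract("郑州", choices, limit=2))              # '郑州市' ranks first
print(extract_one("郑州", choices))                   # the single best pair
print(extract_one("郑州", choices, score_cutoff=95))  # None: nothing scores that high
```

Returning (choice, score) pairs rather than bare strings lets the caller decide how strict to be, which is what the threshold in the merge function later in this article relies on.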

Two practical applications are demonstrated:

Fuzzy matching of company name fields, where a custom function fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2) merges two DataFrames based on fuzzy similarity, filters results by a threshold, and returns a DataFrame with a new matches column.

Fuzzy matching of province fields, applying the same function to standardize abbreviated region names to their full forms.

The full implementation of the fuzzy_merge function is:

# Fuzzy matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left table to join
    :param df_2: right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100) to accept a match
    :param limit: number of top matches to consider
    :return: df_1 with an added 'matches' column (df_1 is modified in place)
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()

    # For each left key, get the top `limit` (candidate, score) pairs, best first
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))

    # Keep the best candidate whose score meets the threshold, else ''
    df_1['matches'] = m.apply(
        lambda x: next((name for name, score in x if score >= threshold), '')
    )
    return df_1


# `data` and `company` must already be loaded as DataFrames,
# each with a '公司名称' (company name) column
df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
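The threshold filter inside fuzzy_merge can be checked in isolation, without pandas or FuzzyWuzzy. The sketch below uses a hypothetical helper name, `pick_match`, and assumes its input has the shape shown for process.extract earlier in this article: a list of (name, score) pairs sorted best first.

```python
def pick_match(candidates, threshold=90):
    """Return the first candidate name whose score meets the threshold, else ''.

    `candidates` is a list of (name, score) pairs sorted best first.
    """
    return next((name for name, score in candidates if score >= threshold), '')


candidates = [('郑州市', 90), ('河南省', 0)]
print(pick_match(candidates, threshold=90))        # 郑州市
print(repr(pick_match(candidates, threshold=95)))  # '' (no candidate qualifies)
print(repr(pick_match([], threshold=90)))          # '' (nothing to choose from)
```

Raising the threshold trades recall for precision: fewer rows get a match, but the matches that remain are less likely to be wrong, so it is worth inspecting the empty rows before lowering it.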

This tutorial provides a concise yet complete guide to applying fuzzy string matching in Python for data cleaning and integration tasks.
