Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions such as Ratio, Partial Ratio, Token Sort Ratio and Token Set Ratio, demonstrates how to install it, and provides practical code examples for merging company and province fields with fuzzy matching thresholds.

Python Programming Learning Circle

Jun 7, 2023

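In everyday data processing, the same value is often written in slightly different ways (e.g., "广西" (Guangxi) vs. "广西壮族自治区" (Guangxi Zhuang Autonomous Region)), so exact matching fails and reconciling the records by hand is tedious.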

FuzzyWuzzy is a simple Python package that uses the Levenshtein Distance algorithm to compute similarity scores between strings.
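To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. The names `levenshtein` and `similarity` are hypothetical, and the 0-100 normalization is just one simple choice. This illustrates the concept only and is not FuzzyWuzzy's own implementation or scoring formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling one-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as the shorter string
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score (assumption): 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("河南", "河南省"))       # 1
```

A single missing character costs one edit, which is why "河南" and "河南省" are considered close even though they are not equal.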

Install the library in an Anaconda Jupyter environment with:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

The fuzz module (from fuzzywuzzy import fuzz) provides four main functions:

Ratio : basic similarity of the two full strings (e.g., fuzz.ratio("河南省", "河南省") # 100; fuzz.ratio("河南", "河南省") # 80).

Partial Ratio : similarity of the best-matching substring (e.g., fuzz.partial_ratio("河南", "河南省") # 100).

Token Sort Ratio : ignores word order and punctuation, e.g.:

fuzz.token_sort_ratio("西藏 区域", "区域 西藏")  # 100

Token Set Ratio : removes duplicate tokens before comparison, e.g.:

fuzz.token_set_ratio("西藏 西藏 区域", "区域 西藏")  # 100
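The four scorers can be approximated with the standard library alone, which helps show what each one does. The sketch below uses `difflib.SequenceMatcher`. The helper names (`simple_ratio`, `partial`, `token_sort`, `token_set`) are hypothetical, and the partial and token-set versions are deliberately simplified, so their scores may differ from FuzzyWuzzy's on other inputs.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of the two full strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    starts = range(len(longer) - len(shorter) + 1)
    return max(simple_ratio(shorter, longer[i:i + len(shorter)]) for i in starts)


def token_sort(a: str, b: str) -> int:
    """Compare after sorting whitespace-separated tokens, so word order is ignored."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set(a: str, b: str) -> int:
    """Simplified (assumption): drop duplicate tokens, sort the rest, then compare."""
    return simple_ratio(" ".join(sorted(set(a.split()))), " ".join(sorted(set(b.split()))))


print(simple_ratio("河南", "河南省"))           # 80
print(partial("河南", "河南省"))                # 100
print(token_sort("西藏 区域", "区域 西藏"))      # 100
print(token_set("西藏 西藏 区域", "区域 西藏"))  # 100
```

The first two calls show the practical difference: the full-string ratio is penalized for the extra character, while the partial score is not, because the shorter string appears intact inside the longer one.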

The process module is useful when the set of possible matches is limited; it returns the best‑matching strings and their scores.

Example of extracting the top two matches:

from fuzzywuzzy import process

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)
# [('郑州市', 90), ('河南省', 0)]

For a single best match use process.extractOne:

process.extractOne("郑州", choices)
# ('郑州市', 90)
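Conceptually, process.extract scores the query against every choice and keeps the highest-scoring ones. A minimal standard-library sketch of that idea follows. `extract_top` and `score` are hypothetical names, and `score` is a plain `difflib` ratio rather than the `WRatio` scorer that `process.extract` uses by default, so the numbers differ from the library output above.

```python
import heapq
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Stand-in scorer (assumption): full-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_top(query, choices, limit=2):
    """Return up to `limit` (choice, score) pairs, highest score first."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["河南省", "郑州市", "湖北省", "武汉市"]
top = extract_top("郑州", choices, limit=2)
print(top[0])  # ('郑州市', 80)
```

Swapping in a different scorer only requires changing `score`, which mirrors how the library lets callers pass their own scoring function.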

Two practical applications are demonstrated:

Fuzzy matching of company names between two dataframes.

Fuzzy matching of province/city names.

The core merging function fuzzy_merge encapsulates the workflow: it extracts candidate matches with a configurable threshold (default 90) and limit, filters results below the threshold, and returns a new DataFrame with a 'matches' column.

from fuzzywuzzy import fuzz, process


# Fuzzy merge function
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left DataFrame
    :param df_2: right DataFrame
    :param key1: column in df_1 to match
    :param key2: column in df_2 to match
    :param threshold: minimum similarity score to accept
    :param limit: number of top matches to consider
    :return: copy of df_1 with a 'matches' column containing the best match,
             or '' when no candidate reaches the threshold
    """
    # Candidate strings from the right DataFrame
    candidates = df_2[key2].tolist()

    def best_match(value):
        # process.extract returns (candidate, score) pairs, best score first
        for candidate, score in process.extract(value, candidates, limit=limit):
            if score >= threshold:
                return candidate
        return ''

    # Work on a copy so the caller's DataFrame is not modified
    result = df_1.copy()
    result['matches'] = result[key1].apply(best_match)
    return result


# Example usage ('公司名称' is the company-name column in both DataFrames)
# df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
# df
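The same threshold idea can be tried without pandas or FuzzyWuzzy. The sketch below applies it to the province-name case from the introduction using plain lists. `partial_score` and `match_or_blank` are hypothetical helpers built on `difflib` that stand in for the library's scorers, and the province list is made-up sample data.

```python
from difflib import SequenceMatcher


def partial_score(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against same-length windows of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    starts = range(len(longer) - len(shorter) + 1)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + len(shorter)]).ratio()
        for i in starts
    )
    return round(100 * best)


def match_or_blank(value, candidates, threshold=90):
    """Return the best candidate if it reaches the threshold, else ''."""
    best = max(candidates, key=lambda c: partial_score(value, c), default="")
    if best and partial_score(value, best) >= threshold:
        return best
    return ""


# Made-up sample data for illustration
provinces = ["广西壮族自治区", "广东省", "河南省"]
print(match_or_blank("广西", provinces))  # 广西壮族自治区
print(match_or_blank("北京", provinces))  # prints an empty line: nothing reaches 90
```

Returning an empty string for weak matches keeps false positives out of the merged column, at the cost of leaving some rows for manual review.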

By wrapping this function in a reusable module, developers can quickly apply fuzzy matching to various data‑cleaning tasks without rewriting code.
