Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the Python FuzzyWuzzy library, explains its Levenshtein‑based fuzzy string matching functions such as Ratio, Partial Ratio, Token Sort Ratio and Token Set Ratio, demonstrates how to use the process module for extracting best matches, and provides practical code examples for matching company and province names.

Python Programming Learning Circle

Jul 10, 2023

In daily development, we often need to match fields that have slight differences, such as province names written in various forms.

Introduction

This article introduces FuzzyWuzzy, a simple, easy‑to‑use fuzzy string matching library built on the Levenshtein distance algorithm.

FuzzyWuzzy Library Overview

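FuzzyWuzzy uses the Levenshtein distance (also known as edit distance) to calculate the similarity between two strings. The distance is the minimum number of single‑character insertions, deletions, and substitutions needed to turn one string into the other; a smaller distance indicates higher similarity.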
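
To make the definition concrete, here is a minimal pure‑Python sketch of the classic dynamic‑programming computation of edit distance. It is an illustration only, not FuzzyWuzzy's own code, and the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


print(levenshtein("河南", "河南省"))      # 1
print(levenshtein("kitten", "sitting"))  # 3
```

One insertion separates "河南" from "河南省", which is why the two strings score as highly similar.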

The library can be installed in an Anaconda Jupyter Notebook environment with the following command (the -i option points pip at the Tsinghua PyPI mirror; omit it to use the default index):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

1. fuzz Module

The module provides four main functions: simple ratio (Ratio), partial ratio (Partial Ratio), token sort ratio (Token Sort Ratio), and token set ratio (Token Set Ratio).

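Note: if python-Levenshtein is not installed, importing fuzz emits a warning that the slower pure‑Python SequenceMatcher is in use. Installing python-Levenshtein removes the warning and speeds up matching.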

1.1 Simple Ratio (Ratio)

from fuzzywuzzy import fuzz

fuzz.ratio("河南省", "河南省")  # 100
fuzz.ratio("河南", "河南省")    # 80
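
The score of 80 follows from the formula 2·M/T, where M is the number of matched characters (2) and T is the combined length of both strings (5). A minimal sketch using the standard library's difflib, which FuzzyWuzzy falls back to when python-Levenshtein is absent; `simple_ratio` is a hypothetical helper name, not part of the library.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    # SequenceMatcher.ratio() returns 2 * M / T as a float between 0 and 1
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


print(simple_ratio("河南省", "河南省"))  # 100
print(simple_ratio("河南", "河南省"))    # 80
```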

1.2 Partial Ratio

fuzz.partial_ratio("河南省", "河南省")  # 100
fuzz.partial_ratio("河南", "河南省")    # 100
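
Partial ratio rewards the case where the shorter string appears inside the longer one. The idea can be sketched as a sliding window: compare the shorter string with every same‑length slice of the longer string and keep the best score. This is a simplification for illustration (the library picks candidate windows from matching blocks rather than trying every offset), and `simple_partial_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(s1: str, s2: str) -> int:
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    # Compare the shorter string with every same-length window of the longer one
    for start in range(len(longer) - window + 1):
        piece = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, piece).ratio())
    return round(100 * best)


print(simple_partial_ratio("河南", "河南省"))  # 100
print(simple_partial_ratio("北京", "湖北省"))  # 50
```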

1.3 Token Sort Ratio

fuzz.ratio("西藏 区域", "区域 西藏")  # 40
fuzz.token_sort_ratio("西藏 区域", "区域 西藏")  # 100
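
Token sort ratio ignores word order: each string is split into tokens, the tokens are sorted, and the rebuilt strings are compared. A minimal sketch of that idea follows; it leaves out the lowercasing and punctuation stripping that the library also applies, and `simple_token_sort_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_token_sort_ratio(s1: str, s2: str) -> int:
    # Split on whitespace, sort the tokens, and rebuild each string
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    return round(100 * SequenceMatcher(None, sorted1, sorted2).ratio())


print(simple_token_sort_ratio("西藏 区域", "区域 西藏"))  # 100
```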

1.4 Token Set Ratio

fuzz.ratio("西藏 西藏 区域", "区域 西藏")  # 46
fuzz.token_set_ratio("西藏 西藏 区域", "区域 西藏")  # 100
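
Token set ratio goes one step further and ignores repeated tokens as well as order. A hedged sketch of the approach: build the sorted intersection of the two token sets, append each side's leftover tokens, and take the best of the three pairwise comparisons. The helper names are hypothetical, and the library's extra preprocessing is omitted.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def simple_token_set_ratio(s1: str, s2: str) -> int:
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(tokens1 & tokens2))  # tokens shared by both
    only1 = " ".join(sorted(tokens1 - tokens2))   # tokens unique to s1
    only2 = " ".join(sorted(tokens2 - tokens1))   # tokens unique to s2
    combined1 = f"{common} {only1}".strip()
    combined2 = f"{common} {only2}".strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


print(simple_token_set_ratio("西藏 西藏 区域", "区域 西藏"))  # 100
```

Because the duplicated "西藏" collapses inside the set, both strings reduce to the same two tokens and score 100.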

2. process Module

The process module matches a query against a finite list of candidate strings and returns the best candidates together with their similarity scores.

2.1 extract (multiple results)

from fuzzywuzzy import process

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)
# [('郑州市', 90), ('河南省', 0)]
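
Conceptually, extract scores every candidate and returns the highest‑scoring ones. The sketch below shows that ranking step with a plain difflib ratio as the scorer. The library's default scorer is fuzz.WRatio, a weighted blend of the fuzz functions, which is why it reports 90 for "郑州市" while this plain‑ratio sketch gives 80. `score` and `extract_top` are hypothetical names.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    return round(100 * SequenceMatcher(None, query, choice).ratio())


def extract_top(query, choices, limit=2):
    scored = [(choice, score(query, choice)) for choice in choices]
    # sorted() is stable, so candidates with equal scores keep their input order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract_top("郑州", choices))  # [('郑州市', 80), ('河南省', 0)]
```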

2.2 extractOne (single best result)

process.extractOne("郑州", choices)
# ('郑州市', 90)
process.extractOne("北京", choices)
# ('湖北省', 45)
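
The "北京" result shows that a best match is returned even when nothing in the list is a plausible candidate, so a minimum score is usually needed. extractOne accepts a score_cutoff argument for this purpose. The same idea can be tried without installing anything by using difflib.get_close_matches from the standard library, sketched here with an arbitrarily chosen cutoff of 0.6 (on difflib's 0 to 1 scale).

```python
from difflib import get_close_matches

choices = ["河南省", "郑州市", "湖北省", "武汉市"]

# Candidates scoring below the cutoff are dropped instead of being returned
print(get_close_matches("郑州", choices, n=1, cutoff=0.6))  # ['郑州市']
print(get_close_matches("北京", choices, n=1, cutoff=0.6))  # []
```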

3. Practical Applications

3.1 Company Name Fuzzy Matching

A function fuzzy_merge is defined to match company names between two DataFrames using process.extract and a similarity threshold.

from fuzzywuzzy import process


# fuzzy matching function
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left table to join
    :param df_2: right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100) a candidate must reach to count as a match
    :param limit: number of top candidates to consider for each key
    :return: df_1 with a new 'matches' column (df_1 is modified in place)
    """
    s = df_2[key2].tolist()
    # Top `limit` candidates from the right table for each left-table key
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Candidates arrive sorted by score, so keep the first one that reaches
    # the threshold, or '' when none does
    m2 = df_1['matches'].apply(
        lambda x: next((i[0] for i in x if i[1] >= threshold), '')
    )
    df_1['matches'] = m2
    return df_1


# data and company are DataFrames that each have a '公司名称' (company name) column
df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)

3.2 Province Field Fuzzy Matching

The same fuzzy_merge function can be applied to province names, allowing quick alignment of abbreviated and full province strings.
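
As a rough illustration of the province case without pandas or FuzzyWuzzy, the sketch below aligns abbreviated names with full names using a sliding‑window score from difflib and the same "empty string when below threshold" convention as fuzzy_merge. The sample lists, the helper names `best_window_score` and `match_province`, and the threshold of 90 are assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_window_score(short: str, long: str) -> int:
    """Score the best same-length window of `long` against `short` (0-100)."""
    if len(short) > len(long):
        short, long = long, short
    if not short:
        return 0
    size = len(short)
    return max(
        round(100 * SequenceMatcher(None, short, long[i:i + size]).ratio())
        for i in range(len(long) - size + 1)
    )


def match_province(name, provinces, threshold=90):
    """Return the best-scoring full name, or '' when nothing reaches threshold."""
    best = max(provinces, key=lambda p: best_window_score(name, p))
    return best if best_window_score(name, best) >= threshold else ""


provinces = ["河南省", "湖北省", "内蒙古自治区", "广西壮族自治区"]
for raw in ["河南", "内蒙古", "广西", "上海"]:
    print(raw, "->", repr(match_province(raw, provinces)))
```

"上海" has no counterpart in the sample list, so it maps to an empty string rather than to an unrelated province.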

4. Full Function Code

from fuzzywuzzy import fuzz
from fuzzywuzzy import process


# fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100) a candidate must reach to be returned as a match
    :param limit: the number of candidates considered per key, sorted high to low
    :return: df_1 with an added 'matches' column (df_1 is modified in place)
    """
    s = df_2[key2].tolist()
    # Top `limit` candidates from the right table for each left-table key
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep the first (highest-scoring) candidate at or above the threshold, else ''
    m2 = df_1['matches'].apply(
        lambda x: next((i[0] for i in x if i[1] >= threshold), '')
    )
    df_1['matches'] = m2
    return df_1


# data and company are DataFrames that each have a '公司名称' (company name) column
df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
df
