Implementing Fuzzy Company Name Matching with MySQL RegExp in a Business Approval Workflow

This article describes a business approval scenario where a company name entered by a business user must be checked for duplicates, and explains how to implement fuzzy matching using MySQL RegExp, tokenization with IKAnalyzer, and Java service code to extract, preprocess, match, and rank results by relevance.

Java Architect Essentials

Aug 1, 2024

The goal is to build an approval workflow for company applications: a business user submits a new company, an administrator reviews it, and the review must check whether the company already exists under the same or a similar name.

The core steps are extracting key information from the company name, tokenizing it, and performing fuzzy matching against existing records.

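Three MySQL fuzzy search options are considered: LIKE (wildcard matching on one contiguous substring, too rigid for names that match only in part), full‑text index (limited customizability), and REGEXP (supports arbitrary patterns, including an alternation over several keywords). A REGEXP predicate cannot use an index, so every row is scanned, but because the dataset is small REGEXP is chosen despite the lower performance.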
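The difference between the LIKE and REGEXP options can be illustrated in plain Java. The sketch below only approximates the two MySQL predicates with `String.contains` and `java.util.regex`; the class name, method names, and sample company names are hypothetical and do not come from the article.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class FuzzyMatchDemo {
    // Approximates `name LIKE '%keyword%'`: the whole keyword must appear contiguously.
    static List<String> likeContains(List<String> names, String keyword) {
        return names.stream().filter(n -> n.contains(keyword)).collect(Collectors.toList());
    }

    // Approximates `name REGEXP 'a|b|c'`: one alternative found anywhere counts as a hit.
    static List<String> regexpAny(List<String> names, String alternation) {
        Pattern p = Pattern.compile(alternation);
        return names.stream().filter(n -> p.matcher(n).find()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<String> names = Arrays.asList(
                "杭州星河网络科技有限公司", "星河软件(上海)有限公司", "上海蓝海贸易有限公司");
        System.out.println(likeContains(names, "星河科技")); // keyword is not contiguous in any name
        System.out.println(regexpAny(names, "星河|科技"));   // names containing either token
    }
}
```

With this sample data the contiguous search finds nothing, while the alternation finds the first two names, which is the behavior the duplicate check needs.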

Key code snippets:

/**
 * Strips non-distinctive information (region names) from the company name before matching.
 *
 * @param targetCompanyName company name to clean
 * @return the company name without province, city, and county parts
 */
private String formatCompanyName(String targetCompanyName) {
    String regex = "(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)" +
                   "?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)" +
                   // Inside a character class, '(' and '|' are literals, so the characters are listed directly.
                   "?(?<county>[^区市县旗岛]+区|.*?市|.*?县|.*?旗|.*?岛)" +
                   "?(?<village>.*)";
    Matcher matcher = Pattern.compile(regex).matcher(targetCompanyName);
    while (matcher.find()) {
        // remove province, city, county etc.
    }
    // additional address removal using AddressUtil.ADDRESS
    return targetCompanyName;
}
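The region-stripping step is only outlined above. As a minimal sketch of how named groups can isolate the part of a name that follows the region prefix, the example below uses a deliberately simplified pattern (one optional province ending in 省 and one optional city ending in 市). The pattern, class name, and method name are assumptions for illustration, not the article's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionPrefixStripper {
    // Simplified stand-in for the article's pattern: optional "...省", optional "...市", then the rest.
    private static final Pattern REGION = Pattern.compile(
            "^(?<province>[^省]+省)?(?<city>[^市]+市)?(?<rest>.*)$");

    // Returns the text after the leading province/city, or the input unchanged if nothing matches.
    static String stripRegionPrefix(String companyName) {
        Matcher m = REGION.matcher(companyName);
        // Both region groups are optional, so an ordinary single-line name always matches.
        return m.matches() ? m.group("rest") : companyName;
    }

    public static void main(String[] args) {
        System.out.println(stripRegionPrefix("安徽省安庆市星河科技"));
        System.out.println(stripRegionPrefix("上海市蓝海贸易"));
        System.out.println(stripRegionPrefix("星河科技"));
    }
}
```

A purely pattern-based approach over-strips names in which 省 or 市 is part of an ordinary word, so it works best combined with a list of known place names.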

public class AddressUtil {
    public static final String[][] ADDRESS = {
        {"北京"}, {"天津"}, {"安徽","安庆","蚌埠",...}, /* many provinces and cities */
    };
}
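The article does not show how the ADDRESS table is applied. One plausible approach, sketched below under that assumption, is to walk the table and delete every place name found in the company name, trying the suffixed forms (省, 市) first. The three-row table is an excerpt for illustration only, and `stripAddress` is a hypothetical helper.

```java
final class AddressStripper {
    // Excerpt for illustration; the article's full table of provinces and cities is not reproduced here.
    static final String[][] ADDRESS = {
        {"北京"}, {"天津"}, {"安徽", "安庆", "蚌埠"}
    };

    // Removes every known place name, with or without a 省/市 suffix, from the company name.
    static String stripAddress(String companyName) {
        String result = companyName;
        for (String[] row : ADDRESS) {
            for (String place : row) {
                // Remove the longer suffixed forms first so that no stray "省" or "市" is left behind.
                result = result.replace(place + "省", "")
                               .replace(place + "市", "")
                               .replace(place, "");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripAddress("安徽省安庆市星河科技"));
        System.out.println(stripAddress("北京星河科技"));
    }
}
```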

<!-- IKAnalyzer Chinese tokenizer -->
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
    <exclusions>...</exclusions>
</dependency>
<!-- lucene-queryparser: query parser module -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.3.0</version>
</dependency>

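import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils; // assumed; the original does not show which StringUtils is used
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

@Slf4j
public class IKAnalyzerSupport {
    /** Splits the target text into tokens with IKAnalyzer and returns them in order. */
    public static List<String> iKSegmenterToList(String target) throws Exception {
        if (StringUtils.isEmpty(target)) return new ArrayList<>();
        List<String> result = new ArrayList<>();
        StringReader sr = new StringReader(target);
        // The second argument enables smart mode, which produces coarser-grained tokens.
        IKSegmenter ik = new IKSegmenter(sr, true);
        Lexeme lex;
        while ((lex = ik.next()) != null) {
            result.add(lex.getLexemeText());
        }
        return result;
    }
}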

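private String splitWord(String targetCompanyName) {
    log.info("Tokenizing the preprocessed company name");
    // Fall back to the unsplit name if tokenization fails or yields nothing.
    String result = targetCompanyName;
    try {
        List<String> splitWord = IKAnalyzerSupport.iKSegmenterToList(targetCompanyName);
        if (!splitWord.isEmpty()) {
            // Join the distinct tokens into a regex alternation such as "a|b|c".
            result = splitWord.stream().distinct().collect(Collectors.joining("|"));
        }
        log.info("Tokenization result: {}", result);
    } catch (Exception e) {
        // Pass the exception itself so that the stack trace is logged too.
        log.error("Tokenization failed: {}", e.getMessage(), e);
    }
    return result;
}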
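Joining raw tokens with "|" assumes that no token contains a regex metacharacter and that at least one token exists. A more defensive variant is sketched below: it drops blank tokens, escapes metacharacters with a backslash, and returns null when nothing usable remains so that the caller can skip the query. The class and method names are hypothetical, and the escaping assumes the pattern is passed to MySQL as a bound parameter rather than spliced into a SQL string literal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class RegexpPatternBuilder {
    // Characters that carry special meaning in regular expressions.
    private static final String META = "\\.^$|?*+()[]{}";

    // Prefixes every metacharacter with a backslash so that it is matched literally.
    static String escape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds "a|b|c" from the tokens; returns null when no usable token remains.
    static String toAlternation(List<String> tokens) {
        String joined = tokens.stream()
                .filter(t -> t != null && !t.trim().isEmpty())
                .map(String::trim)
                .distinct()
                .map(RegexpPatternBuilder::escape)
                .collect(Collectors.joining("|"));
        return joined.isEmpty() ? null : joined;
    }

    public static void main(String[] args) {
        System.out.println(toAlternation(Arrays.asList("星河", "科技", "星河")));
        System.out.println(toAlternation(Arrays.asList("c++", "")));
    }
}
```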

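public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp) {
    String sourceCompanyName = companyDTO.getCompanyName();
    String targetCompanyName = sourceCompanyName;
    log.info("Company name before preprocessing: {}", targetCompanyName);
    // Strip full-width and half-width parentheses.
    targetCompanyName = targetCompanyName.replaceAll("[（）()]", "");
    // Strip generic words. An alternation is used because a character class such as
    // "[(集团|股份|...)]" deletes every listed character individually, not whole words.
    // "公司" is added (an assumption) because the original class also removed those two characters.
    targetCompanyName = targetCompanyName.replaceAll("集团|股份|有限|责任|分公司|公司", "");
    // Bank names skip region stripping, presumably because the region identifies the branch.
    if (!targetCompanyName.contains("银行")) {
        targetCompanyName = formatCompanyName(targetCompanyName);
    }
    String splitCompanyName = splitWord(targetCompanyName);
    List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName, targetCompanyName);
    List<String> result = new ArrayList<>();
    for (Company c : matchedCompany) {
        // Skip the record being edited. java.util.Objects.equals tolerates a null ID on a new company.
        if (!Objects.equals(companyDTO.getCompanyId(), c.getCompanyId())) {
            result.add(c.getCompanyName());
        }
    }
    return JsonResult.successResult(result);
}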
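The pattern used to strip generic words deserves care: wrapped in square brackets, "(集团|股份|有限|责任|分公司)" becomes a character class that removes each listed character wherever it occurs, not the whole words. The sketch below contrasts the two forms on a hypothetical name. The class name, method names, and the extra "公司" alternative are assumptions for illustration.

```java
final class CompanyNameCleaner {
    // Character class: deletes every single listed character, including '(' , '|' and ')'.
    static String cleanWithCharClass(String name) {
        return name.replaceAll("[(集团|股份|有限|责任|分公司)]", "");
    }

    // Alternation: deletes only whole words; "分公司" is listed before "公司" so the longer word wins.
    static String cleanWithAlternation(String name) {
        return name.replaceAll("集团|股份|有限|责任|分公司|公司", "");
    }

    public static void main(String[] args) {
        String name = "分享科技有限公司"; // hypothetical company name
        System.out.println(cleanWithCharClass(name));   // the leading "分" is lost as well
        System.out.println(cleanWithAlternation(name)); // only "有限" and "公司" are removed
    }
}
```

With the character class, the distinctive word 分享 loses its first character, which changes the tokens that the later matching step sees.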

@Query(value = "SELECT * FROM company WHERE isDeleted = '0' and companyName REGEXP ?1 ORDER BY length(REPLACE(companyName,?2,''))/length(companyName)", nativeQuery = true)
List<Company> queryMatchCompanyName(String companyNameRegex, String companyName);

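The ORDER BY clause ranks the results by relevance. REPLACE(companyName, ?2, '') removes every occurrence of the preprocessed search name (?2) from the stored name, so LENGTH(REPLACE(companyName, ?2, '')) / LENGTH(companyName) is the fraction of the stored name that is left over. Sorting by this ratio in ascending order puts the closest matches first: 0 means the stored name consists entirely of the search name, and 1 means the search name does not appear in it as a whole.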
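To make the ordering concrete, the sketch below recomputes the same ratio in Java for a few hypothetical names. It counts UTF-8 bytes because MySQL's LENGTH() returns a byte count, not a character count. The class and method names are illustrative and not part of the article's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RelevanceRanking {
    // Mirrors MySQL LENGTH(), which counts bytes (here assuming UTF-8 storage).
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // LENGTH(REPLACE(name, target, '')) / LENGTH(name); lower means a closer match.
    static double score(String name, String target) {
        if (name.isEmpty()) {
            return 1.0;
        }
        String remainder = target.isEmpty() ? name : name.replace(target, "");
        return (double) byteLength(remainder) / byteLength(name);
    }

    // Returns a copy of the names sorted by ascending score, like the ORDER BY clause.
    static List<String> rank(List<String> names, String target) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingDouble(n -> score(n, target)));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical stored names, all of which would pass the REGEXP filter "星河|科技".
        List<String> names = Arrays.asList("星河云计算", "星河科技发展", "星河科技");
        System.out.println(rank(names, "星河科技"));
    }
}
```

For the search name 星河科技, the exact match scores 0, 星河科技发展 scores one third, and 星河云计算 scores 1 because the full search name never appears in it.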
