Community Data Normalization Using Prefix Matching and Text Similarity
The study presents a four-step pipeline that normalizes community data for rental platforms. It clusters records by longest common prefix, filters each cluster geographically, merges near-duplicate clusters using Levenshtein similarity, and assigns parent-child relationships by pattern matching. On labeled data, the false-positive rate stays below 8 % and the false-negative rate below 5 %.
Background: Accurate community (小区) information is crucial for rental platforms. Multiple data sources provide heterogeneous, redundant, and noisy records, making it necessary to aggregate and resolve hierarchical relationships among communities.
Goal: Identify and unify duplicate or synonymous community entries, distinguish parent communities from sub‑communities and building addresses, and supplement missing parent or sub‑community data.
Approach: The method relies on two observations. First, a sub-community shares a longest common prefix with its parent. Second, community names follow recognizable patterns, e.g. "PP[数字|字母]区" (a digit or letter followed by 区, "zone") and "PP[数字]幢" (a digit followed by 幢, "building"). A four-step pipeline is proposed:
1. Prefix-based clustering to build approximate community trees.
2. Geographic filtering to remove communities whose GPS distance from the tree root exceeds 2 km.
3. Similarity-driven re-clustering that merges synonymous trees when their names are close under a Levenshtein-based similarity (score > 2) and their GPS distance is below 1 km.
4. Final normalization by merging overlapping trees and assigning parent-child relationships via pattern matching.
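The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `min_len` prefix cutoff, the `max_edits` threshold, and the inclusion of Chinese numerals in the pattern are all hypothetical. The summary does not define its similarity score, so the merge test below uses a plain edit-distance limit as a stand-in.

```python
import math
import re


def common_prefix(a: str, b: str) -> str:
    """Return the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def cluster_by_prefix(names, min_len=2):
    """Step 1: group lexicographically adjacent names that share a prefix.

    min_len is a hypothetical cutoff, not a value from the article.
    """
    clusters = []  # list of (shared_prefix, member_names)
    for name in sorted(set(names)):
        if clusters:
            prefix, members = clusters[-1]
            shared = common_prefix(prefix, name)
            if len(shared) >= min_len:
                clusters[-1] = (shared, members + [name])
                continue
        clusters.append((name, [name]))
    return clusters


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def filter_by_distance(root_gps, members, gps, max_km=2.0):
    """Step 2: keep members within max_km of the tree root (2 km in the article)."""
    return [m for m in members if haversine_km(root_gps, gps[m]) <= max_km]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def should_merge(name_a, gps_a, name_b, gps_b, max_edits, max_km=1.0):
    """Step 3 stand-in: merge when names are close and roots lie within max_km.

    max_edits is an assumed threshold; the article's own score is not defined.
    """
    return (levenshtein(name_a, name_b) <= max_edits
            and haversine_km(gps_a, gps_b) < max_km)


# Step 4: a simplified regex for names like "<parent><digit|letter>区" or
# "<parent><digit>幢". Chinese numerals are an added assumption.
PARENT_PATTERN = re.compile(r"^(.+?)([0-9A-Za-z一二三四五六七八九十]+)(区|幢)$")


def split_parent(name: str):
    """Return the parent part of a sub-community name, or None if no pattern fits."""
    m = PARENT_PATTERN.match(name)
    return m.group(1) if m else None
```

For example, `cluster_by_prefix` places "阳光花园", "阳光花园一区", and "阳光花园三幢" in one tree rooted at "阳光花园", and `split_parent("阳光花园一区")` recovers that root. The adjacent-name grouping is deliberately loose, which matches the article's framing of step 1 as producing approximate trees that later steps prune and merge.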
Data Pre-processing: Raw records are standardized in three ways. City and district names are put into a uniform format, GPS coordinates are converted to the Gaode (Amap) coordinate system, and punctuation is removed from names so that only Chinese characters remain.
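The name-cleaning part of this step might look like the sketch below. It assumes that "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name `clean_name` is hypothetical. City/district formatting and the coordinate conversion are not shown.

```python
import re

# Assumption: "Chinese characters" = CJK Unified Ideographs, U+4E00..U+9FFF.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def clean_name(raw: str) -> str:
    """Drop every character outside the CJK block (punctuation, spaces, digits, Latin)."""
    return _NON_CHINESE.sub("", raw)
```

With this definition, `clean_name("阳光花园（一期）·北区 ")` yields "阳光花园一期北区". Note that Arabic digits and Latin letters are also dropped under this reading.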
Evaluation: Using a manually labeled dataset from Gaode Maps, the algorithm achieves false‑positive rates below 8 % and false‑negative rates below 5 %, demonstrating reliable community normalization.
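The summary does not say how the two rates are computed. One plausible reading, shown here purely as an assumption, compares predicted (child, parent) assignments against the labeled ones: the false-positive rate is the share of predicted pairs absent from the labels, and the false-negative rate is the share of labeled pairs the algorithm missed. The function name `error_rates` is hypothetical.

```python
def error_rates(predicted: set, labeled: set):
    """Return (false_positive_rate, false_negative_rate) over sets of (child, parent) pairs.

    Assumed definitions, not taken from the article:
      FP rate = predicted pairs not in the labels / all predicted pairs
      FN rate = labeled pairs not predicted / all labeled pairs
    """
    fp = len(predicted - labeled)
    fn = len(labeled - predicted)
    fp_rate = fp / len(predicted) if predicted else 0.0
    fn_rate = fn / len(labeled) if labeled else 0.0
    return fp_rate, fn_rate
```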
Conclusion: By exploiting naming conventions and hierarchical patterns, the proposed text‑matching and similarity analysis provides a simple yet accurate solution for community data normalization, enhancing search efficiency and data quality in real‑estate applications.
Xianyu Technology
Official account of the Xianyu technology team