GeoBERT: A Multi‑Task Pre‑trained Language Model for Chinese Address Text
This article introduces GeoBERT, a pre‑training method for Chinese address strings that jointly constrains seven tasks to capture spatial semantics, administrative hierarchy, and similarity relationships. The resulting model supports downstream address classification, segmentation, POI extraction, similarity comparison, and authenticity verification with reduced dependence on annotated data.
With the rapid growth of big data and Geographic Information Systems (GIS), geocoding has become essential for linking non‑spatial data to spatial contexts. Addresses serve as key textual carriers of geographic information for applications such as GEO‑BI, urban governance, and financial risk control.
This work proposes GeoBERT, the first language‑model pre‑training approach specifically designed for Chinese address texts. By jointly constraining seven auxiliary tasks, GeoBERT learns spatial semantic relations, administrative‑level elements, and hierarchical affiliations without relying on external mapping dictionaries, while preserving true address relationships in high‑dimensional space.
The vector representations produced by GeoBERT can be directly used as embeddings for downstream address‑related deep neural networks (e.g., classification, segmentation, POI extraction, similarity comparison, authenticity verification), thereby reducing the need for large labeled datasets, shortening training convergence time, and improving accuracy and recall.
1. Model Training Data Pre‑processing
Raw corpora consist of address strings and their IDs. Cleaning removes overly short or long entries and records with missing fields, converts full‑width characters to half‑width, and strips spaces, tabs, quotes, and Chinese punctuation. After stratified sampling to obtain an unbiased distribution, the data are shuffled and split into training, validation, and test sets.
Address pairs are constructed by pairing each address with a randomly selected second address with probability p (e.g., 50 %) or with itself with probability 1 − p. For each pair, the province, city, and district labels are compared, and the ratio of the longest common subsequence (LCS) length to the pair's average length is computed, forming the basic dataset (Table 1).
Each address pair is tokenized at the character level and padded or truncated to a maximum sequence length. Following the BERT masking strategy, 15 % of tokens are selected for prediction; of these, 80 % become [MASK], 10 % keep the original character, and 10 % are replaced with a random token. Tokens are then converted to integer indices using the constructed character dictionary.
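The pair‑construction step can be sketched as follows. The `parsed` lookup (address → province/city/district) and the field names are illustrative assumptions; the LCS computation is the standard dynamic program:

```python
import random

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (classic DP)."""
    dp = [0] * (len(b) + 1)
    for ch_a in a:
        prev = 0
        for j, ch_b in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if ch_a == ch_b else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def make_pair(addr, parsed, pool, p=0.5):
    """Pair `addr` with a random address (prob. p) or itself (prob. 1 - p),
    then attach province/city/district match flags and the LCS ratio.
    `parsed` maps address -> (province, city, district)."""
    other = random.choice(pool) if random.random() < p else addr
    prov_a, city_a, dist_a = parsed[addr]
    prov_b, city_b, dist_b = parsed[other]
    avg_len = (len(addr) + len(other)) / 2
    return {
        "pair": (addr, other),
        "same_address": addr == other,
        "same_province": prov_a == prov_b,
        "same_city": city_a == city_b,
        "same_district": dist_a == dist_b,
        "lcs_ratio": lcs_length(addr, other) / avg_len,
    }
```

These flags and the LCS ratio become the labels for the auxiliary tasks described in Section 2.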
2. Pre‑training Language Model Construction
GeoBERT extends the original BERT architecture with seven auxiliary tasks:
Task 1 – “Same address?” binary classification using a dedicated [CLS1] token.
Task 2 – “Same physical object?” binary classification with [CLS2].
Task 3 – Province‑match binary classification ([CLS3]).
Task 4 – City‑match binary classification ([CLS4]).
Task 5 – District‑match binary classification ([CLS5]).
Task 6 – Regression of the LCS‑length‑to‑average‑length ratio using [CLS6].
Task 7 – Standard masked‑character prediction (the seventh loss term), as in BERT.
The overall loss is a weighted sum of the seven task losses (Figure 2).
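The combination of losses might look like the following sketch. The choice of binary cross‑entropy for Tasks 1–5, squared error for Task 6, and equal default weights are all assumptions; the article does not specify the per‑task loss functions or weights:

```python
import math

def bce(p, y):
    """Binary cross-entropy for the five classification heads (Tasks 1-5)."""
    eps = 1e-7
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def mse(pred, target):
    """Squared error for the LCS-ratio regression head (Task 6)."""
    return (pred - target) ** 2

def geobert_loss(cls_probs, cls_labels, lcs_pred, lcs_target, mlm_loss, weights=None):
    """Weighted sum of the seven task losses; equal weights by default."""
    losses = [bce(p, y) for p, y in zip(cls_probs, cls_labels)]
    losses += [mse(lcs_pred, lcs_target), mlm_loss]
    weights = weights or [1.0] * 7
    return sum(w * l for w, l in zip(weights, losses))
```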
3. Model Training
Training uses roughly 90 million address records on two V100 GPUs for six days, with a batch size of 64, a maximum sequence length of 120, at most 18 masked predictions per sequence, and a learning rate of 1e‑5. The process iterates through forward‑backward passes until the loss falls below a predefined threshold (Figure 3).
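The threshold‑based stopping loop can be sketched as follows; `model_step` (one forward‑backward pass plus an optimizer update) and the threshold value are illustrative assumptions:

```python
def pretrain(model_step, data_iter, loss_threshold=0.05):
    """Iterate forward/backward passes until the loss falls below a threshold.

    model_step(batch) runs one forward pass, backpropagation, and optimizer
    update, returning the batch loss as a float.
    """
    loss = float("inf")
    for batch in data_iter:
        loss = model_step(batch)
        if loss < loss_threshold:
            break
    return loss
```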
4. Application Scenarios
• Address data protection: Organizations can share the pre‑trained GeoBERT model instead of raw address data, enabling collaborative model updates while preserving privacy (Figure 4).
• Address authenticity verification: The address embeddings from GeoBERT serve as inputs to a fine‑tuned verification network, dramatically improving accuracy and recall while reducing manual effort.
• Address segmentation: By transferring the learned spatial semantics, a downstream segmentation model can extract administrative levels and meaningful tokens without extensive rule‑based dictionaries.
Future work will detail the segmentation model construction, training, and evaluation.
References
1. Vaswani A, et al. "Attention Is All You Need". NIPS, 2017.
2. Devlin J, et al. "BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding". NAACL, 2019.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.