
Evolution and Practice of Scene Text Recognition Technology in Gaode Map Data Production

Gaode Maps has evolved its scene text recognition from traditional image processing to deep-learning detection and recognition pipelines, combining multi-stage models, data augmentation, and synthetic sample generation to automate POI and road data production with high accuracy and speed.

DataFunTalk

Background – Gaode Maps serves over a hundred million daily active users, and the richness and accuracy of its map data directly affect user experience. Traditional map data collection relied on manual editing of field‑collected assets, leading to slow updates and high costs. To automate data production, Gaode adopted image‑recognition techniques that extract map elements directly from massive image collections, focusing especially on POI (Point of Interest) and road data.

Scene Text Recognition (STR) is a critical component of this pipeline. Real‑world images present challenges such as diverse fonts, complex backgrounds, occlusions, and low‑quality captures from crowdsourced devices. The system must achieve both high recall (detecting as many text instances as possible) and high precision (over 99% accuracy for critical POI names and road signs).
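The recall/precision targets above can be made concrete. A minimal sketch of how both metrics are computed from matched detections (function and variable names are illustrative, not Gaode's code; matching detections to ground truth, e.g. by IoU, is assumed to have happened already):

```python
def precision_recall(num_true_positives, num_detections, num_ground_truths):
    """Detection precision and recall from match counts.

    precision = correct detections / all detections
    recall    = correct detections / all annotated instances
    """
    precision = num_true_positives / num_detections if num_detections else 0.0
    recall = num_true_positives / num_ground_truths if num_ground_truths else 0.0
    return precision, recall

# Example: 995 correct detections out of 1000 predictions,
# against 1100 annotated text instances.
p, r = precision_recall(995, 1000, 1100)
```

The ">99% precision" requirement constrains the first ratio; recall then measures how much of the map-relevant text the pipeline actually surfaces.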

Evolution of STR Technology

Traditional image‑processing era (pre‑2012) – Algorithms consisted of three stages: image preprocessing (region localization, rectification, segmentation), handcrafted feature extraction (e.g., HOG) or shallow CNNs, and classification using models like SVM. Post‑processing applied language models or rule‑based corrections. These methods required extensive hand‑tuning for each scenario and struggled with generalization.
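To illustrate the handcrafted-feature stage, here is a simplified HOG-style descriptor: a magnitude-weighted histogram of gradient orientations over a grayscale patch. This is a teaching sketch of the general technique, not the exact descriptor any production system used:

```python
import numpy as np

def gradient_orientation_histogram(patch, n_bins=9):
    """Simplified HOG-style descriptor: histogram of gradient orientations
    over a grayscale patch, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in classic HOG.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist  # L1-normalize

# A vertical edge produces horizontal gradients (orientation ~0 degrees),
# so the first bin dominates.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
desc = gradient_orientation_histogram(patch)
```

Descriptors like this were fed to an SVM per character class; every new scenario (fonts, lighting, capture device) meant re-tuning this feature stage by hand, which is exactly the generalization problem deep learning later removed.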

Deep‑learning era (post‑2012) – The field shifted to end‑to‑end neural networks. Two main paradigms emerged:

Two‑stage pipelines: first detect text lines (using regression‑based, segmentation‑based, or hybrid methods), then recognize the content (CTC‑based or attention‑based decoders).

End‑to‑end models that jointly perform detection and recognition, improving speed and allowing mutual learning between tasks.
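The CTC-based decoders mentioned above share one simple inference rule: take the per-frame argmax labels, collapse consecutive repeats, and drop the blank symbol. A minimal best-path decoding sketch (label values are illustrative):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """CTC best-path decoding: collapse consecutive repeated labels,
    then remove blanks. `frame_labels` is the per-frame argmax of the
    recognizer's output distribution."""
    decoded, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Frames: blank, 'a', 'a', blank, 'b', 'b'  ->  "ab" (labels 1, 2)
result = ctc_greedy_decode([0, 1, 1, 0, 2, 2])
```

Note that a blank between two identical labels keeps them distinct (`[1, 0, 1]` decodes to two characters), which is how CTC represents doubled letters.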

Gaode’s current framework combines the strengths of both approaches. It consists of three modules: text‑line detection, single‑character detection & recognition, and sequence recognition. The detection module predicts masks to handle arbitrary orientations and distortions; the sequence recognizer handles regular text lines, while the character‑level detector supplements difficult cases such as artistic fonts or heavily distorted characters.

Text‑line detection – Built on a two‑stage instance‑segmentation backbone, enhanced with deformable convolutions (DCN), enlarged mask features, and ASPP modules. Data augmentation (rotation, flip, mixup) is applied online to improve robustness. The model outputs both segmentation masks and minimal‑enclosing polygons for downstream OCR.
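The "minimal-enclosing polygon" step can be sketched as fitting an oriented rectangle to the mask's foreground pixels. Production pipelines typically use something like OpenCV's `minAreaRect`; the dependency-free PCA approximation below is an illustration, not Gaode's implementation:

```python
import numpy as np

def mask_to_oriented_box(mask):
    """Approximate a binary mask's minimal enclosing polygon with an
    oriented rectangle, via PCA on the foreground pixel coordinates."""
    ys, xs = np.nonzero(mask)
    points = np.stack([xs, ys], axis=1).astype(float)
    center = points.mean(axis=0)
    centered = points - center
    # Principal axes of the pixel cloud give the box orientation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T
    mins, maxs = projected.min(axis=0), projected.max(axis=0)
    corners_local = np.array([[mins[0], mins[1]], [maxs[0], mins[1]],
                              [maxs[0], maxs[1]], [mins[0], maxs[1]]])
    return corners_local @ vt + center  # 4x2 polygon, image coordinates

mask = np.zeros((20, 20))
mask[5:10, 2:12] = 1  # axis-aligned text region for the example
box = mask_to_oriented_box(mask)
```

Because the box is fit to the mask rather than to an axis-aligned proposal, rotated and perspective-distorted text lines still get a tight quadrilateral for the downstream recognizer.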

Evaluation on public benchmarks (ICDAR 2013, ICDAR 2017‑MLT, ICDAR 2019‑ReCTS) shows competitive scores, confirming the detector’s effectiveness for Gaode’s POI and road‑sign pipelines.

Recognition strategies

Single‑character detection & recognition – Uses a Faster R-CNN‑style detector and a SENet‑based recognizer covering >7,000 Chinese/English characters. Optimizations include identity‑mapping, MobileNetV2‑style skip connections, and extensive data augmentation, achieving a second‑place finish in the ICDAR 2019‑ReCTS competition.

Sequence recognition – Employs a TPS‑Inception‑BiLSTM‑Attention architecture. Images are rectified via TPS, padded to square inputs, and processed by a CNN backbone, followed by BiLSTM encoding and attention decoding. The model supports English, simplified Chinese, and traditional Chinese character sets and performs well on artistic or blurry text.
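The attention decoding stage reduces, at each output step, to scoring the encoder's time steps against the decoder state and averaging them into a context vector. A minimal dot-product sketch (the article's model uses learned attention; shapes and values here are illustrative):

```python
import numpy as np

def attention_step(encoder_states, decoder_state):
    """One dot-product attention step: score each encoder time step,
    softmax over time, return the context vector and the weights."""
    scores = encoder_states @ decoder_state           # (T,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    context = weights @ encoder_states                # (hidden_dim,)
    return context, weights

# 4 encoder frames, hidden dim 3; the query aligns with frame 2.
enc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.5, 0.5, 0.0]])
query = np.array([0.0, 0.0, 5.0])
context, weights = attention_step(enc, query)
```

Each decoded character thus attends to a soft region of the rectified text line, which is why the architecture tolerates the uneven character spacing of artistic or blurry text better than rigid per-frame CTC alignment.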

Sample mining & synthesis – To address rare characters and low‑frequency words, Gaode mines real‑world images containing such glyphs and manually annotates them. Additionally, synthetic data are generated via image‑rendering pipelines. Mixing real and synthetic samples markedly improves recognition of obscure characters.
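One common way to mix real and synthetic samples in favor of rare glyphs is inverse-frequency oversampling. The sketch below uses a square-root smoothing exponent, a common choice; the article does not specify Gaode's exact weighting scheme:

```python
def oversampling_weights(char_counts, power=0.5):
    """Sampling weights that boost rare characters: weight each character
    by count**(-power), then normalize to a distribution. power=0.5 is a
    typical smoothing value (power=1 would fully flatten frequencies)."""
    raw = {c: n ** -power for c, n in char_counts.items() if n > 0}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

# A glyph seen 4 times gets twice the weight of one seen 16 times
# (inverse square-root ratio).
weights = oversampling_weights({"国": 10000, "龘": 4, "鑫": 16})
```

Sampling training batches from these weights, over the union of mined real images and rendered synthetic ones, is what lets obscure characters accumulate enough gradient signal to be learned.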

Future directions and challenges

Data side – Automatic data expansion via advanced augmentation (e.g., AutoAugment) and style‑transfer synthesis (e.g., SwapText) to mitigate annotation scarcity.
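The core idea behind policy-based augmentation can be shown in miniature: sample one operation from a policy space and apply it. Real AutoAugment *searches* for the policy with reinforcement learning; the op list and magnitudes below are illustrative placeholders:

```python
import random
import numpy as np

def random_augment(image, rng):
    """Toy augmentation policy: pick one op at random and apply it.
    The ops and magnitude ranges here are illustrative, not a
    searched AutoAugment policy."""
    op = rng.choice(["identity", "flip_lr", "rotate180", "brightness"])
    if op == "flip_lr":
        return image[:, ::-1]
    if op == "rotate180":
        return image[::-1, ::-1]
    if op == "brightness":
        return np.clip(image * rng.uniform(0.7, 1.3), 0, 255)
    return image

rng = random.Random(0)
augmented = random_augment(np.full((4, 4), 100.0), rng)
```

The payoff is that each annotated image yields many distinct training views, directly attacking the annotation-scarcity problem.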

Model side – Tackling blurred text using super‑resolution techniques (TextSR, GAN‑based feature SR) and integrating semantic priors from language models (e.g., SEED) to boost accuracy.
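For context, the baseline that learned text super-resolution methods such as TextSR aim to beat is plain interpolation. A naive bilinear upsampler (illustrative, dependency-free):

```python
import numpy as np

def bilinear_upsample(img, scale):
    """Naive bilinear upsampling of a 2-D grayscale image by an integer
    factor. Learned SR replaces this with a network that hallucinates
    plausible stroke detail instead of just smoothing."""
    h, w = img.shape
    new_h, new_w = h * scale, w * scale
    # Source coordinates for each output pixel (align pixel centers).
    ys = (np.arange(new_h) + 0.5) / scale - 0.5
    xs = (np.arange(new_w) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    tl = img[np.ix_(y0, x0)]; tr = img[np.ix_(y0, x0 + 1)]
    bl = img[np.ix_(y0 + 1, x0)]; br = img[np.ix_(y0 + 1, x0 + 1)]
    top = tl * (1 - wx) + tr * wx
    bot = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bot * wy

low = np.full((4, 4), 7.0)
high = bilinear_upsample(low, 2)
```

Interpolation cannot recover stroke edges that blur destroyed, which is why GAN-based feature SR and semantic priors from language models are the more promising directions.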

Edge deployment – Researching lightweight OCR models for on‑device inference to reduce cloud bandwidth and server load.
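One standard step toward such lightweight on-device models is weight quantization. A sketch of symmetric per-tensor int8 quantization (a generic technique, not a description of Gaode's deployment stack):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]
    with a single scale factor, shrinking weights 4x vs float32."""
    scale = np.abs(weights).max() / 127.0
    if scale == 0:
        return np.zeros_like(weights, dtype=np.int8), 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for accuracy checks."""
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, s = quantize_int8(w)
restored = dequantize(q, s)
```

Combined with compact backbones, quantization like this cuts both the model download size and the per-inference cost, which is what makes on-device OCR competitive with shipping every frame to the cloud.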

Overall, Gaode’s OCR system, through iterative algorithmic refinement and multi‑modal fusion, now automates over 70% of POI data creation and more than 90% of road‑information updates, dramatically lowering manual labor and operational costs.

Call to action – The article concludes by inviting readers to participate in the AMAP‑TECH algorithm competition, which focuses on dynamic road‑condition analysis from vehicle video streams.

deep learning · OCR · map data automation · scene text recognition · Gaode
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
