
POI Signboard Image Retrieval: Technical Solution, Model Design, and Future Directions

To efficiently filter unchanged POI signboards, the authors propose a multimodal image‑retrieval system that combines enhanced global and local visual features with BERT‑encoded OCR text, using metric learning and alignment techniques to achieve over 95% accuracy while handling occlusion, viewpoint variation, and subtle text changes.

Amap Tech

Background

POI (Point of Interest) data is a core component of electronic maps, providing name and location information for restaurants, shops, government agencies, tourist attractions, and transportation facilities. Accurate POI data enables basic user functions such as destination search and navigation, while also supporting services like nearby search and reviews.

On the mapping platform, POI data changes little over short intervals (less than a month): most POIs remain unchanged, and only a few, such as a newly opened restaurant, appear as new entries.

Problem Definition

Processing all POIs incurs high operational cost. Therefore, unchanged POIs must be filtered automatically, and the key technology for this filtering is image matching, which constitutes a typical image retrieval task.

Technical Definition

Image retrieval is defined as follows: given a query image, search a large gallery for visually similar images. The core technique is metric learning, using losses such as contrastive loss, triplet loss, and center loss to pull same‑class samples together and push different‑class samples apart. Feature extraction (global, local, auxiliary) is essential, especially for tasks with strong text dependence.
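The pull-together/push-apart objective can be made concrete with a triplet loss. The sketch below is a minimal NumPy illustration; the margin value and the toy embeddings are arbitrary choices for demonstration, not the production configuration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss: penalize when the anchor-positive distance is not
    smaller than the anchor-negative distance by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Toy 4-D embeddings: the positive lies close to the anchor, the negative far away.
a = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0, 0.0])
n = np.array([0.0, 0.0, 1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 — this triplet is already well separated
```

When the triplet is violated (e.g. swapping the positive and negative), the loss turns positive and its gradient pushes the embeddings apart, which is exactly the behavior metric learning relies on.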

Problem Characteristics

1. Heterogeneous data – POI signboards are captured by cameras of varying quality and from varying viewpoints, leading to large differences in brightness, shape, and clarity.

2. Severe occlusion – Trees, vehicles, and other objects often block signboards, making feature extraction difficult.

3. Text dependency – Small changes in the POI name text should prevent a match, requiring the model to incorporate textual cues.

Technical Solution

The solution consists of data iteration and model optimization. Data generation includes automatic cold‑start data creation using SIFT matching and a model‑iteration data mining pipeline that extracts training pairs from the results of online manual verification.
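The cold-start idea, pairing near-duplicate signboard images via local descriptor matching, can be sketched with Lowe's ratio test, the standard filtering step in SIFT matching pipelines. The descriptors below are synthetic stand-ins for real SIFT output (which would come from a detector such as OpenCV's `SIFT_create`); the 0.75 ratio is the commonly used default, not necessarily the authors' setting:

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Lowe's ratio test: keep a candidate match only if the nearest
    descriptor in desc_b is clearly closer than the second nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# Synthetic 128-D "SIFT" descriptors for two captures of the same signboard:
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 128))
desc_a = base
desc_b = base + rng.normal(scale=0.01, size=base.shape)  # near-duplicate image
print(len(ratio_test_matches(desc_a, desc_b)))  # 5 — many matches suggest the same POI
```

A high ratio of surviving matches between two captures is the signal that lets the pipeline auto-label them as an unchanged pair.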

The retrieval model is multimodal, comprising a visual branch and a text branch. Visual features are extracted by a dual‑branch network (global + local). Text features are obtained by applying BERT to OCR results of the signboard, with [SEP] tokens separating OCR outputs from different frames.
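The text-branch input construction, concatenating multi-frame OCR strings with separator tokens, can be sketched as below. The hashing-trick encoder is a toy stand-in for BERT, purely so the example is self-contained; a real system would feed the concatenated string to a BERT tokenizer and model:

```python
import hashlib
import numpy as np

def concat_ocr(frames, sep="[SEP]"):
    """Join OCR strings from multiple frames with separator tokens,
    mirroring how the text branch assembles its BERT input."""
    return f" {sep} ".join(frames)

def toy_text_embedding(text, dim=16):
    """Hashing-trick bag-of-tokens vector: an illustrative stand-in
    for the BERT sentence embedding, NOT a real encoder."""
    vec = np.zeros(dim)
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

frames = ["Starbucks Coffee", "Starbucks Coffe"]  # OCR noise differs across frames
text = concat_ocr(frames)
print(text)  # Starbucks Coffee [SEP] Starbucks Coffe
print(toy_text_embedding(text).shape)  # (16,)
```

Concatenating several frames makes the text feature robust to per-frame OCR errors: a character dropped in one frame is usually read correctly in another.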

Global Feature Enhancements – Attention mechanisms (Spatial Group‑wise Enhance, SGE) are introduced to focus on discriminative regions, and the backbone is modified to retain more fine‑grained details (removing the last down‑sampling block). GeM pooling replaces global average pooling for more robust aggregation.
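GeM pooling is compact enough to state directly: it is a generalized mean over spatial locations, interpolating between average pooling (p = 1) and max pooling (p → ∞). A NumPy sketch follows; p = 3 is a common default in the retrieval literature, not necessarily the authors' setting:

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial dims of a (C, H, W)
    feature map. p=1 recovers average pooling; large p approaches max pooling."""
    clamped = np.clip(feature_map, eps, None)  # avoid fractional powers of negatives
    return (clamped ** p).mean(axis=(1, 2)) ** (1.0 / p)

fmap = np.array([[[0.1, 0.2], [0.3, 4.0]]])  # one channel with a salient peak
print(gem_pool(fmap, p=1.0))  # ≈ average pooling: [1.15]
print(gem_pool(fmap, p=3.0))  # larger value — weighted toward the salient peak
```

The learnable exponent p lets the network decide per-dataset how strongly to emphasize peak activations, which tends to be more robust than plain average pooling for retrieval.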

Local Feature Enhancements – The signboard is vertically split into several parts; each part’s feature map is pooled, and a similarity matrix is computed between parts of two images. The optimal alignment is found by minimizing the summed Euclidean distances (see Formula 1). This alignment improves retrieval under truncation, occlusion, and inaccurate bounding boxes.
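Formula 1 is not reproduced in this text. One standard way to realize an order-preserving part alignment that minimizes summed distances is a DTW-style dynamic program over the pairwise part-distance matrix (as in AlignedReID-style local matching); the sketch below is written under that assumption and may differ from the authors' exact formulation:

```python
import numpy as np

def aligned_distance(parts_a, parts_b):
    """Shortest-path alignment over the pairwise Euclidean distance matrix
    of vertical part features. Parts must match in order (top to bottom)
    but may shift or stretch, which tolerates truncation, occlusion, and
    loose detection boxes."""
    D = np.linalg.norm(parts_a[:, None, :] - parts_b[None, :, :], axis=2)
    m, n = D.shape
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i, j] = D[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[m, n]

# Three vertical part embeddings per image (toy 2-D features):
a = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(aligned_distance(a, a))        # 0.0 for identical part sequences
print(aligned_distance(a, a[::-1]))  # > 0 when the part content differs in order
```

Because the path may move diagonally or repeat a part, two crops whose parts are vertically shifted can still be matched cheaply, whereas a rigid part-by-part comparison would penalize the shift.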

Text Feature Integration – OCR results from multiple frames are concatenated with [SEP] tokens and encoded by BERT. The resulting text embedding is fused with visual features for the final metric‑learning objective (triplet loss with MDR regularization).
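The fusion step can be sketched as weighted concatenation of L2-normalized branch embeddings. The weighting scheme and weight value here are illustrative assumptions, and the MDR regularization term is omitted; only the fusion shape is shown:

```python
import numpy as np

def fuse(visual, text, w_text=0.5):
    """Late fusion: L2-normalize each branch embedding, concatenate with a
    text weight, and re-normalize so distances stay comparable.
    (Weighting scheme is an illustrative assumption, not the paper's.)"""
    v = visual / (np.linalg.norm(visual) + 1e-8)
    t = text / (np.linalg.norm(text) + 1e-8)
    fused = np.concatenate([v, w_text * t])
    return fused / (np.linalg.norm(fused) + 1e-8)

v = np.array([0.2, 0.9, 0.1])  # toy visual embedding
t = np.array([0.7, 0.3])       # toy text embedding
emb = fuse(v, t)
print(emb.shape)                              # (5,)
print(round(float(np.linalg.norm(emb)), 6))   # 1.0 — unit norm for metric learning
```

Normalizing both before and after fusion keeps Euclidean distances on the fused embedding well-behaved, which matters when the same embedding feeds a margin-based triplet objective.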

Model Performance – The multimodal system achieves over 95% accuracy and recall, significantly improving online metrics and inference speed. Qualitative examples show successful handling of previously hard cases such as similar‑looking signboards, occluded signs, and text‑only variations.

Future Work and Challenges

1. Data – Employ semi‑supervised and active learning to automatically discover and label corner cases, reducing manual annotation cost.

2. Model – Explore Transformer‑based backbones for both visual and textual modalities, leveraging their global receptive field and flexible multimodal fusion capabilities.

Tags: computer vision, deep learning, POI, multimodal learning, metric learning, image retrieval
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.