End-to-End BEV+Transformer Perception and Modeling for High-Definition Map Production
By fusing LiDAR point clouds and camera images into a unified bird's‑eye‑view (BEV) space and applying Transformer‑based perception, multi‑sensor fusion, and graph‑diffusion modeling, the proposed BEV+Transformer framework automatically detects and smooths ground‑level line features and signs for high‑definition maps with centimeter‑level accuracy, improving production efficiency and reducing cost.
Overview
This article introduces the application of BEV+Transformer end‑to‑end perception and modeling techniques across various Gaode (Amap) business scenarios, focusing on the automation of ground‑level elements (line features and ground signs) in high‑definition (HD) maps.
The solution fuses data from multiple sensors on collection vehicles (LiDAR and cameras) across spatial and temporal dimensions, providing robust perception and modeling of road surface elements, thereby improving map production efficiency and reducing cost.
1. Business Analysis
HD maps are a critical foundation for autonomous driving, enhancing perception, decision‑making, and control. Ground elements in HD maps consist of line features (lane lines, road boundaries) and ground signs (guidance lines, zebra crossings, etc.). These elements support vehicle localization and path planning. Two main challenges are identified:
Higher positional accuracy requirements (centimeter‑level).
Difficulty in recognizing diverse and often degraded ground elements.
To meet the accuracy demand, collection vehicles are equipped with LiDAR (point clouds) and cameras (images). A typical sensor layout, similar to the nuScenes setup, is illustrated.
Ground elements also suffer from wear, occlusion, and varying reflectivity, which degrades point‑cloud and image quality. Additionally, line features require global topological consistency across local map tiles.
2. BEV+Transformer Technology Overview
Perception and modeling are performed in a Bird’s‑Eye‑View (BEV) space, where height information is less critical. The pipeline projects both LiDAR point clouds and camera images into a unified BEV plane, fuses them, and extracts features for high‑precision ground‑element detection.
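To make the BEV projection step concrete, here is a minimal numpy sketch of orthographically rasterizing a LiDAR point cloud onto a BEV grid. The grid ranges, resolution, and the choice of a max-height cell statistic are illustrative assumptions, not the production configuration.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 50.0), y_range=(-25.0, 25.0), res=0.1):
    """Orthographically project LiDAR points (N, 3) onto a BEV grid.

    Each cell stores the maximum point height; empty cells stay 0.
    Ranges and 0.1 m resolution are illustrative assumptions.
    """
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((H, W), dtype=np.float32)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    xi = ((x[keep] - x_range[0]) / res).astype(int)
    yi = ((y[keep] - y_range[0]) / res).astype(int)
    np.maximum.at(bev, (xi, yi), z[keep])  # max-pool heights into cells
    return bev
```

In the real pipeline this raster (or a learned pillar/voxel encoding) feeds the BEV feature extractor alongside the camera branch.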
The Transformer, originally successful in NLP and later in computer vision, is employed for view transformation, feature fusion, and instance perception.
3. Technical Solution Construction and Evolution
The proposed framework consists of three modules: local ground‑element perception, global line‑topology modeling, and line‑attribute change‑point detection.
Local Ground‑Element Perception Module
Three model variants are described:
GroundElement2Former – single‑sensor (LiDAR or camera) perception.
Fusion‑GroundElement2Former – multi‑sensor and temporal fusion.
Fusion‑SmoothGroundElement2Former – adds cross‑frame smoothness constraints.
1. GroundElement2Former
Processes either LiDAR or camera data separately. LiDAR points are orthographically projected to a BEV raster; camera images are transformed from perspective (PV) to BEV using IPM, corrected by a Transformer‑based PV2BEV module that leverages LiDAR‑derived ground height for accurate mapping. Deformable Attention aligns PV features with BEV queries.
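The IPM step above can be sketched with the standard flat-ground homography: for points on the z = 0 plane, pixels relate to ground coordinates via p ~ K [r1 r2 t]. This flat-ground assumption is exactly what the article's LiDAR-height-corrected PV2BEV module improves on; the code below is only the textbook baseline.

```python
import numpy as np

def ipm_homography(K, R, t):
    """Homography mapping ground-plane points (X, Y, 1) with z = 0 to
    image pixels: p ~ K [r1 r2 t] [X Y 1]^T.

    K: 3x3 camera intrinsics; R, t: world-to-camera extrinsics.
    Assumes a perfectly flat ground plane (the baseline IPM assumption).
    """
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def project_ground_points(H, pts_xy):
    """Project (N, 2) ground coordinates to (N, 2) pixel coordinates."""
    pts_h = np.column_stack([pts_xy, np.ones(len(pts_xy))])
    uvw = (H @ pts_h.T).T
    return uvw[:, :2] / uvw[:, 2:3]
```

Inverting this homography maps image pixels back onto the BEV ground plane, which is how the PV image is warped before the Transformer-based correction.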
Instance segmentation is performed with Mask2Former, followed by post‑processing: line features are skeletonized into vector points (NDS format) and ground signs are converted to oriented bounding boxes (OBB format) using OpenCV functions.
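For the OBB conversion, the article uses OpenCV functions (e.g. a minimum-area rectangle fit). As a dependency-free illustration, a PCA-based oriented box gives a comparable result whenever a sign's principal axis aligns with its long edge; treat this as an approximation, not the production routine.

```python
import numpy as np

def oriented_bbox(points):
    """Fit an oriented bounding box (OBB) to 2-D foreground points via PCA.

    Approximation of cv2.minAreaRect-style fitting: principal axes come
    from the covariance eigenvectors, extents from projections onto them.
    Returns (center, (length, width), angle_rad of the major axis).
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    centered = pts - center
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    axes = vecs[:, ::-1]                 # reorder so the major axis is first
    proj = centered @ axes               # coordinates in the box frame
    size = proj.max(axis=0) - proj.min(axis=0)
    angle = np.arctan2(axes[1, 0], axes[0, 0])
    return center, size, angle
```

The resulting (center, size, angle) tuple is the OBB representation that downstream map compilation consumes.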
2. Fusion‑GroundElement2Former
Introduces two new modules:
Cross‑Sensor Fusion – aligns LiDAR and image BEV features via Deformable Attention with learned offsets.
Offline Temporal Fusion – aligns adjacent frames via affine transforms derived from vehicle pose and concatenates their BEV features along the channel dimension.
This design mitigates single‑sensor failures caused by adverse weather or occlusion.
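The core of the cross-sensor fusion step is deformable-attention sampling: each LiDAR-BEV query gathers image-branch BEV features at a few offset locations and combines them with attention weights. In the real model a small network predicts the offsets and weights per query; in this sketch they are plain inputs.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (C, H, W) feature map at fractional (x, y)."""
    C, Hh, Ww = feat.shape
    x0 = int(np.clip(np.floor(x), 0, Ww - 2))
    y0 = int(np.clip(np.floor(y), 0, Hh - 2))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[:, y0, x0]
            + dx * (1 - dy) * feat[:, y0, x0 + 1]
            + (1 - dx) * dy * feat[:, y0 + 1, x0]
            + dx * dy * feat[:, y0 + 1, x0 + 1])

def deformable_fuse(query_xy, img_bev, offsets, weights):
    """Deformable-attention fusion reduced to its sampling step.

    Samples img_bev at query_xy + each learned offset and returns the
    softmax-weighted sum. Offsets/weights are inputs here for illustration;
    the model predicts them per query.
    """
    w = np.exp(weights) / np.exp(weights).sum()   # softmax over sample points
    samples = [bilinear_sample(img_bev, query_xy[0] + ox, query_xy[1] + oy)
               for ox, oy in offsets]
    return (w[:, None] * np.stack(samples)).sum(axis=0)
```

Because sampling locations are learned rather than fixed, the model can compensate for small LiDAR-camera misalignments instead of requiring perfect calibration.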
3. Fusion‑SmoothGroundElement2Former
Addresses cross‑tile line smoothness by enforcing temporal consistency: the previous frame’s perception result is encoded as a mask prompt and fused with the current BEV feature before Mask2Former decoding, reducing missed detections and abrupt jumps.
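A crude sketch of the mask-prompt idea: rasterize the previous frame's vectorized line result into a binary channel and stack it onto the current BEV features before decoding. The dense line-sampling rasterizer and plain channel concatenation below are simplifying assumptions; the article's prompt encoding is learned.

```python
import numpy as np

def mask_prompt(prev_polyline, shape):
    """Rasterize the previous frame's polyline result into a binary mask
    (a stand-in for the article's learned mask-prompt encoding)."""
    mask = np.zeros(shape, dtype=np.float32)
    for (x0, y0), (x1, y1) in zip(prev_polyline[:-1], prev_polyline[1:]):
        for s in np.linspace(0.0, 1.0, 50):   # densely sample each segment
            r = int(round(y0 + s * (y1 - y0)))
            c = int(round(x0 + s * (x1 - x0)))
            if 0 <= r < shape[0] and 0 <= c < shape[1]:
                mask[r, c] = 1.0
    return mask

def fuse_prompt(bev_feat, prompt):
    """Attach the prompt as an extra channel before Mask2Former decoding."""
    return np.concatenate([bev_feat, prompt[None]], axis=0)
```

The decoder thus sees where lines ended in the previous tile, biasing it toward continuous, non-jumping predictions.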
Global Line Topology Modeling Module
Combines an Attention‑based Graph Neural Network (Attn‑GNN) for cross‑frame line matching with a Diffusion‑Model‑based smoothing stage (PolyDiffuse). The Attn‑GNN treats each line as a graph node, applying self‑attention for intra‑frame discrimination and cross‑attention for inter‑frame matching, followed by the Sinkhorn algorithm to resolve correspondences.
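The Sinkhorn step turns the Attn-GNN's matching scores into a (near) doubly stochastic soft-assignment matrix by alternating row and column normalization. This sketch assumes a square score matrix and omits the dustbin row/column often added for unmatched lines.

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Sinkhorn normalization of a matching-score matrix.

    Alternating row/column normalization drives exp(scores) toward a
    doubly stochastic matrix, whose entries read as soft correspondence
    probabilities between lines in adjacent frames.
    """
    P = np.exp(scores)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # normalize rows
        P /= P.sum(axis=0, keepdims=True)   # normalize columns
    return P
```

A hard matching can then be read off with a greedy or Hungarian assignment over the resulting probabilities.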
The diffusion stage refines the assembled topology, correcting errors introduced during perception or matching.
Line Attribute Change‑Point Module
Predicts precise locations and categories of attribute change points on lines using Deformable Attention‑enhanced queries. The module achieves ~20 cm positional accuracy.
4. Performance Evaluation and Conclusions
Experiments on both simple and complex scenes demonstrate that Fusion‑SmoothGroundElement2Former delivers accurate and robust perception results, even under severe occlusion or sensor degradation. Global topology modeling produces smooth, complete line networks, and the approach also meets stringent localization requirements (lateral error ≤ 0.2 m, longitudinal error ≤ 1 m, at > 99 % accuracy).
5. Outlook
The current BEV perception framework serves HD‑map automation and multi‑sensor fusion localization. Future work aims to evolve the system into a universal road‑scene perceiver (Uni‑Road‑Perceiver) that can handle a broader set of road elements and support downstream tasks such as mapping and localization.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.