
End-to-End BEV+Transformer Perception and Modeling for High-Definition Map Production

By fusing LiDAR point clouds and camera images into a unified bird's-eye-view space and applying Transformer-based perception, multi-sensor fusion, and graph-diffusion modeling, the proposed BEV+Transformer framework automatically detects and smooths ground-level line features and signs for high-definition maps with centimeter-level accuracy, boosting production efficiency and reducing cost.

Amap Tech

Overview

This article introduces the application of BEV+Transformer end-to-end perception and modeling techniques in various Gaode (Amap) business scenarios, focusing on the automation of ground-level elements (line features and ground signs) in high-definition (HD) maps.

The solution fuses data from multiple sensors on collection vehicles (LiDAR and cameras) across spatial and temporal dimensions, providing robust perception and modeling of road surface elements, thereby improving map production efficiency and reducing cost.

1. Business Analysis

HD maps are a critical foundation for autonomous driving, enhancing perception, decision‑making, and control. Ground elements in HD maps consist of line features (lane lines, road boundaries) and ground signs (guidance lines, zebra crossings, etc.). These elements support vehicle localization and path planning. Two main challenges are identified:

Higher positional accuracy requirements (centimeter‑level).

Difficulty in recognizing diverse and often degraded ground elements.

To meet the accuracy demand, collection vehicles are equipped with LiDAR (for point clouds) and cameras (for images). An illustration of a typical sensor layout (from the nuScenes dataset) is shown.

Ground elements also suffer from wear, occlusion, and varying reflectivity, which degrades point‑cloud and image quality. Additionally, line features require global topological consistency across local map tiles.

2. BEV+Transformer Technology Overview

Perception and modeling are performed in a Bird’s‑Eye‑View (BEV) space, where height information is less critical. The pipeline projects both LiDAR point clouds and camera images into a unified BEV plane, fuses them, and extracts features for high‑precision ground‑element detection.

The Transformer, originally successful in NLP and later in computer vision, is employed for view transformation, feature fusion, and instance perception.

3. Technical Solution Construction and Evolution

The proposed framework consists of three modules: local ground‑element perception, global line‑topology modeling, and line‑attribute change‑point detection.

Local Ground‑Element Perception Module

Three model variants are described:

GroundElement2Former – single‑sensor (LiDAR or camera) perception.

Fusion‑GroundElement2Former – multi‑sensor and temporal fusion.

Fusion‑SmoothGroundElement2Former – adds cross‑frame smoothness constraints.

1. GroundElement2Former

Processes either LiDAR or camera data separately. LiDAR points are orthographically projected to a BEV raster; camera images are transformed from perspective (PV) to BEV using IPM, corrected by a Transformer‑based PV2BEV module that leverages LiDAR‑derived ground height for accurate mapping. Deformable Attention aligns PV features with BEV queries.
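The LiDAR branch of this pipeline reduces to a scatter operation: project each point orthographically onto a ground-plane grid and aggregate per-cell statistics. The following is a minimal numpy sketch of such a BEV rasterization; the function name, grid ranges, resolution, and choice of per-cell channels (max height, max intensity) are illustrative assumptions, not the article's exact configuration.

```python
import numpy as np

def lidar_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), res=0.1):
    """Orthographically project LiDAR points (N, 4: x, y, z, intensity)
    onto a BEV raster. Each cell keeps the max height and max intensity
    of the points falling into it (empty cells stay at zero)."""
    H = int((y_range[1] - y_range[0]) / res)
    W = int((x_range[1] - x_range[0]) / res)
    x, y, z, i = points.T
    # Discard points outside the BEV region of interest
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, i = x[keep], y[keep], z[keep], i[keep]
    col = ((x - x_range[0]) / res).astype(int)
    row = ((y - y_range[0]) / res).astype(int)
    bev = np.zeros((2, H, W), dtype=np.float32)   # channel 0: max z, 1: max intensity
    # Unbuffered scatter-max so duplicate cell indices aggregate correctly
    np.maximum.at(bev[0], (row, col), z)
    np.maximum.at(bev[1], (row, col), i)
    return bev
```

In practice the resulting raster is fed to a convolutional backbone to produce the LiDAR BEV feature map that the later fusion stages consume.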

Instance segmentation is performed with Mask2Former, followed by post‑processing: line features are skeletonized into vector points (NDS format) and ground signs are converted to oriented bounding boxes (OBB format) using OpenCV functions.
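For the OBB post-processing step, the article uses OpenCV functions (e.g. a minimum-area rectangle fit). As a dependency-free sketch of the same idea, an oriented box can be approximated from a ground sign's mask pixels by PCA; this is an illustrative substitute, not the production method, and for non-elongated shapes it only approximates the true minimum-area rectangle.

```python
import numpy as np

def obb_from_points(pts):
    """Fit an oriented bounding box to 2-D points (N, 2) via PCA.
    Returns (center, (length, width), angle_rad), with the box axis
    aligned to the dominant direction of the point spread."""
    center = pts.mean(axis=0)
    centered = pts - center
    # Principal axes of the point spread
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axis = eigvecs[:, np.argmax(eigvals)]          # dominant direction
    angle = np.arctan2(axis[1], axis[0])
    # Rotate points into the box frame (rotation by -angle)
    rot = np.array([[np.cos(-angle), -np.sin(-angle)],
                    [np.sin(-angle),  np.cos(-angle)]])
    local = centered @ rot.T
    mins, maxs = local.min(axis=0), local.max(axis=0)
    box_center = center + ((mins + maxs) / 2) @ rot
    return box_center, maxs - mins, angle
```

The OpenCV equivalent would be `cv2.minAreaRect` applied to the mask contour, which returns the same (center, size, angle) triple.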

2. Fusion‑GroundElement2Former

Introduces two new modules:

Cross‑Sensor Fusion – aligns LiDAR and image BEV features via Deformable Attention with learned offsets.

Offline Temporal Fusion – aligns adjacent frames using affine transforms and concatenates BEV features along the channel dimension.

This design mitigates single‑sensor failures caused by adverse weather or occlusion.
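The offline temporal fusion described above can be sketched in a few lines: warp the previous frame's BEV feature into the current frame with the affine transform derived from ego-motion, then concatenate along the channel dimension. This is a minimal numpy illustration using nearest-neighbour sampling; function names and the hard zero-fill for out-of-view cells are assumptions.

```python
import numpy as np

def warp_bev(feat, affine):
    """Nearest-neighbour warp of a (C, H, W) BEV feature map.
    `affine` is a 2x3 matrix mapping current-frame cell coordinates
    (col, row) to previous-frame coordinates."""
    C, H, W = feat.shape
    cols, rows = np.meshgrid(np.arange(W), np.arange(H))
    src = affine @ np.stack([cols.ravel(), rows.ravel(), np.ones(H * W)])
    sc, sr = np.round(src).astype(int)
    valid = (sc >= 0) & (sc < W) & (sr >= 0) & (sr < H)
    out = np.zeros_like(feat)   # cells leaving the grid are zero-filled
    out[:, rows.ravel()[valid], cols.ravel()[valid]] = feat[:, sr[valid], sc[valid]]
    return out

def temporal_fuse(prev_feat, cur_feat, affine):
    """Align the previous frame's BEV feature and concatenate channels."""
    return np.concatenate([warp_bev(prev_feat, affine), cur_feat], axis=0)
```

A learned fusion head then operates on the doubled channel dimension, so evidence missing in one frame (e.g. an occluded marking) can be recovered from the other.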

3. Fusion‑SmoothGroundElement2Former

Addresses cross‑tile line smoothness by enforcing temporal consistency: the previous frame’s perception result is encoded as a mask prompt and fused with the current BEV feature before Mask2Former decoding, reducing missed detections and abrupt jumps.

Global Line Topology Modeling Module

Combines an Attention-based Graph Neural Network (Attn-GNN) for cross-frame line matching with a Diffusion-Model-based smoothing stage (PolyDiffuse). The Attn-GNN treats each line as a graph node, applying self-attention for intra-frame discrimination and cross-attention for inter-frame matching, followed by the Sinkhorn algorithm to resolve correspondences.
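The Sinkhorn step turns the raw cross-frame affinity scores into a soft assignment by alternately normalising rows and columns until the matrix is (approximately) doubly stochastic. A minimal numpy version, without the dustbin row/column that full matchers add for unmatched lines:

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Normalise a cross-frame matching score matrix (M, N) into a
    near-doubly-stochastic soft assignment via Sinkhorn iterations."""
    P = np.exp(scores)                          # positive kernel
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)       # row normalisation
        P /= P.sum(axis=0, keepdims=True)       # column normalisation
    return P
```

Each line is then matched to the column with the highest assignment weight, giving the cross-frame correspondences that the topology assembly consumes.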

The diffusion stage refines the assembled topology, correcting errors introduced during perception or matching.

Line Attribute Change‑Point Module

Predicts precise locations and categories of attribute change points on lines using Deformable Attention‑enhanced queries. The module achieves ~20 cm positional accuracy.

4. Performance Evaluation and Conclusions

Experiments on simple and complex scenes demonstrate that Fusion‑SmoothGroundElement2Former delivers accurate and robust perception results, even under severe occlusion or sensor degradation. Global topology modeling produces smooth, complete line networks, and the approach also meets stringent localization requirements (0.2 m lateral, 1 m longitudinal error, >99 % accuracy).

5. Outlook

The current BEV perception framework serves HD‑map automation and multi‑sensor fusion localization. Future work aims to evolve the system into a universal road‑scene perceiver (Uni‑Road‑Perceiver) that can handle a broader set of road elements and support downstream tasks such as mapping and localization.

Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
