
Vision‑Based Bird’s‑Eye‑View (BEV) Representation and Solutions for Autonomous Driving

Vision‑based Bird’s‑Eye‑View (BEV) transforms camera data into a scale‑invariant, geometry‑friendly top‑down map. This overview covers perspective‑transformation modules such as Lift‑Splat‑Shoot and Pseudo‑LiDAR, supporting components such as deformable convolution and deformable attention, and representative detectors including BEVDet, BEVDepth, Detr3D, BEVFormer, and PETR. It closes with open problems: depth‑estimation bottlenecks, multimodal transformer fusion, and foundation‑model generalization.

Amap Tech

Bird’s‑Eye‑View (BEV) is a sensor‑data representation that has become a standard technique in autonomous driving, supporting high‑definition map element recognition, lane‑topology construction, and vehicle‑side fused localization.

Figure 1‑1: BEV application scenarios in Gaode (partial)

BEV offers two main advantages: (1) scale invariance – object size in BEV depends only on class, not distance; (2) decision‑friendly geometry – parallel lane lines remain parallel in BEV, unlike the converging perspective in PV space.

The core of BEV processing is the perspective‑transformation module, which converts data between the perspective view (PV) and BEV/3D spaces.

2D→3D conversion methods estimate a 3‑D representation from 2‑D pixels or features. The dominant approach is Lift‑Splat‑Shoot (LSS), which lifts each image feature into a camera frustum using a predicted per‑pixel depth distribution, splats the frustum features onto a BEV grid, and then applies (“shoots”) a task‑specific head on the BEV features. Pseudo‑LiDAR instead predicts a dense depth map, back‑projects each pixel into a 3‑D point cloud, and processes that cloud with LiDAR‑style detectors.

Figure 1‑3: LSS pipeline

Figure 1‑4: Pseudo‑LiDAR pipeline
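The lifting step of LSS can be sketched in a few lines: each pixel’s feature vector is weighted by a softmax distribution over discrete depth bins, producing one candidate feature per depth. A minimal NumPy sketch, with an illustrative function name and toy shapes not taken from any particular LSS implementation:

```python
import numpy as np

def lift_pixel(feature, depth_logits):
    """'Lift' one pixel: weight its feature vector by a softmax depth
    distribution, producing one candidate feature per depth bin."""
    p = np.exp(depth_logits - depth_logits.max())
    p /= p.sum()                            # (D,) depth distribution
    return p[:, None] * feature[None, :]    # (D, C) frustum features

# toy example: C=4 channels, D=3 depth bins
feat = np.ones(4)
logits = np.array([0.0, 2.0, 0.0])          # network favors the middle bin
frustum = lift_pixel(feat, logits)
assert frustum.shape == (3, 4)
assert np.allclose(frustum.sum(0), feat)    # feature mass preserved over depth
```

Because the depth distribution sums to one, the pixel’s feature mass is distributed, not duplicated, across depth hypotheses; the subsequent “splat” step sum‑pools these frustum features into BEV cells.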

3D→2D conversion methods start from 3‑D points and retrieve corresponding 2‑D features. Explicit mapping projects predefined or learned 3‑D reference points onto the image plane. Implicit mapping lets the network learn the 3‑D‑to‑2‑D correspondence, often using deformable attention.

Figure 1‑5: Explicit mapping

Figure 1‑6: Implicit mapping
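Explicit mapping reduces to standard pinhole projection of the 3‑D reference points. A minimal sketch, assuming a single camera with intrinsics `K` and a world‑to‑camera extrinsic matrix (all names hypothetical):

```python
import numpy as np

def project_points(pts_3d, K, T_cam_from_world):
    """Project Nx3 world points to pixel coordinates via a pinhole camera,
    flagging points behind the camera as invalid."""
    pts_h = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])   # homogeneous
    cam = (T_cam_from_world @ pts_h.T).T[:, :3]              # camera frame
    valid = cam[:, 2] > 0                                    # in front of camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                              # perspective divide
    return uv, valid

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0.,   0.,   1.]])
T = np.eye(4)                        # camera at world origin, looking along +z
uv, valid = project_points(np.array([[0., 0., 10.]]), K, T)
# a point on the optical axis projects to the principal point (320, 240)
```

In Detr3D‑style detectors this projection is applied per camera, and image features are bilinearly sampled at the resulting `uv` locations for every valid reference point.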

The BEV pipeline also relies on deformable modules. Deformable convolution adds learnable offsets to the sampling grid, making the convolution pattern adaptive. Deformable attention extends this idea to the query‑key mechanism, allowing the network to learn offset‑adjusted attention locations.

Figure 1‑7: Deformable convolution

Figure 1‑8: Deformable attention
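The core of both deformable variants is bilinear sampling at offset‑shifted locations. A NumPy sketch of just the sampling step; in a real network the offsets are predicted by a small convolution or linear layer, whereas here they are supplied directly:

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Bilinearly sample a 2-D feature map at a fractional location."""
    H, W = fmap.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (fmap[y0, x0] * (1-wy)*(1-wx) + fmap[y0, x1] * (1-wy)*wx +
            fmap[y1, x0] * wy*(1-wx)    + fmap[y1, x1] * wy*wx)

def deformable_sample(fmap, base_points, offsets):
    """Sample at base grid locations shifted by (learned) offsets."""
    return np.array([bilinear_sample(fmap, y + dy, x + dx)
                     for (y, x), (dy, dx) in zip(base_points, offsets)])

fmap = np.arange(16, dtype=float).reshape(4, 4)
vals = deformable_sample(fmap, [(1, 1), (2, 2)], [(0.5, 0.0), (0.0, 0.5)])
```

Deformable convolution applies this sampling inside each kernel window; deformable attention applies it at a sparse set of points around each query’s reference location, which is what makes BEV cross‑attention over large images tractable.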

Typical loss functions in BEV‑based perception include a Box Loss (L1 loss for bounding‑box regression) and Focal Loss for classification, a cross‑entropy variant that down‑weights easy examples to counter the severe foreground–background imbalance in detection.
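Focal Loss can be written compactly. This NumPy sketch follows the standard binary formulation with the usual alpha/gamma defaults; it is illustrative and not tied to any particular BEV codebase:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: (1-pt)^gamma down-weights well-classified
    examples; alpha re-balances the rare positive class."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(target == 1, p, 1 - p)       # prob of the true class
    a = np.where(target == 1, alpha, 1 - alpha)
    return -(a * (1 - pt) ** gamma * np.log(pt))

# an easy negative (p=0.01) contributes far less than a hard one (p=0.9)
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.9]), np.array([0]))
assert hard[0] > 100 * easy[0]
```

With `gamma = 0` and `alpha = 0.5` this reduces (up to a constant) to ordinary balanced cross‑entropy, which is why Focal Loss is often described as its generalization.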

Representative BEV solutions

BEVDet adopts a four‑stage pipeline: image‑feature encoder (e.g., ResNet or Swin‑Transformer), LSS‑based view‑transformer (2D→3D), BEV encoder, and detection head. It uses Focal Loss for classification and L1 loss for bounding‑box regression.

Figure 2‑1: BEVDet architecture

BEVDepth improves depth estimation by adding an explicitly supervised depth branch (DepthNet) that predicts a dense depth map, conditioning on camera intrinsics and using squeeze‑and‑excitation (SE) weighted features, residual blocks, and deformable convolutions. The refined depth yields more accurate 3‑D frustums, followed by voxel pooling and detection.

Figure 2‑2: Depth comparison (LSS vs. BEVDepth)

Figure 2‑3: BEVDepth pipeline
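The voxel‑pooling step, scattering frustum features into BEV cells, can be sketched as a sum over the points that fall in each cell. A simplified single‑height‑slab version; function and parameter names are hypothetical, and real implementations fuse this scatter into an efficient CUDA kernel:

```python
import numpy as np

def voxel_pool(points_xy, feats, grid_min, cell, grid_shape):
    """Sum-pool point features into BEV grid cells, discarding points
    that fall outside the grid."""
    bev = np.zeros(grid_shape + (feats.shape[1],))
    idx = np.floor((points_xy - grid_min) / cell).astype(int)
    inside = ((idx >= 0) & (idx < np.array(grid_shape))).all(axis=1)
    for (ix, iy), f in zip(idx[inside], feats[inside]):
        bev[ix, iy] += f
    return bev

pts = np.array([[0.5, 0.5], [0.6, 0.4], [3.5, 3.5]])   # BEV-plane positions
feats = np.ones((3, 2))
bev = voxel_pool(pts, feats, grid_min=np.array([0., 0.]), cell=1.0,
                 grid_shape=(4, 4))
assert bev[0, 0, 0] == 2.0 and bev[3, 3, 0] == 1.0     # two points share a cell
```

Because pooling is a plain sum, sharper depth distributions from DepthNet concentrate feature mass in the correct cells instead of smearing it along the ray.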

Detr3D extends the DETR framework to 3‑D detection by decoding 3‑D reference points from object queries, projecting them onto the image planes of the surrounding cameras, and aggregating the sampled image features. The detection head uses bipartite matching and the same loss functions as DETR.

Figure 2‑5: Detr3D architecture

BEVFormer incorporates temporal self‑attention and spatial cross‑attention to fuse multi‑frame BEV features, improving detection robustness. Deformable attention is used for both temporal alignment across frames and spatial aggregation across camera views.

Figure 2‑7: BEVFormer framework

PETR removes explicit 3‑D→2‑D projection: a 3‑D coordinate generator back‑projects the camera frustum into 3‑D space, and a 3‑D position encoder fuses these coordinates with the 2‑D image features to produce 3‑D position‑aware features that a DETR‑style decoder can query directly. The pipeline comprises an image encoder, 3‑D coordinate generator, 3‑D position encoder, decoder, and detection head.

Figure 2‑10: PETR architecture
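The 3‑D coordinate generator can be sketched as back‑projecting every pixel at each candidate depth along its camera ray. A minimal NumPy version with illustrative names and toy shapes:

```python
import numpy as np

def frustum_coords(H, W, depth_bins, K_inv):
    """Back-project every pixel at each candidate depth into 3-D camera
    coordinates -- the input to PETR's 3-D position encoder."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    rays = pix @ K_inv.T                              # one ray per pixel
    # (H*W, D, 3): each ray scaled by each candidate depth
    return rays[:, None, :] * depth_bins[None, :, None]

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0.,   0.,   1.]])
coords = frustum_coords(4, 6, np.array([5., 10.]), np.linalg.inv(K))
assert coords.shape == (24, 2, 3)
# in PETR, these coordinates (transformed to a shared frame) are flattened
# and passed through an MLP to form a 3-D position embedding that is
# added to the 2-D image features
```

Because the position embedding already carries 3‑D geometry, the decoder’s object queries can attend to the flattened multi‑camera features without any per‑query projection step.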

Summary and Outlook

Depth estimation remains the bottleneck for BEV perception; improving LSS, Pseudo‑LiDAR, LiDAR distillation, stereo, or structure‑from‑motion methods is a promising direction. Multimodal sensor fusion via transformer‑based attention and CLIP‑style text–image alignment is also gaining traction. Generalization across devices and datasets, and leveraging large‑scale foundation models for BEV tasks, remain open research challenges.

Tags: computer vision, deep learning, autonomous driving, 3D perception, BEV
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.