CVPR 2026: Learning Camera Pose from 10M Unlabeled Driving Videos
LA‑Pose shows that a model can learn accurate camera pose estimation for autonomous driving through self‑supervised pretraining on roughly ten million unlabeled driving video clips, followed by fine‑tuning on only a small amount of high‑quality 3D annotations. The reported result is a pose accuracy gain of over 10% at a drastically lower labeling cost.
Estimating camera pose between consecutive frames is a core geometric perception task for autonomous driving, but traditional approaches rely on expensive high‑quality 3D ground‑truth obtained from LiDAR, precise calibration, reconstruction pipelines, or simulators, limiting dataset diversity and inflating costs.
LA‑Pose: Two‑Stage Learning from Unlabeled Video
Wayve’s LA‑Pose replaces the need for massive 3D labels with a two‑stage pipeline. In the first stage, called Latent Action Pretraining, the team trains an inverse‑forward dynamics model on about 10 million unlabeled driving video segments. The model observes consecutive frames and learns a compact “latent action” representation that encodes how the visual scene changes over time (e.g., driving straight, turning left, turning right, stopping) without any pose supervision.
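The pretraining idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' architecture: it assumes each frame has already been reduced to a feature vector, and it uses random linear maps (`W_inv`, `W_fwd`, both hypothetical) in place of learned networks. One function infers a latent action from a pair of consecutive frames, the other predicts the next frame from the current frame plus that latent, and the loss only ever compares the prediction with the real next frame.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 256    # hypothetical per-frame feature size
LATENT_DIM = 50   # bottleneck width mentioned in the article

# Hypothetical linear weights standing in for learned networks.
W_inv = rng.normal(scale=0.05, size=(2 * FEAT_DIM, LATENT_DIM))
W_fwd = rng.normal(scale=0.05, size=(FEAT_DIM + LATENT_DIM, FEAT_DIM))


def inverse_dynamics(frame_t, frame_next):
    """Infer a latent action from a pair of consecutive frame features."""
    pair = np.concatenate([frame_t, frame_next], axis=-1)
    return np.tanh(pair @ W_inv)


def forward_dynamics(frame_t, latent_action):
    """Predict next-frame features from the current frame and the latent."""
    return np.concatenate([frame_t, latent_action], axis=-1) @ W_fwd


def reconstruction_loss(frame_t, frame_next):
    """Self-supervised objective: the only target is the next frame itself."""
    z = inverse_dynamics(frame_t, frame_next)
    pred = forward_dynamics(frame_t, z)
    return float(np.mean((pred - frame_next) ** 2)), z


# Synthetic stand-in for a batch of consecutive frame features.
frames_t = rng.normal(size=(8, FEAT_DIM))
frames_next = frames_t + 0.1 * rng.normal(size=(8, FEAT_DIM))

loss, latents = reconstruction_loss(frames_t, frames_next)
print(latents.shape, loss)
```

In a real training loop the two sets of weights would be optimized jointly to minimize this loss. Because the forward predictor only sees the current frame and a narrow latent, the latent is pushed to carry whatever changed between the frames.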
The second stage freezes the pretrained encoder and attaches a lightweight pose‑prediction head. This head is fine‑tuned on a small set of high‑quality 3D annotations, converting the latent action code into relative translation, rotation, field‑of‑view, and scale. The entire inference remains feed‑forward, suitable for real‑time deployment.
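A rough sketch of this second stage follows, under several assumptions that are not from the source: the frozen encoder is a fixed random projection (`W_encoder`), the head is a single linear layer fitted by ordinary least squares instead of gradient‑based fine‑tuning, the rotation is a 3‑vector (axis‑angle), and the labeled data is synthetic. It only illustrates the division of labor: the encoder weights are never touched, and a small head maps the latent code to translation, rotation, field of view, and scale.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 128   # hypothetical size of a frame-pair feature vector
LATENT_DIM = 50
POSE_DIM = 8      # assumed layout: translation(3) + rotation(3) + FoV(1) + scale(1)

# Stand-in for the pretrained encoder; nothing below updates these weights.
W_encoder = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))


def encode(pair_features):
    """Frozen encoder: frame-pair features -> latent action code."""
    return np.tanh(pair_features @ W_encoder)


def add_bias(x):
    """Append a constant column so the linear head has a bias term."""
    return np.hstack([x, np.ones((len(x), 1))])


def fit_pose_head(latents, poses):
    """Fit a linear head by least squares (a stand-in for fine-tuning)."""
    W_head, *_ = np.linalg.lstsq(add_bias(latents), poses, rcond=None)
    return W_head


def predict_pose(pair_features, W_head):
    """Feed-forward inference: encode, apply the head, split the outputs."""
    out = add_bias(encode(pair_features)) @ W_head
    return {
        "translation": out[:, 0:3],
        "rotation": out[:, 3:6],
        "fov": out[:, 6],
        "scale": out[:, 7],
    }


# Small synthetic labeled set, generated from a hidden linear head.
labelled_pairs = rng.normal(size=(200, INPUT_DIM))
true_head = rng.normal(size=(LATENT_DIM + 1, POSE_DIM))
labelled_poses = add_bias(encode(labelled_pairs)) @ true_head

encoder_before = W_encoder.copy()
W_head = fit_pose_head(encode(labelled_pairs), labelled_poses)
pred = predict_pose(labelled_pairs[:4], W_head)
print(pred["translation"].shape, pred["fov"].shape)
```

Only `W_head` is estimated from the labeled examples, which is why this stage can get by with a small annotation budget.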
Emergent Motion Structure
When the learned latent actions are visualized in two dimensions, similar motions naturally cluster together, forming distinct regions for straight driving, left turn, right turn, and stopping. This indicates that the model captures geometric motion priors despite the absence of explicit 3D labels. Experiments also reveal that a 50‑dimensional bottleneck outperforms higher‑dimensional representations for downstream pose estimation because compression forces the model to discard irrelevant appearance information and retain essential motion structure.
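The article does not say which projection was used for the two‑dimensional view, so the sketch below assumes plain PCA and runs it on synthetic 50‑dimensional latents with one made‑up cluster center per motion type. It shows the mechanics of such a visualization, not the paper's actual embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 50
MOTIONS = ["straight", "left turn", "right turn", "stopping"]

# Synthetic latents: one hypothetical cluster center per motion, plus noise.
centres = rng.normal(scale=3.0, size=(len(MOTIONS), LATENT_DIM))
labels = rng.integers(0, len(MOTIONS), size=400)
latents = centres[labels] + rng.normal(scale=0.5, size=(400, LATENT_DIM))


def pca_2d(x):
    """Project rows of x onto their top two principal components."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


points = pca_2d(latents)

# Report where each motion type lands in the 2D view.
for i, name in enumerate(MOTIONS):
    cx, cy = points[labels == i].mean(axis=0)
    print(f"{name:>10}: centroid=({cx:+.2f}, {cy:+.2f})")
```

With real latents, the `points` array would be passed to a scatter plot colored by motion type to check whether the clusters separate.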
Results: Higher Accuracy with Far Fewer Labels
On the Waymo and PandaSet autonomous‑driving benchmarks, LA‑Pose achieves more than a 10% improvement in pose accuracy compared with recent feed‑forward baselines, while requiring orders of magnitude fewer 3D annotations. Moreover, the model retains its advantage on the unseen PandaSet split, demonstrating strong cross‑dataset generalization, which is crucial for real‑world deployment across diverse cities, road types, and weather conditions.
Limitations and Future Directions
The authors note that reverse motion (e.g., backing up) still degrades performance because such examples are scarce in the pretraining data. They propose expanding both the pretraining and fine‑tuning datasets and extending Latent Action Pretraining to other video sources such as robot‑collected footage and handheld recordings.
Significance
LA‑Pose illustrates that geometric visual perception does not have to start from costly 3D annotation; the abundant motion signal inherent in everyday driving video can serve as a powerful self‑supervision source, potentially reshaping how autonomous‑driving systems acquire and scale their geometric understanding.
