Perception System Overview: Sensors, Fusion, Onboard Architecture, and Technical Challenges in Autonomous Driving
This article presents a comprehensive overview of autonomous driving perception, covering system fundamentals, sensor setups and fusion techniques, onboard processing architecture, and the key technical challenges such as precision‑recall balance, adverse weather, and small‑object detection.
Speaker: Li Yangguang, Pony.ai Tech Lead. Editor: An Xiaohong. Source: Pony.ai & DataFun AI Talk. Community: DataFun.
Perception Introduction
Sensor Setup & Sensor Fusion
Perception Onboard System
Perception Technical Challenges
1. Perception Introduction
The perception system ingests data from multiple sensors and high‑definition maps, processes them, and accurately perceives the vehicle's surrounding environment, providing downstream modules with obstacle positions, shapes, classes, velocities, and semantic understanding of special scenes such as construction zones, traffic lights, and signs.
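The per-obstacle output described above can be sketched as a small record type. This is an illustrative schema only, assuming typical fields (position, dimensions, class, velocity); it is not Pony.ai's actual data format.

```python
# A minimal sketch of a per-obstacle perception output record.
# Field and class names are illustrative, not a real production schema.
from dataclasses import dataclass
from enum import Enum

class ObstacleClass(Enum):
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    CYCLIST = "cyclist"
    UNKNOWN = "unknown"

@dataclass
class Obstacle:
    obstacle_id: int
    position: tuple        # (x, y, z) in the vehicle frame, meters
    dimensions: tuple      # (length, width, height), meters
    heading: float         # yaw angle, radians
    obstacle_class: ObstacleClass
    velocity: tuple        # (vx, vy), m/s, estimated by tracking

obs = Obstacle(1, (12.0, -1.5, 0.0), (4.5, 1.8, 1.5), 0.0,
               ObstacleClass.VEHICLE, (8.3, 0.1))
print(obs.obstacle_class.value)  # vehicle
```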
Key subsystems include:
Sensors – installation, field of view, detection range, data throughput, calibration accuracy, and time synchronization.
Object detection and classification – achieving high recall and precision using deep learning on 3D point clouds and 2D images, plus multi‑sensor fusion.
Multi‑object tracking – associating detections across frames to estimate obstacle motion.
Scene understanding – traffic lights, signs, construction areas, special categories (school bus, police car).
Distributed training infrastructure and evaluation system for machine‑learning models.
Data – large annotated datasets of 3D point clouds and 2D images.
Primary sensor categories:
LiDAR (light detection and ranging)
Camera
Radar (mmWave)
The above image shows typical perception output: detected vehicles, pedestrians, cyclists, and background map information.
By aggregating multiple frames, the system estimates the speed and direction of moving pedestrians and vehicles.
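The multi-frame speed and direction estimation can be illustrated with a finite-difference sketch over a tracked object's positions. This is a simplified illustration, assuming a constant frame rate and detections already associated into a track; the actual system's motion estimation is not described in this article.

```python
# Estimate a tracked object's speed and direction by finite differences
# over its center positions in consecutive frames.
import math

def estimate_velocity(positions, dt):
    """positions: list of (x, y) centers in meters, one per frame;
    dt: time between frames in seconds. Returns (speed m/s, heading rad)."""
    if len(positions) < 2:
        return 0.0, 0.0
    # Differencing over the whole window smooths per-frame jitter.
    (x0, y0), (x1, y1) = positions[0], positions[-1]
    elapsed = dt * (len(positions) - 1)
    vx, vy = (x1 - x0) / elapsed, (y1 - y0) / elapsed
    return math.hypot(vx, vy), math.atan2(vy, vx)

# A pedestrian moving ~1.4 m/s along +x, sampled at 10 Hz:
track = [(0.0, 0.0), (0.14, 0.0), (0.28, 0.0), (0.42, 0.0)]
speed, heading = estimate_velocity(track, dt=0.1)
print(round(speed, 2), round(heading, 2))  # 1.4 0.0
```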
2. Sensor Setup & Sensor Fusion
The following describes Pony.ai's third‑generation vehicle sensor layout and fusion solution.
Our sensor suite provides 360° coverage with a perception range of up to 200 m. It includes three LiDAR units (top and two sides) with 100 m range, four wide‑angle cameras for full‑view imaging, a forward‑facing mmWave radar, and a long‑focus camera extending detection to 200 m. This configuration supports autonomous driving in residential, commercial, and industrial environments.
The sensor arrangement was first presented at the 2018 World AI Conference.
Two wide‑angle cameras and one long‑focus camera capture traffic‑light information up to 200 m.
Fusion begins with precise calibration of each sensor to a common coordinate system, including camera intrinsics, LiDAR‑to‑camera extrinsics, and radar‑to‑GPS extrinsics. High‑precision calibration is essential for both result‑level and metadata‑level fusion.
The image shows 3D LiDAR points projected onto camera images, demonstrating accurate calibration.
Calibration is fully automated. Steps include:
Camera intrinsic calibration – completed within 2–3 minutes per camera.
LiDAR‑to‑camera extrinsic calibration – LiDAR triggers camera exposure to achieve sub‑50 ms time synchronization across four cameras.
3D and 2D data complement each other, and their fusion yields more precise perception results.
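The LiDAR-to-camera projection behind that image can be sketched with the standard pinhole-camera model: transform a LiDAR point into the camera frame with the extrinsics, then apply the intrinsic matrix. The calibration values below are placeholders, not real parameters.

```python
# Project a 3D LiDAR point into pixel coordinates using extrinsics
# (R, t) and the camera intrinsic matrix K. A textbook pinhole-camera
# sketch; all calibration values are illustrative.
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],    # fx, skew, cx
              [0.0, 1000.0, 360.0],    # fy, cy
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # LiDAR-to-camera rotation
t = np.array([0.0, 0.0, 0.0])          # LiDAR-to-camera translation (m)

def project(point_lidar):
    """Return (u, v) pixel coordinates, or None if behind the camera."""
    p_cam = R @ point_lidar + t        # transform into the camera frame
    if p_cam[2] <= 0:
        return None                    # point is behind the image plane
    uvw = K @ p_cam                    # apply intrinsics
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

u, v = project(np.array([2.0, 1.0, 10.0]))
print(round(u, 1), round(v, 1))  # 840.0 460.0
```

Overlaying many such projected points on the camera image is exactly how calibration quality is visualized: misalignment between point-cloud edges and image edges indicates extrinsic error.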
3. Perception Onboard System
The onboard architecture synchronizes LiDAR, camera, and radar data within 50 ms, performs frame‑wise detection and classification, then applies multi‑frame tracking before outputting results. The solution emphasizes:
Safety – near‑100 % detection recall.
Precision – high thresholds to avoid false positives that degrade driving comfort.
Comprehensiveness – output of all useful information (signs, traffic lights, scene semantics).
Efficiency – near‑real‑time processing of massive sensor streams.
Scalability – ability to adapt models to new cities, countries, and larger datasets.
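The pipeline described above (synchronize within 50 ms, detect per frame, then track across frames) can be sketched as follows. The function bodies are placeholders under the article's stated 50 ms synchronization window, not the production system.

```python
# Skeletal onboard pipeline: gate sensor frames on timestamp alignment
# (the article cites a 50 ms synchronization window), then run per-frame
# detection followed by multi-frame tracking.
SYNC_WINDOW_S = 0.050

def frames_synchronized(timestamps, window=SYNC_WINDOW_S):
    """True if all sensor timestamps fall within the sync window."""
    return max(timestamps) - min(timestamps) <= window

def detect(lidar, camera, radar):
    return []  # placeholder: per-frame detection + classification

def process_frame(lidar, camera, radar, tracker_state):
    ts = (lidar["ts"], camera["ts"], radar["ts"])
    if not frames_synchronized(ts):
        return tracker_state, None         # drop unsynchronized data
    detections = detect(lidar, camera, radar)
    tracker_state = tracker_state + [detections]  # placeholder tracking
    return tracker_state, detections

print(frames_synchronized((0.000, 0.030, 0.048)))  # True
print(frames_synchronized((0.000, 0.030, 0.060)))  # False
```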
4. Perception Technical Challenges
Key challenging scenarios include:
Balancing precision and recall.
Long‑tail cases such as heavy traffic intersections, rain, water splash, small objects, and diverse traffic‑light designs.
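The precision-recall balance can be made concrete with a toy threshold sweep: raising the detection confidence threshold removes false positives (better precision, hence smoother driving) but risks dropping real objects (worse recall, hence worse safety). The scores and labels below are made up for demonstration.

```python
# Precision-recall trade-off at different confidence thresholds.
def precision_recall(scores_labels, threshold):
    tp = sum(1 for s, y in scores_labels if s >= threshold and y == 1)
    fp = sum(1 for s, y in scores_labels if s >= threshold and y == 0)
    fn = sum(1 for s, y in scores_labels if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# (confidence score, ground-truth label) pairs for candidate detections
dets = [(0.95, 1), (0.90, 1), (0.85, 1), (0.60, 0), (0.55, 1), (0.50, 0)]
for th in (0.5, 0.75):
    p, r = precision_recall(dets, th)
    print(th, round(p, 2), round(r, 2))
# At 0.5 every object is kept (recall 1.0) but precision suffers;
# at 0.75 precision is perfect but one real object is missed.
```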
Rain can cause LiDAR to detect water droplets; the system filters splash reflections using combined LiDAR‑camera data.
Long‑tail scenarios also include water‑spraying trucks, where perception must recognize the mist as spray rather than a solid obstacle and adjust vehicle behavior accordingly.
Detecting small objects such as stray cats or dogs is critical for safety.
Traffic‑light recognition must handle varied designs, countdown timers, and back‑light conditions; dynamic exposure adjustment mitigates glare.
Camera waterproofing addresses extreme weather challenges.
Author Introduction
Li Yangguang, Pony.ai Tech Lead. Master's from Chinese Academy of Sciences, former roles at Baidu Ads Search and Autonomous Driving divisions, previously led perception system architecture. Currently responsible for autonomous driving perception research at Pony.ai.
Community & Recruitment
DataFun is a data‑intelligence community offering offline deep‑tech salons and online content curation, aiming to spread industrial experts' practical experience to practitioners.
For job opportunities at Pony.ai, follow the official account and submit your resume.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.