Real‑time 3D Virtual Avatar Motion Capture Using a Single Monocular Camera
Using AI‑driven 3D reconstruction from a single webcam, Bilibili’s new system captures facial expressions, full‑body pose, and hand gestures in real time and renders photorealistic avatars on consumer‑grade PCs with per‑frame latency of only a few milliseconds, dramatically reducing the cost and time required for virtual‑live‑streaming production.
With the rapid rise of virtual live streaming on platforms such as Bilibili, interest in 3D photorealistic virtual avatars has surged. 3D hyper‑realistic virtual hosts on Douyin have attracted millions of fans within a week, indicating a market trend toward immersive virtual broadcasting.
The traditional production pipeline for 3D virtual avatars is costly (hundreds of thousands to millions of RMB) and time‑consuming (3‑6 months). It typically spans modeling, rigging, motion driving, and rendering, and relies on expensive optical motion‑capture rigs that can cost tens of thousands of RMB per broadcast.
To lower these barriers, Bilibili’s AI platform and Tiangong Studio jointly created a 3D realistic “Fantasy Star” digital‑human solution that enables ordinary users to create their own virtual idols; it is showcased among Bilibili’s “Technology Landing Cases.”
Users can sculpt faces, shape bodies, and change outfits within minutes, using only a standard monocular webcam to capture facial expressions, body movements, and hand gestures, which are then rendered in real time on a virtual stage.
Background and Overview
As VR/AR applications proliferate, demand for motion capture grows. Existing motion‑capture technologies fall into three categories:
Optical motion capture – multiple cameras track markers in a controlled studio.
Inertial motion capture – sensors and gyroscopes capture movement without a dedicated studio.
Hybrid motion capture – combines optical markers and inertial sensors.
All three require additional hardware and are expensive for casual creators. The proposed solution uses pure computer‑vision techniques, requiring only a single camera to achieve full‑body, facial, and hand capture.
Technical Highlights
The core of the visual motion‑capture system is 3D reconstruction from 2D images. The pipeline processes an input video stream through facial expression estimation, 3D body reconstruction, and hand‑pose prediction. After confidence‑based post‑processing, the results are fed to an inverse‑kinematics (IK) module that generates joint‑rotation quaternions, which finally drive the rendering engine to animate the avatar.
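The per‑frame flow described above can be sketched as a short Python loop. Every name here (`StageResult`, `confidence_gate`, `process_frame`) is a hypothetical placeholder, and the three estimators are stubs standing in for neural networks; this illustrates only the stage ordering and the confidence‑based post‑processing, not Bilibili’s actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StageResult:
    values: List[float]   # e.g. blendshape weights or 3D joint positions
    confidence: float     # per-stage confidence score

def confidence_gate(result: StageResult, prev: Optional[StageResult],
                    threshold: float = 0.5) -> StageResult:
    """Confidence-based post-processing: fall back to the previous
    frame's estimate when the current one is unreliable."""
    if result.confidence < threshold and prev is not None:
        return prev
    return result

def process_frame(frame, state: dict) -> dict:
    # Each estimator would be a neural network in the real system;
    # here they are stubs returning fixed values (frame is unused).
    face = StageResult(values=[0.1, 0.9], confidence=0.95)
    body = StageResult(values=[0.0, 1.2, 0.4], confidence=0.3)  # unreliable
    hands = StageResult(values=[0.2] * 5, confidence=0.8)

    face = confidence_gate(face, state.get("face"))
    body = confidence_gate(body, state.get("body"))
    hands = confidence_gate(hands, state.get("hands"))

    state.update(face=face, body=body, hands=hands)
    # An IK solver would next convert these estimates into joint-rotation
    # quaternions before handing them to the rendering engine.
    return state
```

In the real pipeline the gated outputs would feed the IK module; the sketch stops at the point where stable per‑stage estimates are available.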
Facial Expression Capture & 3D Face Reconstruction
High‑quality 3D facial data are traditionally captured with expensive equipment in lab settings, creating a domain gap with in‑the‑wild webcam footage. The team therefore adopted self‑supervised training based on differentiable rendering, collecting ~100k paired 3D point clouds and RGB images with an iPhone depth camera plus a million‑scale video dataset to improve generalization.
Key Methods
Pre‑training on natural‑scene data with 3D supervision, followed by adversarial training on the full dataset to inject facial priors.
Dense 2D annotations of facial landmarks to enhance expression detail and generalization.
Utilizing same‑person video clips to disentangle shape, pose, and expression, further boosting model robustness.
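The training signals above can be sketched as a toy composite loss: a photometric term from the (differentiably) rendered face versus the input image, a dense 2D landmark term, and a same‑person consistency term on per‑frame shape codes. The term weights, function names, and array shapes are illustrative assumptions, not the actual training objective.

```python
import numpy as np

def photometric_loss(rendered: np.ndarray, image: np.ndarray,
                     mask: np.ndarray) -> float:
    """Mean absolute color error inside the rendered face region."""
    diff = np.abs(rendered - image)[mask]
    return float(diff.mean()) if diff.size else 0.0

def landmark_loss(pred_lms: np.ndarray, gt_lms: np.ndarray) -> float:
    """Mean L2 distance over dense 2D facial landmarks."""
    return float(np.linalg.norm(pred_lms - gt_lms, axis=1).mean())

def shape_consistency_loss(shape_codes: np.ndarray) -> float:
    """Frames of the same person share one identity: penalize
    per-frame shape codes that deviate from their mean."""
    return float(((shape_codes - shape_codes.mean(0)) ** 2).mean())

def total_loss(rendered, image, mask, pred_lms, gt_lms, shape_codes,
               w_photo=1.0, w_lm=0.1, w_shape=0.01) -> float:
    # Weights are placeholder values for illustration only.
    return (w_photo * photometric_loss(rendered, image, mask)
            + w_lm * landmark_loss(pred_lms, gt_lms)
            + w_shape * shape_consistency_loss(shape_codes))
```

In an actual self‑supervised setup these terms would be computed on tensors inside a differentiable renderer so gradients flow back to the face model's parameters; the NumPy version only shows how the terms combine.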
3D Body Reconstruction
Live half‑body streaming poses challenges such as frequent limb entry/exit, occlusion, motion blur, and complex backgrounds. The team tackled these from data, model, and adaptive post‑processing perspectives.
3D Body Core Dataset Construction
A semi‑automatic data‑collection system was built, yielding tens of millions of live‑scene samples, providing a solid foundation for stable motion capture.
Model Design & Optimization
To run on consumer‑grade PCs, the model undergoes distillation, pruning, and quantization. On an Nvidia RTX 2060, the full body‑capture pipeline completes in about 3 ms.
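As one concrete example of the compression steps mentioned, here is a minimal sketch of symmetric post‑training int8 quantization. It shows only the arithmetic; a real deployment would use a toolkit such as TensorRT with calibration, and this is not Bilibili’s actual pipeline.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 with a single symmetric scale,
    shrinking storage 4x and enabling integer matrix kernels."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is at most scale/2."""
    return q.astype(np.float32) * scale
```

Distillation and pruning would be applied before this step, so the quantized network is already smaller and sparser than the teacher.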
Adaptive Post‑Processing
Typical hand‑wave gestures cause occlusion and blur. Instead of low‑pass filtering (which introduces latency), a custom spatio‑temporal module detects anomalies, performs stable denoising, and preserves high‑frequency motion, delivering responsive and natural feedback.
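The trade‑off the custom module targets, suppressing jitter without adding latency on fast motion, is the same one addressed by the well‑known One Euro filter, sketched below for illustration (this is not Bilibili’s actual spatio‑temporal module, and the parameter values are placeholder defaults).

```python
import math

class OneEuroFilter:
    """Speed-adaptive low-pass filter: heavy smoothing when the signal
    moves slowly (suppresses jitter from blur/occlusion), light
    smoothing when it moves fast (preserves high-frequency motion with
    little latency)."""

    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq            # sampling rate (Hz)
        self.min_cutoff = min_cutoff
        self.beta = beta            # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff    # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff: float, freq: float) -> float:
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x: float) -> float:
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Smoothed derivative drives an adaptive cutoff frequency.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

A fixed low‑pass filter would lag behind a fast hand wave; here the cutoff rises with estimated speed, so quick gestures pass through almost unfiltered while slow, jittery segments are strongly smoothed.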
Hand 3D Pose Estimation
Challenges include viewpoint variation, occlusion, two‑hand interaction, data scarcity, and computational efficiency. The proposed solution combines a parametric hand model with a 3D heat‑map + offset regression branch; an L2 consistency loss between the two branches combines the parametric model’s morphological constraints with the regression branch’s flexibility. Training uses ~500k frames with 3D labels and ~200k frames with 2D keypoints. The final system reaches state‑of‑the‑art accuracy and runs in under 3 ms on an RTX 2060.
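The dual‑branch consistency idea can be illustrated with a toy example: one branch produces joints from a low‑dimensional parametric model (a simple linear model stands in here for something like MANO), the other regresses joints directly, and an L2 term pulls the two joint sets together. All shapes and the toy model are assumptions for illustration.

```python
import numpy as np

N_JOINTS = 21  # standard hand-keypoint count

def parametric_joints(pose_params: np.ndarray,
                      basis: np.ndarray,
                      mean_joints: np.ndarray) -> np.ndarray:
    """Toy linear parametric model: joints = mean + basis @ params.
    The low-dimensional params keep the hand morphologically valid."""
    return mean_joints + (basis @ pose_params).reshape(N_JOINTS, 3)

def consistency_loss(param_joints: np.ndarray,
                     regressed_joints: np.ndarray) -> float:
    """Mean per-joint L2 distance between the parametric branch and
    the heat-map + offset regression branch."""
    return float(np.linalg.norm(param_joints - regressed_joints,
                                axis=1).mean())
```

During training this term is minimized alongside each branch’s own supervision, so the free‑form regression inherits the parametric model’s plausibility constraints while retaining its flexibility.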
Future Directions
The next steps involve deeper integration of AI with virtual content creation, further lowering production thresholds and empowering more creators.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.