Real‑time 3D Virtual Avatar Motion Capture Using a Single Monocular Camera
Using AI‑driven 3D reconstruction from a single webcam, Bilibili’s new system captures facial expressions, full‑body pose, and hand gestures in real time and renders photorealistic avatars on consumer‑grade PCs with per‑frame latency of only a few milliseconds, dramatically reducing the cost and time required for virtual‑live‑streaming production.
With the rapid rise of virtual live streaming on platforms such as Bilibili, interest in 3D photorealistic virtual avatars has surged. 3D hyper‑realistic virtual hosts on Douyin have attracted millions of fans within a week, indicating a market trend toward immersive virtual broadcasting.
The traditional production pipeline for 3D virtual avatars is costly (hundreds of thousands to millions of RMB) and time‑consuming (3‑6 months). It typically spans modeling, rigging, motion driving, and rendering, and relies on expensive optical motion‑capture rigs that can cost tens of thousands of RMB per broadcast.
To lower these barriers, Bilibili’s AI platform and Tiangong Studio jointly created a 3D realistic “Fantasy Star” digital‑human solution that enables ordinary users to create their own virtual idols; it is showcased among Bilibili’s “Technology Landing Cases.”
Users can sculpt faces, shape bodies, and change outfits within minutes, using only a standard monocular webcam to capture facial expressions, body movements, and hand gestures, which are then rendered in real time on a virtual stage.
Background and Overview
As VR/AR applications proliferate, demand for motion capture grows. Existing motion‑capture technologies fall into three categories:
Optical motion capture – multiple cameras track markers in a controlled studio.
Inertial motion capture – sensors and gyroscopes capture movement without a dedicated studio.
Hybrid motion capture – combines optical markers and inertial sensors.
All three require additional hardware and are expensive for casual creators. The proposed solution uses pure computer‑vision techniques, requiring only a single camera to achieve full‑body, facial, and hand capture.
Technical Highlights
The core of the visual motion‑capture system is 3D reconstruction from 2D images. The pipeline processes an input video stream through facial expression estimation, 3D body reconstruction, and hand‑pose prediction. After confidence‑based post‑processing, the results are fed to an inverse‑kinematics (IK) module that generates joint‑rotation quaternions, which finally drive the rendering engine to animate the avatar.
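The per‑frame flow described above can be sketched as a short Python loop. Every name here (`StageResult`, `confidence_gate`, `process_frame`) is a hypothetical placeholder, and the three estimators are stubs standing in for neural networks; this illustrates only the stage ordering and the confidence‑based post‑processing, not Bilibili’s actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StageResult:
    values: List[float]   # e.g. blendshape weights or 3D joint positions
    confidence: float     # per-stage confidence score

def confidence_gate(result: StageResult, prev: Optional[StageResult],
                    threshold: float = 0.5) -> StageResult:
    """Confidence-based post-processing: fall back to the previous
    frame's estimate when the current one is unreliable."""
    if result.confidence < threshold and prev is not None:
        return prev
    return result

def process_frame(frame, state: dict) -> dict:
    # Each estimator would be a neural network in the real system;
    # here they are stubs returning fixed values (frame is unused).
    face = StageResult(values=[0.1, 0.9], confidence=0.95)
    body = StageResult(values=[0.0, 1.2, 0.4], confidence=0.3)  # unreliable
    hands = StageResult(values=[0.2] * 5, confidence=0.8)

    face = confidence_gate(face, state.get("face"))
    body = confidence_gate(body, state.get("body"))
    hands = confidence_gate(hands, state.get("hands"))

    state.update(face=face, body=body, hands=hands)
    # An IK solver would next convert these estimates into joint-rotation
    # quaternions before handing them to the rendering engine.
    return state
```

In the real pipeline the gated outputs would feed the IK module; the sketch stops at the point where stable per‑stage estimates are available.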
Facial Expression Capture & 3D Face Reconstruction
High‑quality 3D facial data are traditionally captured with expensive equipment in lab settings, creating a domain gap with in‑the‑wild webcam footage. The team therefore adopted self‑supervised training based on differentiable rendering, collecting ~100k paired 3D point clouds and RGB images with an iPhone depth camera plus a million‑scale video dataset to improve generalization.
Key Methods
Pre‑training on natural‑scene data with 3D supervision, followed by adversarial training on the full dataset to inject facial priors.
Dense 2D annotations of facial landmarks to enhance expression detail and generalization.
Utilizing same‑person video clips to disentangle shape, pose, and expression, further boosting model robustness.
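The training signals above can be sketched as a toy composite loss: a photometric term from the (differentiably) rendered face versus the input image, a dense 2D landmark term, and a same‑person consistency term on per‑frame shape codes. The term weights, function names, and array shapes are illustrative assumptions, not the actual training objective.

```python
import numpy as np

def photometric_loss(rendered: np.ndarray, image: np.ndarray,
                     mask: np.ndarray) -> float:
    """Mean absolute color error inside the rendered face region."""
    diff = np.abs(rendered - image)[mask]
    return float(diff.mean()) if diff.size else 0.0

def landmark_loss(pred_lms: np.ndarray, gt_lms: np.ndarray) -> float:
    """Mean L2 distance over dense 2D facial landmarks."""
    return float(np.linalg.norm(pred_lms - gt_lms, axis=1).mean())

def shape_consistency_loss(shape_codes: np.ndarray) -> float:
    """Frames of the same person share one identity: penalize
    per-frame shape codes that deviate from their mean."""
    return float(((shape_codes - shape_codes.mean(0)) ** 2).mean())

def total_loss(rendered, image, mask, pred_lms, gt_lms, shape_codes,
               w_photo=1.0, w_lm=0.1, w_shape=0.01) -> float:
    # Weights are placeholder values for illustration only.
    return (w_photo * photometric_loss(rendered, image, mask)
            + w_lm * landmark_loss(pred_lms, gt_lms)
            + w_shape * shape_consistency_loss(shape_codes))
```

In an actual self‑supervised setup these terms would be computed on tensors inside a differentiable renderer so gradients flow back to the face model's parameters; the NumPy version only shows how the terms combine.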
3D Body Reconstruction
Live half‑body streaming poses challenges such as frequent limb entry/exit, occlusion, motion blur, and complex backgrounds. The team tackled these from data, model, and adaptive post‑processing perspectives.
3D Body Core Dataset Construction
A semi‑automatic data‑collection system was built, yielding tens of millions of live‑scene samples, providing a solid foundation for stable motion capture.
Model Design & Optimization
To run on consumer‑grade PCs, the model undergoes distillation, pruning, and quantization. On an Nvidia RTX 2060, the full body‑capture pipeline completes in about 3 ms.
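As one concrete example of the compression steps mentioned, here is a minimal sketch of symmetric post‑training int8 quantization. It shows only the arithmetic; a real deployment would use a toolkit such as TensorRT with calibration, and this is not Bilibili’s actual pipeline.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 with a single symmetric scale,
    shrinking storage 4x and enabling integer matrix kernels."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is at most scale/2."""
    return q.astype(np.float32) * scale
```

Distillation and pruning would be applied before this step, so the quantized network is already smaller and sparser than the teacher.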
Adaptive Post‑Processing
Typical hand‑wave gestures cause occlusion and blur. Instead of low‑pass filtering (which introduces latency), a custom spatio‑temporal module detects anomalies, performs stable denoising, and preserves high‑frequency motion, delivering responsive and natural feedback.
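The trade‑off the custom module targets, suppressing jitter without adding latency on fast motion, is the same one addressed by the well‑known One Euro filter, sketched below for illustration (this is not Bilibili’s actual spatio‑temporal module, and the parameter values are placeholder defaults).

```python
import math

class OneEuroFilter:
    """Speed-adaptive low-pass filter: heavy smoothing when the signal
    moves slowly (suppresses jitter from blur/occlusion), light
    smoothing when it moves fast (preserves high-frequency motion with
    little latency)."""

    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq            # sampling rate (Hz)
        self.min_cutoff = min_cutoff
        self.beta = beta            # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff    # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff: float, freq: float) -> float:
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x: float) -> float:
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Smoothed derivative drives an adaptive cutoff frequency.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

A fixed low‑pass filter would lag behind a fast hand wave; here the cutoff rises with estimated speed, so quick gestures pass through almost unfiltered while slow, jittery segments are strongly smoothed.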
Hand 3D Pose Estimation
Challenges include viewpoint variation, occlusion, two‑hand interaction, data scarcity, and computational efficiency. The proposed solution combines a parametric hand model with a 3D heat‑map + offset regression branch; an L2 consistency loss between the two branches combines the parametric model’s morphological constraints with the regression branch’s flexibility. Training uses ~500k frames with 3D labels and ~200k frames with 2D keypoints. The final system reaches state‑of‑the‑art accuracy and runs in under 3 ms on an RTX 2060.
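The dual‑branch consistency idea can be illustrated with a toy example: one branch produces joints from a low‑dimensional parametric model (a simple linear model stands in here for something like MANO), the other regresses joints directly, and an L2 term pulls the two joint sets together. All shapes and the toy model are assumptions for illustration.

```python
import numpy as np

N_JOINTS = 21  # standard hand-keypoint count

def parametric_joints(pose_params: np.ndarray,
                      basis: np.ndarray,
                      mean_joints: np.ndarray) -> np.ndarray:
    """Toy linear parametric model: joints = mean + basis @ params.
    The low-dimensional params keep the hand morphologically valid."""
    return mean_joints + (basis @ pose_params).reshape(N_JOINTS, 3)

def consistency_loss(param_joints: np.ndarray,
                     regressed_joints: np.ndarray) -> float:
    """Mean per-joint L2 distance between the parametric branch and
    the heat-map + offset regression branch."""
    return float(np.linalg.norm(param_joints - regressed_joints,
                                axis=1).mean())
```

During training this term is minimized alongside each branch’s own supervision, so the free‑form regression inherits the parametric model’s plausibility constraints while retaining its flexibility.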
Future Directions
The next steps involve deeper integration of AI with virtual content creation, further lowering production thresholds and empowering more creators.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.