Design and Architecture of AI Digital Human Live Streaming System
The article presents a cloud‑native architecture for AI‑driven digital‑human live streaming: a three‑layer design of asset, interaction, and media modules; real‑time script and Q&A scheduling; and fault‑tolerant rendering and control services. It shows how virtual anchors can deliver continuous, lifelike 24/7 e‑commerce streams.
This article introduces the rapid development of AI digital‑human technology and its integration with highly real‑time interactive live‑streaming scenarios. By combining digital‑human rendering, full‑stack AI capabilities, and large compute resources, virtual anchors achieve lifelike facial expressions, body gestures, dialogue, and emotional feedback, enabling 24/7 e‑commerce live streaming with a continuity that human anchors cannot sustain.
Background: Since early 2022, digital humans have become one of the hottest AI tracks in China; IDC predicts the market will reach 102.4 billion CNY by 2026. Various types of digital humans (entertainment‑oriented, enterprise‑level service, sign‑language interpreters, etc.) have demonstrated commercial value, especially in real‑time video live streaming.
Business Scenario: Baidu’s e‑commerce live‑streaming platform integrates digital‑human technology to provide continuous, controllable virtual anchors, reducing operational costs and offering unique value such as uninterrupted 7×24 streaming.
Overall Architecture: The system consists of three layers: Digital‑Human Assets, Live Interaction, and Media Control. The asset layer covers avatar generation, voice synthesis (multi‑voice TTS), and motion control. The interaction layer supports real‑time audio and video, script‑driven playbooks, AI‑driven Q&A, and handover to human operators. The media‑control layer handles encoding, stream mixing, subtitles, and CDN distribution.
Key Components:
Rendering Engine (cloud‑rendered via DKE/Kubernetes, supporting UE4, Unity3D, custom engines).
Digital‑Human Driving Service – translates script and interaction tasks into DRML commands for the renderer.
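The DRML command schema is internal to Baidu and not described in the article. Purely as an illustration, the driving service's translation step might look like the following sketch, where every field name is a hypothetical stand‑in:

```python
import json

def script_task_to_drml(task: dict) -> str:
    """Translate a script task into a DRML-style renderer command.

    The real DRML format is not public; "type", "text", "emotion",
    and "gesture" are hypothetical field names for illustration.
    """
    command = {
        "type": task.get("kind", "speak"),          # e.g. speak / gesture / scene_switch
        "text": task.get("text", ""),               # TTS input for the digital human
        "emotion": task.get("emotion", "neutral"),  # drives facial expression
        "gesture": task.get("gesture"),             # optional body-motion id
    }
    return json.dumps(command, ensure_ascii=False)

cmd = script_task_to_drml({"kind": "speak", "text": "Welcome to the live room!"})
```

The key idea is the separation of concerns: the driving service owns the mapping from business‑level tasks to renderer commands, so the script system never needs to know rendering details.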
Script System – manages material libraries (products, promotions, scenes) and composes scripts.
Queue Management – separate FIFO queues for scripted tasks and real‑time interactions.
Worker/Apiserver – provides HTTP APIs, performs leader election via Redis, and schedules live rooms.
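The article does not give the leader‑election details beyond "via Redis"; a common pattern for this is an atomic SET with NX (only if absent) and a TTL. The sketch below uses an in‑memory stand‑in for Redis so it is self‑contained; the key name and TTL are assumptions:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for Redis SET NX PX, for illustration only."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, px=None):
        now = time.monotonic()
        entry = self._store.get(key)
        if nx and entry is not None and entry[1] > now:
            return None  # key is held and not yet expired: acquisition fails
        expires = now + px / 1000.0 if px else float("inf")
        self._store[key] = (value, expires)
        return True

def try_acquire_leadership(redis, node_id, ttl_ms=5000):
    """Attempt to become leader: SET live:leader <node_id> NX PX <ttl>.

    The lock auto-expires after ttl_ms, so a crashed leader's lock
    is eventually released and another node can take over.
    """
    return redis.set("live:leader", node_id, nx=True, px=ttl_ms) is True

r = FakeRedis()
first = try_acquire_leadership(r, "node-a")   # acquires the lock
second = try_acquire_leadership(r, "node-b")  # fails while node-a holds it
```

In a real deployment the leader must also renew the key before the TTL elapses; that heartbeat loop is omitted here.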
Queue Scheduling: The master node consumes the script queue, sends DRML commands over a persistent WebSocket session, and updates execution offsets. In parallel, an interaction scheduler polls for real‑time events every 500 ms, inserts interruptible actions into the stream, and keeps them synchronized with the script.
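The interleaving described above can be sketched as a loop that drains real‑time events before advancing the script, recording an offset after each scripted task. This is a simplified sketch: the 500 ms polling, the WebSocket transport, and interruption mid‑task are all omitted, and the queue contents are invented examples:

```python
from collections import deque

def run_schedule(script_queue: deque, interaction_queue: deque):
    """Interleave scripted tasks with real-time interaction events.

    Interactions preempt the script at task boundaries, and an
    offset is recorded after each scripted task so a new leader
    could resume mid-script.
    """
    executed, script_offset = [], 0
    while script_queue or interaction_queue:
        # Real-time events (e.g. viewer Q&A) take priority.
        while interaction_queue:
            executed.append(("interaction", interaction_queue.popleft()))
        if script_queue:
            executed.append(("script", script_queue.popleft()))
            script_offset += 1  # persisted after each task in the real system
    return executed, script_offset

plan = deque(["intro", "demo product", "promo"])
events = deque(["Q&A: price?"])
timeline, offset = run_schedule(plan, events)
```

Two separate FIFO queues keep the invariant simple: scripted content always progresses in order, while interactions slot in between tasks without reordering the script.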
Live‑Room Control: Live rooms can be started and stopped automatically or manually. Automatic start triggers at the scheduled time; manual start is invoked via API. The system monitors room status, manages long‑lived connections, and gracefully shuts down a stream once all queues are exhausted.
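The room lifecycle reduces to a small state machine. The following sketch captures the behaviour stated above (timed automatic start, manual start, shutdown only after queues drain); the class and method names are assumptions, not the platform's API:

```python
class LiveRoom:
    """Minimal live-room lifecycle sketch: idle -> live -> stopped."""

    def __init__(self, start_at):
        self.start_at = start_at
        self.state = "idle"
        self.script_queue = []
        self.interaction_queue = []

    def tick(self, now):
        """Automatic start fires once the scheduled time is reached."""
        if self.state == "idle" and now >= self.start_at:
            self.state = "live"

    def start_manually(self):
        """Manual start, invoked via API, bypasses the schedule."""
        if self.state == "idle":
            self.state = "live"

    def try_shutdown(self):
        """Graceful stop succeeds only after both queues are drained."""
        if self.state == "live" and not self.script_queue and not self.interaction_queue:
            self.state = "stopped"
        return self.state

room = LiveRoom(start_at=10)
room.tick(now=9)     # before the scheduled time: stays idle
room.tick(now=10)    # scheduled time reached: goes live
room.try_shutdown()  # no pending tasks: stops gracefully
```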
Fault Tolerance: The design includes multi‑instance rendering‑engine deployment, automatic WebSocket reconnection, leader‑election recovery, and state persistence for script and interaction offsets. On node failure, a new leader resumes from the last recorded offsets and re‑executes unfinished DRML commands.
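The recovery path follows from the persisted offsets: a new leader reads the last checkpoint and replays only the tasks not yet confirmed executed. A minimal sketch, assuming a simple `{"script_offset": n}` checkpoint format (the real format is not described in the article):

```python
def resume_from_checkpoint(script, checkpoint):
    """Resume script execution from the last persisted offset.

    Returns the tasks not yet confirmed executed. Because the last
    in-flight DRML command may be re-executed after failover,
    commands must be safe to repeat (idempotent).
    """
    offset = checkpoint.get("script_offset", 0)
    return script[offset:]

script = ["intro", "demo", "promo", "closing"]
remaining = resume_from_checkpoint(script, {"script_offset": 2})
```

Persisting the offset after, rather than before, each command means a crash can cause at‑least‑once execution of the last command, which is why re‑execution of unfinished DRML commands must be tolerated by the renderer.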
Conclusion: The article details a cloud‑native AI digital‑human solution for live streaming, highlighting its modular architecture, real‑time interaction capabilities, and robust disaster‑recovery mechanisms. Future work aims to improve realism, reduce latency, and expand to broader scenarios.
Baidu Geek Talk