
Metaverse-Based Virtual Humans: Technologies and Applications in Intelligent Q&A

This article explores the concept of the metaverse and virtual humans, detailing 3D modeling techniques, NLP-driven language understanding, streaming TTS, VR/AR interaction, AIGC content generation, and the deployment of a large‑model intelligent Q&A system with real‑time facial expression synthesis for virtual anchors.


Facebook’s rebranding to Meta popularized the metaverse, a virtual world parallel to reality where users interact in real time through realistic avatars and environments.

Virtual humans are digitally synthesized personas, ranging from early examples like Hatsune Miku to modern AI‑driven avatars such as the CCTV virtual anchor Xiao C, Tsinghua’s virtual student Hua Zhibing, and Alibaba’s digital employee AYAYI.

3D Modeling: Virtual human models are created either by artists or through 3D reconstruction. Traditional capture methods include multi‑view vision, infrared depth sensing (e.g., Microsoft Kinect), and laser scanning, each with trade‑offs in lighting sensitivity, cost, and precision. Recent advances leverage Neural Radiance Fields (NeRF), which train a deep network on posed images so that novel, previously unseen viewpoints can be rendered.
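To make the NeRF idea concrete, here is a minimal sketch in PyTorch: an MLP maps a positionally encoded 3D point to a density and an RGB colour, and a pixel is rendered by integrating those predictions along a camera ray. The network size, encoding frequencies, and sampling bounds are illustrative assumptions, not the configuration of any production reconstruction pipeline.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    """Map coordinates to sin/cos features at increasing frequencies."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * x))
        feats.append(torch.cos((2.0 ** i) * x))
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * num_freqs)       # encoded (x, y, z)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),               # (r, g, b, sigma)
        )

    def forward(self, points):
        out = self.mlp(positional_encoding(points))
        rgb = torch.sigmoid(out[..., :3])       # colour in [0, 1]
        sigma = torch.relu(out[..., 3:])        # non-negative density
        return rgb, sigma

def render_ray(model, origin, direction, near=0.0, far=4.0, n_samples=64):
    """Volume rendering along one ray: accumulate colour weighted by transmittance."""
    t = torch.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction    # sample points along the ray
    rgb, sigma = model(points)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)  # composited pixel colour
```

Training fits the MLP so that rendered rays match the input photographs; once trained, the same rendering loop produces views from camera poses never captured.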

NLP: Large language models like ChatGPT provide the language understanding and dialogue capabilities needed for virtual humans to converse naturally with users.
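A minimal sketch of wiring a language model into the avatar's dialogue loop is shown below, using the Hugging Face transformers library. The checkpoint name and prompt format are placeholders (assumptions); in practice any dialogue‑tuned causal LM would slot into the same pattern.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a real deployment would use a dialogue-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def reply(history, user_utterance, max_new_tokens=64):
    """Concatenate the dialogue history with the new user turn and generate a reply."""
    prompt = "\n".join(history + [f"User: {user_utterance}", "Assistant:"])
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens after the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

history = []
question = "What does this virtual anchor help with?"
answer = reply(history, question)
history += [f"User: {question}", f"Assistant: {answer}"]
```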

TTS: Streaming text‑to‑speech gives virtual humans a spoken voice; because audio is synthesized and played back chunk by chunk, the avatar can start speaking before the whole utterance has been processed, yielding much lower latency than non‑streaming approaches.
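The sketch below illustrates the streaming idea only: text is split into small chunks and each chunk's audio is pushed to the player as soon as it is ready, so playback overlaps with the synthesis of later chunks. `synthesize_chunk` and `audio_player` are hypothetical placeholders standing in for a real TTS engine and audio sink.

```python
import re
from typing import Iterator

def split_into_chunks(text: str) -> Iterator[str]:
    """Split text on sentence- and clause-ending punctuation so each chunk stays short."""
    for chunk in re.split(r"(?<=[.!?;,])\s*", text):
        if chunk:
            yield chunk

def stream_tts(text: str, synthesize_chunk, audio_player) -> None:
    """Synthesize chunk by chunk and play each piece as soon as it arrives."""
    for chunk in split_into_chunks(text):
        waveform = synthesize_chunk(chunk)   # placeholder call into a TTS engine
        audio_player.play(waveform)          # playback overlaps with later synthesis
```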

VR/AR: Immersive interaction is enhanced through head‑mounted VR devices and AR glasses, enabling 3D experiences beyond traditional screens.

AIGC: AI‑generated content (text, images, diffusion‑based visuals) powers the avatar’s speech, facial expressions, and gestures during interactions.
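As one concrete example of the diffusion‑based visual side, the snippet below generates a concept image for an avatar with the diffusers library. The checkpoint, prompt, and use of a CUDA GPU are illustrative assumptions; production pipelines may add control signals (pose, expression) and post‑processing.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint and hardware for illustration only.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "portrait of a friendly virtual anchor, studio lighting, 3D render"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("virtual_anchor_concept.png")
```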

Virtual Anchor in Intelligent Q&A: In August 2022, AutoHome introduced the virtual digital person “Gong Jiuyu” as an AI experience officer, integrating a large‑model Q&A system that answers user queries quickly, improving engagement and retention.

The Q&A system combines a 6‑billion‑parameter model fine‑tuned on domain data with streaming output (≈30 ms to the first token on a V100S GPU, ~25 tokens/s). For facial expression synthesis, the team adopted Wav2Lip, which pairs a SyncNet lip‑sync discriminator with a LipGAN‑style generator; inference takes about 10 ms per frame, sustaining 25 FPS overall and allowing multiple concurrent users per GPU.
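A minimal sketch of the streaming‑output pattern is shown below, using the transformers TextIteratorStreamer: generation runs in a background thread and tokens are consumed as they are produced, so downstream TTS and lip‑sync can begin after the first token rather than waiting for the full answer. The checkpoint name and the example question are placeholders, not the 6‑billion‑parameter production model.

```python
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "gpt2"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def stream_answer(question: str, max_new_tokens: int = 128):
    """Yield generated text pieces as soon as they are produced."""
    inputs = tokenizer(question, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, max_new_tokens=max_new_tokens, streamer=streamer)
    # generate() blocks, so it runs in a worker thread while we consume tokens here.
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for text_piece in streamer:
        yield text_piece  # each piece can be fed to streaming TTS immediately

for piece in stream_answer("Which trim levels support adaptive cruise control?"):
    print(piece, end="", flush=True)
```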

Author bios: Chen Xin (Business Intelligence – Intelligent Vehicle Team, focuses on image detection, AR/VR) and Wang Pengkai (Business Intelligence – Intelligent Vehicle Team, focuses on search Q&A systems and model optimization).

Tags: Artificial Intelligence, 3D Modeling, AIGC, TTS, NLP, Metaverse, Virtual Human, VR/AR