
Advances in Alibaba's Digital Human Technology: Construction, Performance, Interaction, and the MMTK Multimodal Algorithm Library

This article reviews Alibaba's digital‑human (virtual avatar) research over the past few years, covering the product’s evolution, a six‑stage pipeline for building digital humans, solutions to key challenges in realism, multimodal interaction, and the open‑source MMTK algorithm library.

As society moves toward the metaverse, traditional text‑based dialogue can no longer satisfy user needs, and digital humans (virtual avatars) are emerging with substantial commercial potential. This article introduces Alibaba’s "XiaoMi" digital‑human product line and its technical evolution since 2019.

1. History of Alibaba Digital Human – Starting with large‑screen avatars for service halls and subways, the team has expanded to virtual hosts, assistants, public‑interest avatars, and cloud‑based solutions, delivering multiple products over three years.

2. Building a Digital Human from Scratch – The process is divided into six parts: birth (modeling and IP management), body control (speech, lip‑sync, facial and limb motion), environment perception (multimodal understanding), autonomous consciousness (personalized decision‑making), deployment (various scenarios), and continuous integration.
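To make the staging concrete, here is a minimal Python sketch of how the six parts might compose into one pipeline. The stage names follow the article; every class, function, and artifact value is a hypothetical stand‑in, not Alibaba's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AvatarContext:
    """State threaded through the six stages (illustrative only)."""
    text_input: str
    artifacts: dict = field(default_factory=dict)

Stage = Callable[[AvatarContext], AvatarContext]

def birth(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["model"] = "3d-asset + ip-config"      # modeling and IP management
    return ctx

def body_control(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["speech"] = f"tts({ctx.text_input})"   # speech, lip-sync, motion
    return ctx

def environment_perception(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["scene"] = "multimodal-understanding"
    return ctx

def autonomous_consciousness(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["plan"] = "personalized-decision"
    return ctx

def deployment(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["target"] = "large-screen / cloud"
    return ctx

def continuous_integration(ctx: AvatarContext) -> AvatarContext:
    ctx.artifacts["ci"] = "regression-checks"
    return ctx

PIPELINE: List[Stage] = [birth, body_control, environment_perception,
                         autonomous_consciousness, deployment,
                         continuous_integration]

def build_digital_human(text: str) -> AvatarContext:
    ctx = AvatarContext(text_input=text)
    for stage in PIPELINE:   # stages run in the order the article lists them
        ctx = stage(ctx)
    return ctx

print(build_digital_human("Welcome!").artifacts)
```

Modeling each stage as a plain function over a shared context keeps the stages independently replaceable, which is the property a pipeline like this needs.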

The team notes three major challenges: higher realism requirements, intelligent and diverse behavior, and the lack of a complete algorithmic solution.

3. Enhancing Digital‑Human Expressiveness

Personalized emotion analysis: combining detected user emotion with intent, representing emotion as continuous parameters, and applying per‑avatar personality scripts to produce varied behavior.
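One simple way to realize this (purely illustrative; the labels, weights, and blending rule below are assumptions, not the team's model) is to keep a continuous valence/arousal state and blend it with a personality baseline and an intent bias:

```python
from dataclasses import dataclass

@dataclass
class EmotionState:
    valence: float  # -1 (negative) .. +1 (positive)
    arousal: float  #  0 (calm)     ..  1 (excited)

# Baseline emotion per personality script (values are made up).
PERSONALITY = {"cheerful": EmotionState(0.6, 0.7),
               "calm":     EmotionState(0.2, 0.2)}

# How a recognized intent biases the avatar's response emotion.
INTENT_BIAS = {"complaint": (-0.4, 0.2), "greeting": (0.3, 0.1)}

def personalized_emotion(intent: str, detected: EmotionState,
                         persona: str, alpha: float = 0.5) -> EmotionState:
    """Blend detected user emotion with the avatar's personality baseline,
    bias by intent, and clamp to the valid range."""
    base = PERSONALITY[persona]
    dv, da = INTENT_BIAS.get(intent, (0.0, 0.0))
    clamp = lambda x, lo, hi: max(lo, min(hi, x))
    return EmotionState(
        clamp(alpha * detected.valence + (1 - alpha) * base.valence + dv, -1.0, 1.0),
        clamp(alpha * detected.arousal + (1 - alpha) * base.arousal + da, 0.0, 1.0),
    )

print(personalized_emotion("complaint", EmotionState(-0.8, 0.9), "cheerful"))
```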

Text style transfer (StyleTransfer): using information extraction + Data2Text to modify text style while preserving content.
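A toy sketch of the two‑step idea: extract structured facts first, then re‑realize them with a style‑specific Data2Text template. The regex, templates, and style names are invented for illustration:

```python
import re

def extract_facts(text: str) -> dict:
    """Toy information extraction: pull product and price from a sentence."""
    m = re.search(r"(?P<product>\w+) costs \$(?P<price>\d+)", text)
    return m.groupdict() if m else {}

# Style-specific Data2Text realizations of the same extracted content.
STYLE_TEMPLATES = {
    "formal": "The {product} is priced at ${price}.",
    "lively": "Only ${price} for this amazing {product}, grab it now!",
}

def restyle(text: str, style: str) -> str:
    facts = extract_facts(text)
    return STYLE_TEMPLATES[style].format(**facts) if facts else text

print(restyle("headphones costs $99", "lively"))
```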

Style‑controlled TTS: adding pitch, energy, duration, speaker embedding, and emotion labels for richer speech.
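The sketch below shows one plausible shape for such conditioning, concatenating per‑phoneme prosody scalars with speaker and emotion embeddings (in the spirit of variance‑adaptor TTS models such as FastSpeech 2); the function, dimensions, and values are assumptions, not the actual model interface:

```python
import numpy as np

def tts_condition(phoneme_emb: np.ndarray, pitch: float, energy: float,
                  duration: float, speaker_emb: np.ndarray,
                  emotion_emb: np.ndarray) -> np.ndarray:
    """Concatenate per-phoneme prosody scalars with speaker and emotion
    embeddings to form a single decoder conditioning vector."""
    prosody = np.array([pitch, energy, duration], dtype=np.float32)
    return np.concatenate([phoneme_emb, prosody, speaker_emb, emotion_emb])

cond = tts_condition(np.zeros(8, np.float32), pitch=220.0, energy=0.7,
                     duration=0.12, speaker_emb=np.ones(4, np.float32),
                     emotion_emb=np.full(4, 0.5, np.float32))
print(cond.shape)  # (19,)
```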

Text‑to‑Action: generating, stitching, and synchronizing actions with audio.
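As a rough illustration, a retrieval‑and‑stitch approach might pick a motion clip per keyword and time‑warp each clip to its forced‑aligned audio segment; the library, timings, and fields here are all hypothetical:

```python
# Hypothetical motion library: keyword -> (clip id, native clip length in s).
MOTION_LIBRARY = {"welcome": ("wave", 1.2),
                  "point":   ("point_left", 0.8),
                  "default": ("idle", 1.0)}

def text_to_action(segments):
    """segments: (word, audio_start, audio_end) triples from forced alignment."""
    timeline = []
    for word, start, end in segments:
        clip, native_len = MOTION_LIBRARY.get(word, MOTION_LIBRARY["default"])
        speed = native_len / (end - start)  # time-warp the clip to the audio
        timeline.append({"clip": clip, "start": start, "speed": round(speed, 2)})
    return timeline

print(text_to_action([("welcome", 0.0, 1.0), ("point", 1.0, 2.0)]))
```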

4. Boosting Interaction Capability

Controllable live‑script generation: a five‑stage pipeline (material acquisition, ordering, content linking, smoothing, style rewriting) using knowledge graphs and Data2Text.
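Sketched as code, with function names mirroring the five stages; the bodies are trivial stand‑ins for the knowledge‑graph and Data2Text components the article names:

```python
def acquire(product):        # material acquisition (knowledge-graph lookup)
    return [f"{product} fact {i}" for i in (1, 2, 3)]

def order(materials):        # ordering, e.g. by salience (alphabetical here)
    return sorted(materials)

def link(materials):         # content linking between materials
    return " Next, ".join(materials)

def smooth(script):          # smoothing for fluency
    return script.replace("  ", " ")

def rewrite(script, style):  # style rewriting (Data2Text stand-in)
    return f"[{style}] {script}"

def generate_live_script(product: str, style: str = "lively") -> str:
    return rewrite(smooth(link(order(acquire(product)))), style)

print(generate_live_script("sneakers"))
```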

Multimodal QA: aligning text, image, video, audio, and motion to answer user queries.
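A minimal retrieval‑style sketch, assuming separate encoders have already mapped every modality into one shared vector space; the index entries and vectors below are stand‑ins:

```python
import numpy as np

# (modality, payload) -> vector in the shared embedding space (stand-ins)
INDEX = {
    ("image", "product_photo.png"): np.array([0.9, 0.1, 0.0]),
    ("video", "demo_clip.mp4"):     np.array([0.1, 0.9, 0.1]),
    ("text",  "spec sheet"):        np.array([0.0, 0.2, 0.9]),
}

def answer(query_vec: np.ndarray):
    """Return the cross-modal item with the highest cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(INDEX.items(), key=lambda kv: cos(query_vec, kv[1]))[0]

# A query embedded near the image vector retrieves the image answer.
print(answer(np.array([0.85, 0.2, 0.05])))  # ('image', 'product_photo.png')
```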

Bidirectional sign‑language translation: gesture recognition, sign‑to‑text/text‑to‑sign translation, and gesture synthesis.
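A toy gloss‑based sketch of the round trip, with gesture recognition and synthesis reduced to dictionary lookups; real systems replace each table with a learned model, and all identifiers here are invented:

```python
# Each mapping is a stand-in for a learned model in the real system.
GESTURE_TO_GLOSS = {"g_hello": "HELLO", "g_thanks": "THANK-YOU"}   # recognition
GLOSS_TO_TEXT    = {"HELLO": "hello", "THANK-YOU": "thank you"}    # sign-to-text
TEXT_TO_GLOSS    = {v: k for k, v in GLOSS_TO_TEXT.items()}        # text-to-sign
GLOSS_TO_GESTURE = {v: k for k, v in GESTURE_TO_GLOSS.items()}     # synthesis

def sign_to_text(gestures):
    return " ".join(GLOSS_TO_TEXT[GESTURE_TO_GLOSS[g]] for g in gestures)

def text_to_sign(phrase):
    gloss = TEXT_TO_GLOSS[phrase]        # phrase-level lookup in this toy
    return [GLOSS_TO_GESTURE[gloss]]     # gesture ids would drive synthesis

print(sign_to_text(["g_hello"]))   # -> hello
print(text_to_sign("thank you"))   # -> ['g_thanks']
```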

Intelligent behavior decision: behavior trees plus reinforcement learning to give each avatar distinct decision logic.
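One common way to combine the two is illustrated below, with the learned policy stubbed out as a scorer that ranks a selector's children; the classes and state keys are hypothetical:

```python
import random

class Action:
    """Leaf node: runs its behavior when the condition on state holds."""
    def __init__(self, name, cond):
        self.name, self.cond = name, cond
    def tick(self, state):
        return self.name if self.cond(state) else None

class LearnedSelector:
    """Selector node whose child ordering comes from a (stubbed) learned policy."""
    def __init__(self, children, scorer):
        self.children, self.scorer = children, scorer
    def tick(self, state):
        ranked = sorted(self.children, key=lambda c: -self.scorer(state, c))
        for child in ranked:
            result = child.tick(state)
            if result is not None:
                return result
        return "idle"

tree = LearnedSelector(
    [Action("greet", lambda s: s["user_present"]),
     Action("promote", lambda s: s["on_air"])],
    scorer=lambda s, c: random.random(),  # stand-in for an RL value head
)
print(tree.tick({"user_present": True, "on_air": False}))  # -> greet
```

Keeping the tree structure hand‑authored while learning only the ranking is what lets each avatar share one tree yet exhibit distinct decision logic.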

5. MMTK – Multimodal Algorithm Library – An open‑source toolbox containing over ten Alibaba digital‑human models, designed with a layered, plug‑in architecture, and backed by nearly ten top‑conference papers.
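The article does not document MMTK's API, so the following is only a guess at what a layered, plug‑in model registry could look like; every name below is hypothetical:

```python
MODEL_REGISTRY = {}

def register(name):
    """Decorator that adds a model class to the registry under a name."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("style_tts")
class StyleTTS:
    def run(self, text):
        return f"audio<{text}>"

@register("text2action")
class Text2Action:
    def run(self, text):
        return f"motion<{text}>"

def build(name):
    """Upper layers ask for models by name; plug-ins register themselves."""
    return MODEL_REGISTRY[name]()

print(build("style_tts").run("hi"))  # -> audio<hi>
```

A registry like this lets new digital‑human models ship as plug‑ins without touching the core toolbox, which matches the layered design the article describes.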

Conclusion – The article recaps the commercial promise of digital humans, the end‑to‑end construction pipeline, expressive enhancements, interaction techniques, and the MMTK library, while outlining future work toward more lifelike emotion and smarter interaction.

Tags: multimodal AI, Digital Human, virtual avatar, Speech Synthesis, behavior decision, Emotion Modeling
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
