Advances in Alibaba's Digital Human (XiaoMi) Technology: Development, Construction, and Interaction
This article reviews Alibaba's XiaoMi digital human technology: its evolution since 2019, a six-stage pipeline for building avatars, methods that enhance emotional, textual, vocal, and motion expressiveness, and approaches to long-term interaction such as controllable script generation, multimodal QA, sign-language translation, and intelligent behavior decision-making. It closes with the release of the MMTK multimodal algorithm library.
The presentation opens with the rapid growth of digital humans in the metaverse era, highlighting the commercial potential of virtual avatars and tracing Alibaba's XiaoMi projects over the past two to three years.
1. Development History – Starting in 2019, XiaoMi explored large‑screen digital human applications, creating the first avatar for service halls and subway stations, and subsequently expanding to virtual hosts, assistants, public‑interest avatars, and cloud‑based solutions.
2. Building a Digital Human from Scratch – The process consists of six parts: birth (modeling, IP management, rendering), body control (voice, lip sync, facial and limb motion), environment perception (multimodal understanding), autonomous consciousness (personalized decision‑making), deployment (various scenarios like virtual hosts and sign‑language translation), and integration (continuous technology iteration).
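The six parts above form a construction pipeline. As a rough illustration only, the sketch below wires the stages named in the article into a simple sequential build; the classes, artifact names, and orchestration are hypothetical, not XiaoMi's actual implementation.

```python
# Hypothetical sketch of the six-part construction pipeline; stage names follow
# the article, everything else (classes, artifact keys) is illustrative only.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class AvatarBuild:
    """Collects the artifacts each stage produces."""
    artifacts: Dict[str, Any] = field(default_factory=dict)

def birth(b: AvatarBuild) -> None:
    b.artifacts["model"] = "mesh, textures, IP assets, renderer"   # modeling, IP management, rendering

def body_control(b: AvatarBuild) -> None:
    b.artifacts["drivers"] = ["tts", "lip_sync", "face", "limbs"]  # voice, lip sync, facial/limb motion

def environment_perception(b: AvatarBuild) -> None:
    b.artifacts["perception"] = ["asr", "vision", "nlu"]           # multimodal understanding

def autonomous_consciousness(b: AvatarBuild) -> None:
    b.artifacts["policy"] = "personalized decision module"        # personalized decision-making

def deployment(b: AvatarBuild) -> None:
    b.artifacts["scenario"] = "virtual host"                       # e.g. virtual host, sign-language translation

def integration(b: AvatarBuild) -> None:
    b.artifacts["iteration"] = "continuous technology iteration"

PIPELINE: List[Callable[[AvatarBuild], None]] = [
    birth, body_control, environment_perception,
    autonomous_consciousness, deployment, integration,
]

def build_avatar() -> AvatarBuild:
    build = AvatarBuild()
    for stage in PIPELINE:  # run the six stages in order
        stage(build)
    return build
```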
3. Enhancing Expressiveness – To achieve richer emotional expression, the team proposes (a) personalized emotion analysis combining intent and continuous emotion parameters, (b) text style transfer using information extraction and Data2Text for fine‑grained control, (c) style‑controlled TTS with pitch, energy, duration, speaker embedding, and emotion labels, and (d) Text2Action for generating and synchronizing realistic motions.
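As a concrete illustration of (c), the sketch below shows one common way to condition an acoustic model on pitch, energy, duration, a speaker embedding, and an emotion label, in the spirit of FastSpeech2-style variance adaptors. The module, dimensions, and fusion scheme are assumptions for illustration, not the team's published architecture.

```python
# Minimal PyTorch sketch: fuse prosody features (pitch, energy, duration),
# a speaker embedding, and an emotion label into one conditioning signal
# for the acoustic decoder. Names and sizes are assumptions.
import torch
import torch.nn as nn

class StyleConditioner(nn.Module):
    def __init__(self, hidden: int = 256, n_speakers: int = 100, n_emotions: int = 8):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, hidden)
        self.emotion_emb = nn.Embedding(n_emotions, hidden)
        # Pitch, energy, and duration are scalar per-frame features projected to the hidden size.
        self.prosody_proj = nn.Linear(3, hidden)

    def forward(self, text_hidden, pitch, energy, duration, speaker_id, emotion_id):
        # text_hidden: (batch, frames, hidden); pitch/energy/duration: (batch, frames)
        prosody = self.prosody_proj(torch.stack([pitch, energy, duration], dim=-1))
        style = self.speaker_emb(speaker_id) + self.emotion_emb(emotion_id)  # (batch, hidden)
        return text_hidden + prosody + style.unsqueeze(1)  # conditioned encoder output
```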
4. Improving Interaction Capability – Long-term interaction is addressed through (a) controllable live-script generation in five stages (material acquisition, ordering, content linking, fluency, style rewriting), (b) multimodal QA that integrates text, image, video, audio, and motion, (c) bidirectional sign-language translation (gesture recognition, natural sign language ↔ spoken language translation, gesture synthesis), and (d) intelligent behavior decision-making that uses behavior trees and reinforcement learning to produce diverse, context-aware actions.
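For (d), the article only states that behavior trees combined with reinforcement learning drive context-aware behavior. The minimal sketch below shows the behavior-tree half of that idea; the node types and the example tree are hypothetical, and a learned policy could replace the hand-written conditions.

```python
# Minimal behavior-tree sketch for context-aware action selection.
# Node types and the example tree are illustrative assumptions.
from typing import Callable, List

Blackboard = dict  # shared context, e.g. detected user emotion or dialogue state

class Node:
    def tick(self, bb: Blackboard) -> bool:
        raise NotImplementedError

class Condition(Node):
    def __init__(self, predicate: Callable[[Blackboard], bool]):
        self.predicate = predicate
    def tick(self, bb):
        return self.predicate(bb)

class Action(Node):
    def __init__(self, name: str):
        self.name = name
    def tick(self, bb):
        bb.setdefault("actions", []).append(self.name)  # e.g. trigger a gesture or spoken line
        return True

class Sequence(Node):
    """Succeeds only if every child succeeds, in order (short-circuits on failure)."""
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self, bb):
        return all(child.tick(bb) for child in self.children)

class Selector(Node):
    """Tries children in order and succeeds on the first that succeeds."""
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self, bb):
        return any(child.tick(bb) for child in self.children)

# Example: greet enthusiastically if the user looks happy, otherwise fall back to a neutral greeting.
tree = Selector([
    Sequence([Condition(lambda bb: bb.get("user_emotion") == "happy"), Action("wave_and_smile")]),
    Action("neutral_greeting"),
])
tree.tick({"user_emotion": "happy"})
```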
5. Multimodal Algorithm Library – MMTK – The MMTK library provides plug-and-play models for more than ten Alibaba digital-human projects; it features a layered, extensible architecture and has produced nearly ten top-conference papers.
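The "plug-and-play" claim suggests a registry-style component system. Below is a minimal sketch of such a registry, assuming a decorator-based API; it is not MMTK's actual interface, only an illustration of how projects might swap models via configuration.

```python
# Hypothetical plug-and-play model registry; the API is an assumption, not MMTK's.
from typing import Callable, Dict, Type

_MODEL_REGISTRY: Dict[str, Type] = {}

def register_model(name: str) -> Callable[[Type], Type]:
    """Decorator that makes a model class discoverable by name."""
    def wrapper(cls: Type) -> Type:
        _MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

def build_model(name: str, **kwargs):
    """Instantiate a registered model, so projects can swap components by config."""
    return _MODEL_REGISTRY[name](**kwargs)

@register_model("text2action")
class Text2Action:
    def __init__(self, checkpoint: str = "default"):
        self.checkpoint = checkpoint

@register_model("style_tts")
class StyleTTS:
    def __init__(self, emotion: str = "neutral"):
        self.emotion = emotion

# A downstream digital-human project picks components from its config:
model = build_model("style_tts", emotion="cheerful")
```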
Conclusion – The article recaps the commercial potential, construction pipeline, expressiveness enhancements, interaction improvements, and the MMTK library, and looks forward to more human‑like emotion expression and smarter interactive abilities in future digital‑human applications.