AI Audio Generation and Voice Synthesis Practices at Taobao
The article surveys Taobao’s AI‑generated audio pipeline. It introduces a series of eight technical papers covering image‑to‑video generation, OpenAI o1, multimodal video, and large‑model voice synthesis, then highlights advances such as VALL‑E, CosyVoice, and F5‑TTS, data‑cleaning methods, and e‑commerce applications including voice‑cloned live streams, multilingual TTS, AI video‑audio integration, and audiobook production.
This article presents a comprehensive overview of AIGC (AI‑generated content) technologies applied to audio and voice synthesis within Taobao’s ecosystem. It begins by introducing a series of eight technical papers covering topics such as image‑to‑video generation, the OpenAI o1 model, multimodal video generation, and large‑model voice synthesis.
The background section highlights the resurgence of AI audio generation during the 2024 National Day period, noting the popularity of AI singers and the two techniques behind it: voice conversion and text‑to‑speech (TTS). Voice conversion is described as changing the timbre of a singing voice without altering its rhythm, while TTS (also called speech synthesis) generates speech from user‑provided text in a specified voice style.
Key technical advances are discussed, including the VALL‑E model, which pairs a language‑model‑based decoder with a residual‑vector‑quantization (RVQ) codec tokenizer to enable voice cloning from a 3‑second reference recording. The article also reviews CosyVoice, an open‑source large‑scale TTS model that uses a 3D‑Speaker embedding and Flow Matching for high‑quality synthesis, and newer models such as F5‑TTS that improve speed and pronunciation.
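To make the RVQ tokenizer idea concrete, here is a minimal sketch in plain Python. It is not VALL‑E's actual codec: the codebooks are random rather than learned, every function name is hypothetical, and each codebook includes an all‑zero entry (an illustrative assumption, so that adding a stage can never increase the reconstruction error). It shows only the core mechanism: each stage quantizes the residual left by the previous stage, so one audio frame becomes a short list of integer tokens that a language model can predict.

```python
import random


def make_codebooks(num_stages, size, dim, seed=0):
    """Build random codebooks (assumption: real codecs learn these from data).

    Entry 0 of every codebook is the zero vector, and later stages use
    smaller-scale entries because they quantize smaller residuals.
    """
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        entries = [[0.0] * dim]
        for _ in range(size - 1):
            entries.append([rng.uniform(-1.0, 1.0) * scale for _ in range(dim)])
        codebooks.append(entries)
    return codebooks


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(codebooks, tokens):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, index in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


if __name__ == "__main__":
    books = make_codebooks(num_stages=4, size=16, dim=8)
    frame = [random.Random(1).uniform(-1.0, 1.0) for _ in range(8)]
    tokens = rvq_encode(books, frame)
    print("tokens:", tokens)
    print("reconstruction:", rvq_decode(books, tokens))
```

Decoding with only the first few tokens gives a coarser reconstruction, which is why RVQ‑based systems can treat the first codebook as the coarse content and later ones as refinements.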
Data challenges are examined: high‑quality single‑speaker recordings are scarce, and large‑scale datasets often contain background noise, multi‑speaker dialogue, or mixed‑language content, all of which can degrade model performance. The authors describe an automated cleaning pipeline that removes background music, denoises audio, and validates transcripts using cross‑ASR verification.
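The transcript‑validation step can be sketched as follows. This is an assumption about how cross‑ASR verification might work, not the authors' implementation: the summary does not name the ASR systems, the metric, or the threshold, so the sketch takes already‑computed ASR outputs as plain strings, compares them with a character error rate (CER), and uses a hypothetical 5% cutoff. A sample is kept only when every ASR output agrees with the dataset transcript.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (item_a != item_b),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def normalize(text):
    """Lowercase and keep only letters and digits, so punctuation and spacing
    differences between ASR systems do not count as errors. str.isalnum() is
    also true for Chinese characters, so mixed-language text is handled."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def cer(reference, hypothesis):
    """Character error rate of hypothesis measured against reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_sample(transcript, asr_outputs, max_cer=0.05):
    """Keep a sample only if every ASR output matches the transcript.

    max_cer is a hypothetical threshold chosen for illustration.
    """
    return all(cer(transcript, hyp) <= max_cer for hyp in asr_outputs)


if __name__ == "__main__":
    label = "Free shipping today"
    print(keep_sample(label, ["free shipping today.", "Free shipping today"]))
    print(keep_sample(label, ["free shipping today", "three chips in the bay"]))
```

Requiring agreement from more than one recognizer guards against the case where a single ASR system and the transcript share the same mistake.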
Practical e‑commerce applications are showcased, including: (1) voice‑cloning for marketing live streams, (2) multilingual (Chinese‑English) TTS for product descriptions, (3) AI‑generated video paired with AI‑generated sound effects and background music, and (4) integration of AI audio into audiobooks and short‑video content to increase user engagement.
The article concludes with a brief team introduction and links to additional reading on related technologies such as 3DXR, terminal tech, and data algorithms.
DaTaobao Tech
Official account of DaTaobao Technology