AI Audio Generation and Voice Synthesis Practices at Taobao

The article surveys Taobao’s AI audio generation pipeline. It introduces a series of eight technical papers (covering image‑to‑video generation, OpenAI o1, multimodal video generation, and large‑model voice synthesis), reviews advances such as VALL‑E, CosyVoice, and F5‑TTS, describes data‑cleaning methods, and presents e‑commerce applications: voice‑cloned live streams, multilingual TTS, AI video‑audio integration, and audiobook production.

DaTaobao Tech

Mar 31, 2025

This article presents a comprehensive overview of AIGC (AI‑generated content) technologies applied to audio and voice synthesis within Taobao’s ecosystem. It begins by introducing a series of eight technical papers covering topics such as image‑to‑video generation, the OpenAI o1 model, multimodal video generation, and large‑model voice synthesis.

The background section highlights the resurgence of AI audio generation around the 2024 National Day holiday, when AI singers became popular, and introduces the two techniques behind it: voice conversion and text‑to‑speech (TTS). Voice conversion changes the timbre of a singing voice without altering its rhythm, while TTS (also called speech synthesis) generates speech from user‑provided text in a specified voice style.

Key technical advances are discussed, including the VALL‑E model, which pairs a language‑model‑based decoder with a residual‑vector‑quantization (RVQ) codec tokenizer so that a voice can be cloned from a 3‑second reference recording. The article also reviews CosyVoice, an open‑source large‑scale TTS model that uses a 3D‑Speaker embedding and Flow Matching for high‑quality synthesis, and newer models such as F5‑TTS that improve speed and pronunciation.
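
To make the RVQ idea concrete, here is a minimal sketch of residual vector quantization in plain Python. It is not VALL‑E's tokenizer: real codecs learn their codebooks from audio, whereas the two tiny codebooks below (`CODEBOOKS`) and the function names are hypothetical and chosen only so the arithmetic is easy to follow. Each stage quantizes whatever the previous stage failed to capture, so a frame becomes a short list of integer codes that a language model can predict.

```python
def nearest_index(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared L2 distance)."""
    def sq_dist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct an approximation by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.25)],
]

codes = rvq_encode((0.9, 0.3), CODEBOOKS)   # [1, 2]
approx = rvq_decode(codes, CODEBOOKS)       # [1.0, 0.25]
```

The first stage picks the coarse codeword `(1.0, 0.0)`, and the second stage refines the leftover error with `(0.0, 0.25)`. Adding more stages gives a finer approximation at the cost of more tokens per frame.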

Data challenges are examined. High‑quality single‑speaker recordings are scarce, and large‑scale datasets often contain background noise, multi‑speaker dialogue, or mixed‑language content, all of which can degrade model performance. The authors describe an automated cleaning pipeline that removes background music, denoises audio, and validates transcripts using cross‑ASR verification.
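
The transcript check can be sketched as follows. This is an illustrative assumption about how cross‑ASR verification might work, not the authors' implementation: the transcripts from two ASR systems are passed in as plain strings (no real ASR library is called), and the function names and the 5% character‑error‑rate threshold are hypothetical. A clip is kept only if both ASR outputs closely match its labeled text.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Lowercase and drop punctuation/whitespace so formatting differences are ignored."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def char_error_rate(reference, hypothesis):
    """Edit distance divided by reference length; 0.0 means identical."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def transcripts_agree(label, asr_a, asr_b, max_cer=0.05):
    """Keep a clip only if both (hypothetical) ASR outputs match the labeled text."""
    ref = normalize(label)
    return all(char_error_rate(ref, normalize(hyp)) <= max_cer
               for hyp in (asr_a, asr_b))


keep = transcripts_agree("Free shipping today!",
                         "free shipping today",
                         "Free shipping, today")      # True
drop = transcripts_agree("Free shipping today",
                         "free shipping today",
                         "three chipping to day")     # False
```

Character‑level comparison is used here because it applies to Chinese and English text alike. A production pipeline would likely tune the threshold per language and log rejected clips for review.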

Practical e‑commerce applications are showcased, including: (1) voice‑cloning for marketing live streams, (2) multilingual (Chinese‑English) TTS for product descriptions, (3) AI‑generated video paired with AI‑generated sound effects and background music, and (4) integration of AI audio into audiobooks and short‑video content to increase user engagement.

The article concludes with a brief team introduction and links to additional reading on related technologies such as 3DXR, terminal tech, and data algorithms.
