AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization
The article presents 58 Tongcheng AI Lab's AI sales assistant, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.
58 Tongcheng AI Lab has been focusing on conversational AI since 2017, developing products such as intelligent outbound calls, speech quality inspection, and speech synthesis, and unifying them into the "Lingxi" platform; the AI sales assistant is a key application that automates lead acquisition and significantly improves sales efficiency.
The assistant replaces manual lead‑claiming by automatically dialing leads, using multi‑turn dialogue to identify convertible opportunities, and evaluating performance with metrics such as answer‑outbound rate (the product of answer rate, outbound conversion rate, and compliance rate).
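The composite metric can be sketched as a simple product of the three component rates; the function and argument names below are illustrative assumptions, not the lab's definitions:

```python
def answer_outbound_rate(answer_rate: float,
                         conversion_rate: float,
                         compliance_rate: float) -> float:
    """Composite funnel metric described above: the product of the
    answer, outbound-conversion, and compliance rates."""
    return answer_rate * conversion_rate * compliance_rate

# Example: 60% answer rate, 30% conversion, 95% compliance
rate = answer_outbound_rate(0.6, 0.3, 0.95)
print(f"{rate:.3f}")
```

Because the metric is multiplicative, a drop in any single component drags the whole number down proportionally, which is why each stage is tracked separately.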
For voice cloning, the team leverages real sales call recordings: VAD extracts 2‑10 s voice segments, which are up‑sampled to 16 kHz and filtered with an open‑source quality‑scoring model, keeping only segments that score above 3 on both of its two dimensions. Four to five speakers are sampled for each of 32 cities; their audio is transcribed with ASR, manually corrected, loudness‑normalized, and aligned with MFA to obtain the phoneme durations used for model training.
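The duration and quality gates above might look like the following minimal sketch, assuming the quality model returns two per‑segment scores (the field names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    duration_s: float      # segment length from VAD
    clarity_score: float   # first quality dimension (assumed name)
    noise_score: float     # second quality dimension (assumed name)

def keep_segment(seg: Segment,
                 min_s: float = 2.0, max_s: float = 10.0,
                 threshold: float = 3.0) -> bool:
    """Apply the duration gate (2-10 s) and the quality gate
    (both scores > 3) described above."""
    return (min_s <= seg.duration_s <= max_s
            and seg.clarity_score > threshold
            and seg.noise_score > threshold)

segments = [
    Segment("a.wav", 4.2, 3.8, 3.5),   # passes both gates
    Segment("b.wav", 1.1, 4.0, 4.0),   # too short
    Segment("c.wav", 6.0, 2.9, 3.6),   # low clarity score
]
kept = [s.path for s in segments if keep_segment(s)]
print(kept)  # ['a.wav']
```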
The front‑end text analysis pipeline performs text normalization (especially for English and numbers), tokenization, phoneme conversion, and prosody analysis, marking three pause levels (#1 word pause, #2 comma‑type pause, #3 sentence‑ending pause).
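A toy illustration of the pause markers, using punctuation as a stand‑in for the real prosody model (the actual front end also predicts #1 word‑level pauses, which simple punctuation rules cannot recover):

```python
import re

def annotate_prosody(text: str) -> str:
    """Punctuation-based sketch of prosody marking: #2 for
    comma-type pauses, #3 for sentence-ending pauses."""
    text = re.sub(r"[,，、]\s*", " #2 ", text)   # comma-type pause
    text = re.sub(r"[.。!！?？]\s*", " #3 ", text)  # sentence-ending pause
    return text.strip()

print(annotate_prosody("Hello, welcome to our service."))
# Hello #2 welcome to our service #3
```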
The acoustic model is FastSpeech2, augmented with speaker embeddings and a Conformer encoder for clearer articulation. The vocoder is a Multi‑Band MelGAN trained on a mixture of synthesized spectrograms and original audio, improving synthesis speed by 94 % over a Tacotron2 baseline. Long‑sentence synthesis is stabilized by segmenting the text, masking the spectrograms, and stitching the results back together.
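FastSpeech2's length‑regulation step, which the duration‑based controls depend on, can be sketched with NumPy: each phoneme's hidden state is repeated for its predicted number of frames. This is a sketch under assumed shapes, not the lab's implementation:

```python
import numpy as np

def length_regulate(phoneme_hidden: np.ndarray,
                    durations: np.ndarray,
                    speed: float = 1.0) -> np.ndarray:
    """FastSpeech2-style length regulation: repeat each phoneme's
    hidden vector by its predicted duration in frames. speed > 1
    shrinks durations, yielding faster speech."""
    frames = np.maximum(1, np.round(durations / speed)).astype(int)
    return np.repeat(phoneme_hidden, frames, axis=0)

h = np.random.randn(3, 8)   # 3 phonemes, 8-dim encoder outputs
d = np.array([2, 4, 3])     # predicted frame counts per phoneme
mel_input = length_regulate(h, d)
print(mel_input.shape)      # (9, 8)
```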
Quality and stability improvements include noise and reverberation removal, careful handling of Trim‑Silence during alignment, and selective deletion of heavily noisy recordings. Additional experiments with audio super‑resolution and HiFi‑GAN were abandoned because the gains were limited.
To enhance naturalness for multiple accents and spoken styles, the team applies audio‑quality optimization, pronunciation stability techniques, and a text‑style transfer model based on PromptCLUE/T5 that rewrites formal text into colloquial, accent‑aware versions, using prompts that specify speaker ID and desired style.
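A rewrite prompt might be assembled as below. The article only states that prompts carry the speaker ID and target style, so this template and its names are assumptions:

```python
def build_style_prompt(text: str, speaker_id: str, style: str) -> str:
    """Assemble a PromptCLUE/T5-style rewrite prompt that asks the
    model to colloquialize text for a given speaker and accent/style.
    The template is a hypothetical sketch."""
    return (f"Rewrite the following sentence in the colloquial style "
            f"of speaker {speaker_id} ({style}): {text}")

prompt = build_style_prompt("Your order has shipped.",
                            "spk_021", "Sichuan accent, casual")
print(prompt)
```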
Deployment splits the system into a front‑end text‑analysis service and a back‑end service that packages the acoustic model and vocoder together, served via TensorFlow Serving with an average real‑time factor of 0.02. Controllable attributes such as speech speed, pitch, and precise pauses are handled by scaling FastSpeech2 durations and inserting silent frames into the spectrogram.
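Inserting a precise pause by splicing silent frames into the mel spectrogram could look like the following sketch; the hop size and silence floor are assumed values, not the lab's settings:

```python
import numpy as np

def insert_pause(mel: np.ndarray, frame_idx: int, pause_ms: float,
                 hop_ms: float = 12.5,
                 silence_db: float = -80.0) -> np.ndarray:
    """Splice silent frames into a mel spectrogram at frame_idx to
    realize a pause of roughly pause_ms milliseconds."""
    n_frames = int(round(pause_ms / hop_ms))
    silence = np.full((n_frames, mel.shape[1]), silence_db)
    return np.concatenate([mel[:frame_idx], silence, mel[frame_idx:]],
                          axis=0)

mel = np.zeros((100, 80))                  # 100 frames, 80 mel bins
out = insert_pause(mel, 50, pause_ms=250)  # 250 ms ≈ 20 frames @ 12.5 ms hop
print(out.shape)  # (120, 80)
```

Operating on the spectrogram rather than the waveform keeps the pause exact to the frame and lets the vocoder render the silence consistently with the surrounding audio.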
Future work aims to further improve naturalness and consistency, accelerate cloning training (including zero‑shot approaches), and continue refining noise‑robustness and multi‑accent capabilities.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.