AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization
The article presents 58 Tongcheng AI Lab's AI sales assistant, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.
58 Tongcheng AI Lab has been focusing on conversational AI since 2017, developing products such as intelligent outbound calls, speech quality inspection, and speech synthesis, and unifying them into the "Lingxi" platform; the AI sales assistant is a key application that automates lead acquisition and significantly improves sales efficiency.
The assistant replaces manual lead‑claiming by automatically dialing leads, using multi‑turn dialogue to identify convertible opportunities, and evaluating performance with metrics such as answer‑outbound rate (the product of answer rate, outbound conversion rate, and compliance rate).
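The composite metric can be sketched as a simple product of the three component rates; the function and argument names below are illustrative assumptions, not the lab's definitions:

```python
def answer_outbound_rate(answer_rate: float,
                         conversion_rate: float,
                         compliance_rate: float) -> float:
    """Composite funnel metric described above: the product of the
    answer, outbound-conversion, and compliance rates."""
    return answer_rate * conversion_rate * compliance_rate

# Example: 60% answer rate, 30% conversion, 95% compliance
rate = answer_outbound_rate(0.6, 0.3, 0.95)
print(f"{rate:.3f}")
```

Because the metric is multiplicative, a drop in any single component drags the whole number down proportionally, which is why each stage is tracked separately.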
For voice cloning, the team leverages real sales call recordings: VAD extracts 2‑10 s voice segments, which are up‑sampled to 16 kHz and filtered with an open‑source quality‑scoring model, keeping only segments that score above 3 on both of its two dimensions. Four to five speakers are sampled for each of 32 cities; their audio is transcribed with ASR, manually corrected, loudness‑normalized, and aligned with MFA to obtain the phoneme durations used for model training.
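The duration and quality gates above might look like the following minimal sketch, assuming the quality model returns two per‑segment scores (the field names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    duration_s: float      # segment length from VAD
    clarity_score: float   # first quality dimension (assumed name)
    noise_score: float     # second quality dimension (assumed name)

def keep_segment(seg: Segment,
                 min_s: float = 2.0, max_s: float = 10.0,
                 threshold: float = 3.0) -> bool:
    """Apply the duration gate (2-10 s) and the quality gate
    (both scores > 3) described above."""
    return (min_s <= seg.duration_s <= max_s
            and seg.clarity_score > threshold
            and seg.noise_score > threshold)

segments = [
    Segment("a.wav", 4.2, 3.8, 3.5),   # passes both gates
    Segment("b.wav", 1.1, 4.0, 4.0),   # too short
    Segment("c.wav", 6.0, 2.9, 3.6),   # low clarity score
]
kept = [s.path for s in segments if keep_segment(s)]
print(kept)  # ['a.wav']
```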
The front‑end text analysis pipeline performs text normalization (especially for English and numbers), tokenization, phoneme conversion, and prosody analysis, marking three pause levels (#1 word pause, #2 comma‑type pause, #3 sentence‑ending pause).
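A toy illustration of the pause markers, using punctuation as a stand‑in for the real prosody model (the actual front end also predicts #1 word‑level pauses, which simple punctuation rules cannot recover):

```python
import re

def annotate_prosody(text: str) -> str:
    """Punctuation-based sketch of prosody marking: #2 for
    comma-type pauses, #3 for sentence-ending pauses."""
    text = re.sub(r"[,，、]\s*", " #2 ", text)   # comma-type pause
    text = re.sub(r"[.。!！?？]\s*", " #3 ", text)  # sentence-ending pause
    return text.strip()

print(annotate_prosody("Hello, welcome to our service."))
# Hello #2 welcome to our service #3
```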
The acoustic model is FastSpeech2, augmented with speaker embeddings and a Conformer encoder for clearer articulation. The vocoder is a Multi‑Band MelGAN trained on a mixture of synthesized spectrograms and original audio, improving synthesis speed by 94 % over a Tacotron2 baseline. Long‑sentence synthesis is stabilized by segmenting the text, masking the spectrograms, and stitching the results back together.
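FastSpeech2's length‑regulation step, which the duration‑based controls depend on, can be sketched with NumPy: each phoneme's hidden state is repeated for its predicted number of frames. This is a sketch under assumed shapes, not the lab's implementation:

```python
import numpy as np

def length_regulate(phoneme_hidden: np.ndarray,
                    durations: np.ndarray,
                    speed: float = 1.0) -> np.ndarray:
    """FastSpeech2-style length regulation: repeat each phoneme's
    hidden vector by its predicted duration in frames. speed > 1
    shrinks durations, yielding faster speech."""
    frames = np.maximum(1, np.round(durations / speed)).astype(int)
    return np.repeat(phoneme_hidden, frames, axis=0)

h = np.random.randn(3, 8)   # 3 phonemes, 8-dim encoder outputs
d = np.array([2, 4, 3])     # predicted frame counts per phoneme
mel_input = length_regulate(h, d)
print(mel_input.shape)      # (9, 8)
```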
Quality and stability improvements include noise and reverberation removal, careful handling of Trim‑Silence during alignment, and selective deletion of heavily noisy recordings. Additional experiments with audio super‑resolution and HiFi‑GAN were abandoned because the gains were limited.
To enhance naturalness for multiple accents and spoken styles, the team applies audio‑quality optimization, pronunciation stability techniques, and a text‑style transfer model based on PromptCLUE/T5 that rewrites formal text into colloquial, accent‑aware versions, using prompts that specify speaker ID and desired style.
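A rewrite prompt might be assembled as below. The article only states that prompts carry the speaker ID and target style, so this template and its names are assumptions:

```python
def build_style_prompt(text: str, speaker_id: str, style: str) -> str:
    """Assemble a PromptCLUE/T5-style rewrite prompt that asks the
    model to colloquialize text for a given speaker and accent/style.
    The template is a hypothetical sketch."""
    return (f"Rewrite the following sentence in the colloquial style "
            f"of speaker {speaker_id} ({style}): {text}")

prompt = build_style_prompt("Your order has shipped.",
                            "spk_021", "Sichuan accent, casual")
print(prompt)
```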
Deployment splits the system into a front‑end text‑analysis service and a back‑end service that packages the acoustic model and vocoder together, served via TensorFlow Serving with an average real‑time factor of 0.02. Controllable attributes such as speech speed, pitch, and precise pauses are handled by scaling FastSpeech2 durations and inserting silent frames into the spectrogram.
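Inserting a precise pause by splicing silent frames into the mel spectrogram could look like the following sketch; the hop size and silence floor are assumed values, not the lab's settings:

```python
import numpy as np

def insert_pause(mel: np.ndarray, frame_idx: int, pause_ms: float,
                 hop_ms: float = 12.5,
                 silence_db: float = -80.0) -> np.ndarray:
    """Splice silent frames into a mel spectrogram at frame_idx to
    realize a pause of roughly pause_ms milliseconds."""
    n_frames = int(round(pause_ms / hop_ms))
    silence = np.full((n_frames, mel.shape[1]), silence_db)
    return np.concatenate([mel[:frame_idx], silence, mel[frame_idx:]],
                          axis=0)

mel = np.zeros((100, 80))                  # 100 frames, 80 mel bins
out = insert_pause(mel, 50, pause_ms=250)  # 250 ms ≈ 20 frames @ 12.5 ms hop
print(out.shape)  # (120, 80)
```

Operating on the spectrogram rather than the waveform keeps the pause exact to the frame and lets the vocoder render the silence consistently with the surrounding audio.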
Future work aims to further improve naturalness and consistency, accelerate cloning training (including zero‑shot approaches), and continue refining noise‑robustness and multi‑accent capabilities.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.