Low-Resource Text-to-Speech: FastSpeech, LightTTS, and LightBERT Overview
This article reviews recent advances in low‑resource text‑to‑speech synthesis, covering the background of TTS, challenges in data‑ and compute‑limited scenarios, and detailed descriptions of FastSpeech, LightTTS, LightBERT, and related lightweight vocoder techniques, along with experimental results and future research directions.
Background
Neural network‑based end‑to‑end Text‑to‑Speech (TTS) has progressed rapidly, but data and computation constraints limit its deployment in low‑resource settings.
TTS System Components
A typical TTS pipeline consists of three modules: the frontend (text normalization, grapheme‑to‑phoneme conversion, polyphone classification, prosody prediction), the acoustic model (e.g., Tacotron, FastSpeech, LightTTS), and the vocoder (e.g., Griffin‑Lim, WaveNet, WaveRNN).
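The three‑stage pipeline can be sketched as a chain of functions. The normalization table, phoneme dictionary, and stand‑in acoustic model and vocoder below are illustrative toys assumed for the sketch, not any real system's components:

```python
# Minimal sketch of the three-stage TTS pipeline.
# The tables and models here are hypothetical stand-ins.

NUM_WORDS = {"2": "two", "3": "three"}            # toy text-normalization table
G2P = {"two": ["T", "UW"], "cats": ["K", "AE", "T", "S"],
       "three": ["TH", "R", "IY"]}                # toy grapheme-to-phoneme dict

def frontend(text):
    """Normalize text, then convert each word to phonemes."""
    words = [NUM_WORDS.get(w, w) for w in text.lower().split()]
    phonemes = []
    for w in words:
        phonemes.extend(G2P.get(w, list(w.upper())))
    return phonemes

def acoustic_model(phonemes):
    """Stand-in for Tacotron/FastSpeech: map phonemes to fake 4-dim 'mel frames'."""
    return [[hash(p) % 7 / 7.0] * 4 for p in phonemes]

def vocoder(mel_frames):
    """Stand-in for Griffin-Lim/WaveNet: flatten frames into 'samples'."""
    return [v for frame in mel_frames for v in frame]

waveform = vocoder(acoustic_model(frontend("2 cats")))
```

In a real system each stage is a learned model (or, for Griffin‑Lim, a signal‑processing algorithm); the point of the sketch is only the data flow: text → phonemes → mel frames → waveform.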
Low‑Resource Challenges
Limited paired text‑audio data and scarce compute resources hinder both model training and online inference speed.
FastSpeech
FastSpeech replaces autoregressive decoding with a parallel feed‑forward Transformer built around a length regulator and a duration predictor. It achieves up to a 270× speedup in mel‑spectrogram generation, improves robustness (no repeated or skipped words), makes speech rate controllable, and matches or exceeds the audio quality of autoregressive baselines.
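The length regulator is the piece that makes parallel generation and rate control possible: it expands each phoneme‑level hidden state into the number of mel frames given by the duration predictor, scaled by a factor alpha. A minimal sketch in pure Python (the rounding rule and one‑frame floor are assumptions; FastSpeech predicts durations with a learned network rather than taking them as input):

```python
def length_regulator(hidden_states, durations, alpha=1.0):
    """Expand phoneme-level hidden states into frame-level states.

    durations[i] is the predicted number of mel frames for phoneme i.
    alpha scales all durations: alpha > 1 stretches the output (slower
    speech), alpha < 1 compresses it (faster speech).
    """
    frames = []
    for h, d in zip(hidden_states, durations):
        n = max(1, round(d * alpha))  # assumed: round, with a 1-frame floor
        frames.extend([h] * n)
    return frames

# Three phonemes with predicted durations 2, 3, and 1 frames.
expanded = length_regulator(["h1", "h2", "h3"], [2, 3, 1])
```

Because every frame‑level state is known up front, the decoder can produce all mel frames in one parallel pass instead of one frame at a time, which is where the inference speedup comes from.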
LightTTS
LightTTS targets scenarios with only a few hundred paired samples. It leverages denoising auto‑encoders, back‑translation between TTS and ASR, and bidirectional sequence modeling to achieve MOS scores close to those of fully supervised models.
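The back‑translation step can be sketched as a data‑flow: in each round, the current TTS model synthesizes pseudo audio from unpaired text to create training pairs for ASR, while the current ASR model transcribes unpaired audio to create training pairs for TTS. The stand‑in models below are toys assumed for illustration ("audio" is just an uppercase token sequence); in LightTTS‑style training both models are neural networks refined round after round:

```python
def back_translation_round(unpaired_text, unpaired_audio, tts_model, asr_model):
    """One round of TTS<->ASR back-translation (hypothetical sketch).

    Returns pseudo-paired data for training each direction.
    """
    # TTS turns unpaired text into pseudo (audio, text) pairs for ASR training.
    asr_train_pairs = [(tts_model(t), t) for t in unpaired_text]
    # ASR turns unpaired audio into pseudo (text, audio) pairs for TTS training.
    tts_train_pairs = [(asr_model(a), a) for a in unpaired_audio]
    return asr_train_pairs, tts_train_pairs

# Toy stand-in models: "audio" is an uppercase token list.
toy_tts = lambda text: text.upper().split()
toy_asr = lambda audio: " ".join(audio).lower()

asr_pairs, tts_pairs = back_translation_round(
    ["hello world"], [["GOOD", "DAY"]], toy_tts, toy_asr)
```

The pseudo pairs are noisy, but alternating the two directions lets each model improve from the other's output, which is how a few hundred genuine pairs can bootstrap both systems.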
LightBERT for the TTS Frontend
LightBERT applies two‑stage knowledge distillation (at both pre‑training and fine‑tuning) to compress BERT‑based frontend models, reducing latency from 250 ms to 23 ms while preserving accuracy on polyphone classification.
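At both stages, distillation trains the small student on the large teacher's soft output distribution in addition to the hard labels. A minimal soft‑label loss in pure Python, in the standard temperature‑softmax form; the temperature, the mixing weight alpha, and their values here are illustrative assumptions, not the settings used by LightBERT or LightPAFF:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-label and hard-label cross-entropy.

    alpha and temperature are illustrative hyperparameters.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy against the teacher's softened distribution.
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # Ordinary cross-entropy against the ground-truth class index.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard
```

The softened teacher distribution carries more signal than one‑hot labels (e.g., which wrong pinyin readings of a polyphone the teacher considers plausible), which is what lets a much smaller student retain the teacher's accuracy.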
LightPAFF Framework
LightPAFF generalizes the two‑stage distillation to a range of pre‑training/fine‑tuning models (BERT, GPT‑2, etc.), enabling lightweight yet high‑performing models across tasks.
Experimental Results
FastSpeech, LightTTS, and LightBERT demonstrate significant speedups, robustness, and quality improvements under severe data and compute constraints.
Future Directions
Explore lightweight vocoders, better utilization of noisy or multi‑speaker data, and further acceleration of offline training and online inference.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.