Low-Resource Text-to-Speech: FastSpeech, LightTTS, and LightBERT Overview
This article reviews recent advances in low‑resource text‑to‑speech synthesis, covering the background of TTS, challenges in data‑ and compute‑limited scenarios, and detailed descriptions of FastSpeech, LightTTS, LightBERT, and related lightweight vocoder techniques, along with experimental results and future research directions.
Background
Neural network‑based end‑to‑end Text‑to‑Speech (TTS) has progressed rapidly, but data and computation constraints limit its deployment in low‑resource settings.
TTS System Components
A typical TTS pipeline consists of three modules: the frontend (text normalization, grapheme‑to‑phoneme conversion, polyphone classification, prosody prediction), the acoustic model (e.g., Tacotron, FastSpeech, LightTTS), and the vocoder (e.g., Griffin‑Lim, WaveNet, WaveRNN).
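The three‑stage pipeline can be sketched as a chain of functions. The normalization table, phoneme dictionary, and stand‑in acoustic model and vocoder below are illustrative toys assumed for the sketch, not any real system's components:

```python
# Minimal sketch of the three-stage TTS pipeline.
# The tables and models here are hypothetical stand-ins.

NUM_WORDS = {"2": "two", "3": "three"}            # toy text-normalization table
G2P = {"two": ["T", "UW"], "cats": ["K", "AE", "T", "S"],
       "three": ["TH", "R", "IY"]}                # toy grapheme-to-phoneme dict

def frontend(text):
    """Normalize text, then convert each word to phonemes."""
    words = [NUM_WORDS.get(w, w) for w in text.lower().split()]
    phonemes = []
    for w in words:
        phonemes.extend(G2P.get(w, list(w.upper())))
    return phonemes

def acoustic_model(phonemes):
    """Stand-in for Tacotron/FastSpeech: map phonemes to fake 4-dim 'mel frames'."""
    return [[hash(p) % 7 / 7.0] * 4 for p in phonemes]

def vocoder(mel_frames):
    """Stand-in for Griffin-Lim/WaveNet: flatten frames into 'samples'."""
    return [v for frame in mel_frames for v in frame]

waveform = vocoder(acoustic_model(frontend("2 cats")))
```

In a real system each stage is a learned model (or, for Griffin‑Lim, a signal‑processing algorithm); the point of the sketch is only the data flow: text → phonemes → mel frames → waveform.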
Low‑Resource Challenges
Limited paired text‑audio data and scarce compute resources hinder both model training and online inference speed.
FastSpeech
FastSpeech replaces autoregressive decoding with a parallel feed‑forward Transformer built around a length regulator and a duration predictor. It achieves up to a 270× speedup in mel‑spectrogram generation, improves robustness (no repeated or skipped words), makes speech rate controllable, and matches or exceeds the audio quality of autoregressive baselines.
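The length regulator is the piece that makes parallel generation and rate control possible: it expands each phoneme‑level hidden state into the number of mel frames given by the duration predictor, scaled by a factor alpha. A minimal sketch in pure Python (the rounding rule and one‑frame floor are assumptions; FastSpeech predicts durations with a learned network rather than taking them as input):

```python
def length_regulator(hidden_states, durations, alpha=1.0):
    """Expand phoneme-level hidden states into frame-level states.

    durations[i] is the predicted number of mel frames for phoneme i.
    alpha scales all durations: alpha > 1 stretches the output (slower
    speech), alpha < 1 compresses it (faster speech).
    """
    frames = []
    for h, d in zip(hidden_states, durations):
        n = max(1, round(d * alpha))  # assumed: round, with a 1-frame floor
        frames.extend([h] * n)
    return frames

# Three phonemes with predicted durations 2, 3, and 1 frames.
expanded = length_regulator(["h1", "h2", "h3"], [2, 3, 1])
```

Because every frame‑level state is known up front, the decoder can produce all mel frames in one parallel pass instead of one frame at a time, which is where the inference speedup comes from.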
LightTTS
LightTTS targets scenarios with only a few hundred paired samples. It leverages denoising auto‑encoders, back‑translation between TTS and ASR, and bidirectional sequence modeling to achieve MOS scores close to those of fully supervised models.
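The back‑translation step can be sketched as a data‑flow: in each round, the current TTS model synthesizes pseudo audio from unpaired text to create training pairs for ASR, while the current ASR model transcribes unpaired audio to create training pairs for TTS. The stand‑in models below are toys assumed for illustration ("audio" is just an uppercase token sequence); in LightTTS‑style training both models are neural networks refined round after round:

```python
def back_translation_round(unpaired_text, unpaired_audio, tts_model, asr_model):
    """One round of TTS<->ASR back-translation (hypothetical sketch).

    Returns pseudo-paired data for training each direction.
    """
    # TTS turns unpaired text into pseudo (audio, text) pairs for ASR training.
    asr_train_pairs = [(tts_model(t), t) for t in unpaired_text]
    # ASR turns unpaired audio into pseudo (text, audio) pairs for TTS training.
    tts_train_pairs = [(asr_model(a), a) for a in unpaired_audio]
    return asr_train_pairs, tts_train_pairs

# Toy stand-in models: "audio" is an uppercase token list.
toy_tts = lambda text: text.upper().split()
toy_asr = lambda audio: " ".join(audio).lower()

asr_pairs, tts_pairs = back_translation_round(
    ["hello world"], [["GOOD", "DAY"]], toy_tts, toy_asr)
```

The pseudo pairs are noisy, but alternating the two directions lets each model improve from the other's output, which is how a few hundred genuine pairs can bootstrap both systems.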
LightBERT for the TTS Frontend
LightBERT applies two‑stage knowledge distillation (at both pre‑training and fine‑tuning) to compress BERT‑based frontend models, reducing latency from 250 ms to 23 ms while preserving accuracy on polyphone classification.
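At both stages, distillation trains the small student on the large teacher's soft output distribution in addition to the hard labels. A minimal soft‑label loss in pure Python, in the standard temperature‑softmax form; the temperature, the mixing weight alpha, and their values here are illustrative assumptions, not the settings used by LightBERT or LightPAFF:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-label and hard-label cross-entropy.

    alpha and temperature are illustrative hyperparameters.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy against the teacher's softened distribution.
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # Ordinary cross-entropy against the ground-truth class index.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard
```

The softened teacher distribution carries more signal than one‑hot labels (e.g., which wrong pinyin readings of a polyphone the teacher considers plausible), which is what lets a much smaller student retain the teacher's accuracy.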
LightPAFF Framework
LightPAFF generalizes the two‑stage distillation to a range of pre‑training/fine‑tuning models (BERT, GPT‑2, etc.), enabling lightweight yet high‑performing models across tasks.
Experimental Results
FastSpeech, LightTTS, and LightBERT demonstrate significant speedups, robustness, and quality improvements under severe data and compute constraints.
Future Directions
Explore lightweight vocoders, better utilization of noisy or multi‑speaker data, and further acceleration of offline training and online inference.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.