
Multi-Stage Multi-Codebook VQ-VAE for High-Performance Neural Text-to-Speech (MSMC‑TTS)

The MSMC‑TTS system, a multi‑stage multi‑codebook VQ‑VAE based neural text‑to‑speech solution, delivers near‑human audio quality (MOS 4.41) with a compact 3.12 MB acoustic model, substantially surpassing Mel‑Spectrogram FastSpeech baselines in naturalness and efficiency.

Xiaohongshu Tech REDtech

The Xiaohongshu multimedia intelligent algorithm team and the Chinese University of Hong Kong jointly proposed a high‑performance neural TTS solution called MSMC‑TTS, which is based on a multi‑stage, multi‑codebook VQ‑VAE representation. Compared with a Mel‑Spectrogram based FastSpeech baseline, MSMC‑TTS shows clear improvements in audio quality and naturalness.

Typical TTS pipelines consist of a feature extractor, an acoustic model, and a neural vocoder. When acoustic features such as Mel‑Spectrograms are predicted from text, a distribution gap often exists between predicted and real features, making it difficult for the vocoder to generate high‑quality audio.

The proposed approach uses a Vector‑Quantized Variational Auto‑Encoder (VQ‑VAE) whose encoder maps acoustic feature sequences to latent sequences that are quantized by several codebooks. This yields compact discrete representations, but heavier quantization discards information and reduces the completeness of the representation. To balance compactness and completeness, a Multi‑Head Vector Quantization (MHVQ) method splits each codebook into sub‑codebooks along the feature‑dimension axis, allowing higher codebook utilization without increasing the parameter count.
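The head-splitting idea can be illustrated with a minimal numpy sketch (not the paper's implementation; the function names, codebook size, and nearest-neighbour lookup are illustrative assumptions). With H heads, each of the K codewords is split into H sub-codewords, so the quantizer can represent up to K^H combinations per frame from the same parameters:

```python
import numpy as np

def quantize(latents, codebook):
    """Plain nearest-neighbour VQ: map each latent vector (row of `latents`)
    to its closest codeword under Euclidean distance."""
    # latents: (T, D), codebook: (K, D)
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)                  # (T,) chosen codeword indices
    return codebook[idx], idx

def multi_head_quantize(latents, codebook, n_heads):
    """Multi-head VQ (illustrative): split the feature dimension into
    `n_heads` chunks, split the codebook the same way, and quantize each
    chunk against its own sub-codebook -- more combinations, same parameters."""
    sub_latents   = np.split(latents,  n_heads, axis=1)   # each (T, D/H)
    sub_codebooks = np.split(codebook, n_heads, axis=1)   # each (K, D/H)
    quantized = [quantize(z, cb)[0] for z, cb in zip(sub_latents, sub_codebooks)]
    return np.concatenate(quantized, axis=1)              # (T, D)

# Toy usage: 5 frames of 8-dim latents, a 16-codeword codebook, 4 heads.
rng = np.random.default_rng(0)
z  = rng.normal(size=(5, 8))
cb = rng.normal(size=(16, 8))
q  = multi_head_quantize(z, cb, n_heads=4)
```

Here a single 16-entry codebook behaves like 16^4 = 65,536 joint codewords, which is the sense in which MHVQ raises representational capacity without adding parameters.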

Multi‑stage multi‑codebook representation (MSMCR) is obtained by encoding acoustic features at multiple temporal resolutions through a series of encoders, then progressively decoding them. The resulting set of latent sequences, each with a different time resolution, forms the MSMCR used by the TTS system.
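The multi-resolution aspect can be sketched as follows, using simple average pooling as a stand-in for the learned encoders (the real system uses neural encoders; the pooling rates here are arbitrary assumptions):

```python
import numpy as np

def downsample(x, rate):
    """Average-pool a feature sequence x of shape (T, D) along time by `rate`
    (a crude stand-in for a learned downsampling encoder)."""
    T = (x.shape[0] // rate) * rate           # drop any trailing remainder
    return x[:T].reshape(-1, rate, x.shape[1]).mean(axis=1)

def multi_stage_encode(features, rates=(4, 2, 1)):
    """Encode acoustic features at several temporal resolutions, coarse to
    fine; the resulting list of sequences plays the role of the MSMCR."""
    return [downsample(features, r) for r in rates]

# Toy usage: 100 frames of 80-dim features at 1/4, 1/2, and full resolution.
feats = np.random.default_rng(0).normal(size=(100, 80))
stages = multi_stage_encode(feats)
```

In the actual system each stage's sequence would additionally be quantized by its own codebooks, and decoding proceeds from the coarsest stage to the finest.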

MSMC‑TTS consists of three parts: analysis, synthesis, and prediction. During training, audio is first converted to high‑completeness features (e.g., Mel‑Spectrograms), which train the VQ‑VAE‑based feature analyzer to produce MSMCR. The acoustic model and neural vocoder are then trained on these representations. At inference, the acoustic model predicts MSMCR from text, and the vocoder generates the final waveform.
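The analysis/synthesis/prediction split described above can be summarized with stub components (all shapes, hop length, and internals are hypothetical stand-ins for the neural modules; only the data flow matches the description):

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_analyzer(mel):
    """Analysis stand-in: quantize Mel frames against a random codebook,
    mimicking the VQ-VAE analyzer that produces MSMCR targets."""
    codebook = rng.normal(size=(64, mel.shape[1]))
    idx = np.linalg.norm(mel[:, None] - codebook[None], axis=-1).argmin(axis=1)
    return codebook[idx]

def acoustic_model(text_ids, n_frames, dim):
    """Prediction stand-in: text -> MSMCR (random features of the right
    shape; `text_ids` is unused in this stub)."""
    return rng.normal(size=(n_frames, dim))

def vocoder(msmcr):
    """Synthesis stand-in: MSMCR -> waveform via dummy upsampling with an
    assumed hop length of 256 samples per frame."""
    return np.repeat(msmcr.mean(axis=1), 256)

# Training-time flow: audio -> Mel -> analyzer yields MSMCR targets; the
# acoustic model and vocoder would then be trained on these representations.
mel = rng.normal(size=(100, 80))
msmcr_target = feature_analyzer(mel)

# Inference-time flow: text -> predicted MSMCR -> waveform.
pred = acoustic_model(text_ids=[3, 7, 7], n_frames=100, dim=80)
wav = vocoder(pred)
```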

A multi‑stage predictor, built on FastSpeech, first encodes the input text, upsamples it according to predicted durations, then downsamples it to match the various MSMCR resolutions. Decoders then decode progressively from low to high resolution, with a combination of MSE loss and triplet loss encouraging accurate codeword prediction.
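One plausible reading of the combined objective, sketched in numpy (the exact triplet formulation in the paper may differ; the hardest-negative mining and margin value here are assumptions):

```python
import numpy as np

def mse_loss(pred, target):
    """Regression term: mean squared error between predicted and target
    MSMCR vectors."""
    return float(np.mean((pred - target) ** 2))

def triplet_loss(pred, target_idx, codebook, margin=1.0):
    """Triplet term (illustrative): each predicted vector (anchor) should be
    closer to its target codeword (positive) than to the nearest other
    codeword (hardest negative), by at least `margin`."""
    losses = []
    for anchor, pos_i in zip(pred, target_idx):
        d = np.linalg.norm(codebook - anchor, axis=1)  # distance to all codewords
        d_pos = d[pos_i]
        d_neg = np.delete(d, pos_i).min()              # hardest negative
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses))

# Toy usage with a 4-codeword one-hot codebook.
codebook = np.eye(4)
good = triplet_loss(np.array([[0.9, 0.0, 0.0, 0.0]]), [0], codebook)  # near target
bad  = triplet_loss(np.array([[0.0, 1.0, 0.0, 0.0]]), [0], codebook)  # near a rival
```

A prediction near its target codeword incurs zero triplet loss, while one that sits on a rival codeword is penalized, which is how the term pushes the predictor toward selecting the correct codebook entries.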

Experiments on the public single‑speaker Nancy dataset (Blizzard Challenge 2011) show that MSMC‑TTS achieves a MOS of 4.41 (original recordings 4.50) while the Mel‑Spectrogram FastSpeech baseline scores 3.62 (tuned baseline 3.69). Moreover, MSMC‑TTS maintains high quality (MOS 4.47) with only 3.12 MB of acoustic‑model parameters, demonstrating lower modeling complexity.

Further analysis of model variants (V1: single‑stage single‑codebook, V2: V1 with 4‑head VQ, V3: V2 with two‑stage modeling) confirms that MHVQ improves completeness and that multi‑stage modeling yields the best TTS performance, both in naturalness and audio quality.

In summary, by focusing on compact speech representations, the proposed MSMC‑TTS framework delivers high‑quality neural TTS with reduced model size and complexity, outperforming mainstream Mel‑Spectrogram based FastSpeech systems.

Tags: VQ-VAE, Compact Representation, Multi-Stage Modeling, Neural TTS, Speech Synthesis, text-to-speech
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
