May 1, 2026 · Artificial Intelligence

How Speech Models Turn Waveforms into Computable Tokens

The article explains why speech tokenization is essential for large audio models, outlines three core challenges, compares five major tokenization paradigms—including neural codecs with vector quantization, self‑supervised learning with clustering, continuous embeddings, ASR‑derived text tokens, and hierarchical multi‑codebook tokens—and provides practical guidance for selecting the right approach based on task requirements and trade‑offs.

audio codechierarchical tokensself-supervised learning

0 likes · 11 min read

How Speech Models Turn Waveforms into Computable Tokens

Bilibili Tech

Jul 11, 2025 · Artificial Intelligence

IndexTTS2: Emotionally Expressive, Duration-Controlled Zero-Shot TTS

IndexTTS2 introduces a novel auto-regressive zero-shot text-to-speech model that achieves precise duration control and fine-grained emotional expression through a universal time‑encoding mechanism, decoupled voice‑style and emotion modeling, and a GPT‑style latent feature, outperforming state‑of‑the‑art baselines across multiple benchmarks.

duration controlemotional synthesisspeech generation

0 likes · 23 min read

IndexTTS2: Emotionally Expressive, Duration-Controlled Zero-Shot TTS