OmniVoice: Open‑Source TTS Model Clones Voices in 600+ Languages with a Single Architecture
OmniVoice, an open‑source TTS system from Xiaomi AI Lab, uses a minimalist bidirectional Transformer and LLM‑enhanced pre‑training to synthesize high‑quality speech in over 600 languages, outperforming commercial systems while offering fine‑grained control and fully public code and models.
OmniVoice is a new multilingual text‑to‑speech (TTS) model released by Xiaomi AI Lab that can synthesize speech in more than 600 languages using a single architecture.
The core of OmniVoice is an extremely simple design: a single bidirectional Transformer performs end‑to‑end text‑to‑speech conversion without separate text modeling, hierarchical token prediction, or mixed modules, making it the most straightforward non‑autoregressive TTS model to date.
Two key innovations enable its performance. First, a full‑codebook random masking strategy dramatically improves training efficiency and overall model capability. Second, large‑language‑model (LLM) parameters are incorporated during pre‑training, which substantially raises intelligibility and resolves the long‑standing “mis‑reading” problem in TTS.
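The article does not spell out the masking procedure, so the following is only a minimal sketch of one plausible reading: every position in the frame-by-codebook grid of audio tokens is a candidate for masking, rather than only selected codebook layers. The names `MASK_ID` and `mask_all_codebooks`, the sentinel value, and the independent per-position draw are illustrative assumptions, not taken from the OmniVoice code.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a masked token (assumption)


def mask_all_codebooks(tokens, mask_ratio, rng):
    """Mask positions at random across every codebook of every frame.

    tokens: list of frames, each frame a list with one token id per codebook.
    mask_ratio: probability in [0, 1] that any single position is masked.
    rng: a random.Random instance, passed in so results are reproducible.
    Returns (masked_tokens, is_masked), both with the same shape as tokens.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    masked_tokens, is_masked = [], []
    for frame in tokens:
        # Each codebook entry is drawn independently, so no layer is exempt.
        flags = [rng.random() < mask_ratio for _ in frame]
        masked_tokens.append(
            [MASK_ID if flag else tok for tok, flag in zip(frame, flags)]
        )
        is_masked.append(flags)
    return masked_tokens, is_masked


# Toy input: 4 frames, 3 codebooks per frame.
toy = [[11, 12, 13], [21, 22, 23], [31, 32, 33], [41, 42, 43]]
rng = random.Random(0)
ratio = rng.random()  # a fresh ratio per training example (assumption)
masked, flags = mask_all_codebooks(toy, ratio, rng)
```

In a training loop of this kind, the model would be asked to predict the original ids at the positions where `flags` is true, using the text and the unmasked tokens as context.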
To support multilingual synthesis, OmniVoice aggregates 50 open‑source speech corpora, processes them for noise reduction and quality filtering, and builds a training set covering 646 languages with a total duration of 580,000 hours. Because data volume varies widely across languages, a dynamic up‑sampling strategy is applied to low‑resource languages to preserve training effectiveness.
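The article does not give the formula behind the dynamic up‑sampling strategy. A common technique for balancing multilingual data, shown here purely as an illustration, is to sample each language in proportion to its data volume raised to an exponent below 1, which raises the share of small languages relative to their raw share. The function name `sampling_probs`, the exponent `alpha`, and the hour counts are hypothetical.

```python
def sampling_probs(hours_by_lang, alpha=0.5):
    """Return per-language sampling probabilities proportional to hours ** alpha.

    alpha = 1.0 reproduces the raw data distribution; smaller values
    shift probability mass toward low-resource languages.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if any(h <= 0 for h in hours_by_lang.values()):
        raise ValueError("every language needs a positive duration")
    scaled = {lang: h ** alpha for lang, h in hours_by_lang.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


# Made-up durations in hours, for illustration only.
hours = {"high_resource": 10_000.0, "mid_resource": 100.0, "low_resource": 1.0}
raw_share = {lang: h / sum(hours.values()) for lang, h in hours.items()}
smoothed = sampling_probs(hours, alpha=0.5)
```

Comparing `raw_share` with `smoothed` shows the low‑resource language being drawn far more often than its raw share of hours would allow, at the cost of repeating its few recordings more frequently.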
Benchmarks show that on a 24‑language test set OmniVoice exceeds several commercial systems in both similarity and intelligibility. On a broader 102‑language test set, its intelligibility approaches or surpasses that of real human speech, and even languages with fewer than 10 hours of training data achieve high‑quality synthesis, substantially lowering the barrier to TTS for low‑resource languages.
The model also demonstrates cross‑language cloning: using a Chinese reference audio, it can generate natural‑sounding Japanese and Korean speech, illustrating the claim that “if you can speak one language, you can speak thousands.”
Beyond multilingual capability, OmniVoice provides extensive controllability: users can specify timbre attributes (gender, age, pitch, dialect, accent) without a reference audio, adapt noisy reference recordings to clean output, insert emotion symbols such as laughter or sighs, and correct pronunciations of homographs or proper nouns via simple token annotations.
All training and inference code, as well as the model weights, are open source on GitHub, and the accompanying paper (arXiv:2604.00688) details the technical approach. Additional resources include a demo page, a Hugging Face Space, and a model repository for immediate experimentation.
