OmniVoice: Open‑Source TTS Model Clones Voices in 600+ Languages with a Single Architecture

OmniVoice, an open-source TTS system from Xiaomi AI Lab, uses a minimalist bidirectional Transformer and LLM-enhanced pre-training to synthesize high-quality speech in more than 600 languages. It is reported to outperform several commercial systems on a 24-language benchmark while offering fine-grained control, and its code and models are fully public.

Xiaomi Tech

OmniVoice is a new multilingual text‑to‑speech (TTS) model released by Xiaomi AI Lab that can synthesize speech in more than 600 languages using a single architecture.

The core of OmniVoice is an extremely simple design: a single bidirectional Transformer performs end-to-end text-to-speech conversion without separate text modeling, hierarchical token prediction, or hybrid modules. Xiaomi describes it as the most straightforward non-autoregressive TTS model to date.

Two key innovations enable its performance. First, a full‑codebook random masking strategy dramatically improves training efficiency and overall model capability. Second, large‑language‑model (LLM) parameters are incorporated during pre‑training, which substantially raises intelligibility and resolves the long‑standing “mis‑reading” problem in TTS.
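As a rough illustration of the first idea, the sketch below masks positions at random across every codebook of a multi-codebook audio token grid, instead of restricting masking to one codebook layer. The article does not give the exact procedure, so the grid layout, the `MASK_ID` value, the uniformly sampled mask ratio, and the function name `mask_all_codebooks` are all assumptions made for illustration.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token


def mask_all_codebooks(tokens, rng):
    """Randomly mask entries across every codebook of a token grid.

    tokens: list of K lists (one per codebook), each holding T token ids.
    Returns (masked_tokens, mask), where mask[k][t] is True if masked.
    """
    ratio = rng.random()  # assumption: mask ratio drawn uniformly from [0, 1)
    masked, mask = [], []
    for row in tokens:
        # Each position in each codebook is masked independently.
        row_mask = [rng.random() < ratio for _ in row]
        masked.append([MASK_ID if m else tok for tok, m in zip(row, row_mask)])
        mask.append(row_mask)
    return masked, mask


rng = random.Random(0)
# Toy grid: K=4 codebooks, T=6 frames, ids drawn from a 1024-entry codebook.
tokens = [[rng.randrange(1024) for _ in range(6)] for _ in range(4)]
masked, mask = mask_all_codebooks(tokens, rng)
```

In a masked-prediction setup of this kind, the model would be trained to recover the original ids at the masked positions from the text and the remaining unmasked tokens.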

To support multilingual synthesis, OmniVoice aggregates 50 open-source speech corpora, applies noise reduction and quality filtering, and builds a training set covering 646 languages with a total duration of 580,000 hours. Because data volume varies widely across languages, low-resource languages are dynamically up-sampled during training so that they are not drowned out by high-resource ones.
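The article does not describe how the up-sampling is computed. One common way to rebalance multilingual data is temperature-style sampling, where each language is drawn with probability proportional to its data share raised to a power `alpha` below 1. The sketch below shows that generic technique, not necessarily the one OmniVoice uses; the function name, the `alpha` value, and the hour counts are made up for illustration.

```python
def sampling_weights(hours_by_lang, alpha=0.5):
    """Per-language sampling probabilities proportional to (share ** alpha).

    alpha = 1.0 reproduces the raw data distribution; smaller values
    shift probability mass toward low-resource languages.
    """
    total = sum(hours_by_lang.values())
    scaled = {lang: (h / total) ** alpha for lang, h in hours_by_lang.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Made-up hour counts, for illustration only.
hours = {"high_resource": 10000.0, "mid_resource": 100.0, "low_resource": 1.0}
weights = sampling_weights(hours, alpha=0.5)
```

With these toy numbers, the low-resource language's share of sampled batches rises from about 0.01% of the raw data to roughly 0.9%.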

Benchmarks show that on a 24-language test set OmniVoice exceeds several commercial systems in both speaker similarity and intelligibility. On a broader 102-language test, its intelligibility approaches or surpasses that of real human speech, and even languages with fewer than 10 hours of training data achieve high-quality synthesis, which substantially lowers the barrier to TTS for low-resource languages.
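Intelligibility in TTS evaluations is commonly measured by transcribing the synthesized audio with a speech recognizer and computing the character error rate (CER) against the input text; the article's own chart reports CER. The sketch below computes CER from two strings using edit distance. It is a generic illustration: the function names are hypothetical, and the article does not describe the recognizer or the text normalization used in its evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


score = cer("kitten", "sitting")  # 3 edits over 6 reference characters
```

A lower CER means the recognizer recovered the intended text more faithfully, which is why it serves as a proxy for intelligibility.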

The model also demonstrates cross‑language cloning: using a Chinese reference audio, it can generate natural‑sounding Japanese and Korean speech, illustrating the claim that “if you can speak one language, you can speak thousands.”

Beyond multilingual capability, OmniVoice provides extensive controllability: users can specify timbre attributes (gender, age, pitch, dialect, accent) without a reference audio, adapt noisy reference recordings to clean output, insert emotion symbols such as laughter or sighs, and correct pronunciations of homographs or proper nouns via simple token annotations.

All training and inference code, as well as the model weights, are open source on GitHub, and the accompanying paper (arXiv:2604.00688) details the technical approach. Additional resources include a demo page, a Hugging Face Space, and a model repository for immediate experimentation.

[Figure: OmniVoice model architecture]
[Figure: Performance comparison on 24 languages]
[Figure: Performance comparison on 102 languages]
[Figure: CER vs training duration chart]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Transformer, Open Source, TTS, speech cloning, multilingual speech synthesis, OmniVoice