Qwen3-ASR: Open‑Source Speech Recognition Supporting 52 Languages and Dialects, Outperforming Whisper
The Qwen3‑ASR series, now open‑sourced by Alibaba, comprises three models (1.7B, 0.6B, and a 0.6B forced aligner). It covers 52 languages and dialects (30 languages plus 22 Chinese dialects), supports streaming and offline inference, reaches an aggregate throughput of about 2000× real time at an RTF of 0.064 under 128 concurrent requests, and can transcribe singing over background music. This article covers deployment, benchmarks, and comparisons with other ASR solutions.
Overview
The Qwen team has released the Qwen3‑ASR family as open‑source speech‑recognition models. The series consists of three models: Qwen3‑ASR‑1.7B (≈2 billion parameters, flagship accuracy), Qwen3‑ASR‑0.6B (≈0.9 billion parameters, balanced speed and accuracy), and Qwen3‑ForcedAligner‑0.6B (≈0.9 billion parameters, timestamp alignment). Together they form a complete ASR pipeline from transcription to timestamp annotation.
Core Features
1. All‑in‑One Multilingual Support
The models recognize 52 languages and dialects in total: 30 international languages (e.g., English, Arabic, French, Spanish, Japanese, Korean, and Russian) and 22 Chinese dialects such as Mandarin, Cantonese (HK and Guangdong), Shanghainese, Sichuanese, and many regional varieties.
2. Extreme Performance
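RTF (real‑time factor) is processing time divided by audio duration, so lower is faster. For the 0.6B model, a single request runs at an RTF of 0.0094. At 128 concurrent requests the per‑request RTF rises to 0.064, which still means each request is processed about 15.6× faster than real time, and aggregate throughput reaches 2000 seconds of audio per second with a time to first token (TTFT) of 3210 ms. Official benchmark data (throughput in seconds of audio per wall‑clock second):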
Concurrency 1: RTF 0.0094, throughput 106, TTFT 92 ms
Concurrency 8: RTF 0.0147, throughput 543, TTFT 228 ms
Concurrency 32: RTF 0.0291, throughput 1099, TTFT 820 ms
Concurrency 128: RTF 0.0640, throughput 2000, TTFT 3210 ms
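The four rows are internally consistent: dividing the concurrency by the RTF reproduces the reported throughput to within rounding. The following is a minimal sketch of that check, assuming RTF is measured per request and throughput is seconds of audio per wall‑clock second. The function name is illustrative and not part of any Qwen tooling.

```python
# Benchmark rows as reported above: (concurrency, RTF, reported throughput).
ROWS = [
    (1, 0.0094, 106),
    (8, 0.0147, 543),
    (32, 0.0291, 1099),
    (128, 0.0640, 2000),
]


def aggregate_throughput(concurrency: int, rtf: float) -> float:
    """Seconds of audio processed per wall-clock second across all requests.

    Assumes RTF = processing time / audio duration for one request,
    so each request runs at 1 / RTF times real time.
    """
    return concurrency / rtf


for concurrency, rtf, reported in ROWS:
    estimate = aggregate_throughput(concurrency, rtf)
    print(f"concurrency={concurrency:>3}  estimated={estimate:7.1f}  reported={reported}")
```

Read this way, raising concurrency trades per‑request speed (a higher RTF and a longer TTFT) for total throughput.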
3. Singing and BGM Recognition
Both models can transcribe singing voice and songs with background music, a scenario that traditionally degrades ASR performance. The improvement is attributed to large‑scale training data and reinforcement‑learning fine‑tuning.
4. Unified Streaming + Offline Inference
A single model can run in streaming mode for low‑latency use cases (e.g., subtitles, voice assistants) or offline mode for long audio up to 20 minutes per segment.
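If a recording is longer than the 20‑minute segment limit, one option is to plan fixed‑length segments before transcription. The sketch below only computes segment boundaries. `plan_segments` is a hypothetical helper, not part of the qwen-asr package, and it cuts at fixed offsets, so a production pipeline would more likely cut at silences to avoid splitting words.

```python
MAX_SEGMENT_SECONDS = 20 * 60  # 20-minute per-segment limit stated above


def plan_segments(total_seconds, max_seconds=MAX_SEGMENT_SECONDS):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    total_seconds = float(total_seconds)
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds must be > 0")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A 50-minute recording is split into two full segments and one shorter one.
for start, end in plan_segments(50 * 60):
    print(f"{start:7.1f} s -> {end:7.1f} s")
```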
Architecture Design
Qwen3‑ASR builds on the Qwen3‑Omni foundation model and introduces the Audio Transformer (AuT) encoder:
AuT encoder downsamples FBank features by 8×, producing 12.5 Hz audio tokens.
Dynamic Flash attention window adjusts from 1 s to 8 s, supporting both streaming and offline inference.
Qwen3 LM serves as a powerful language model decoder, enabling multilingual understanding.
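To give a feel for the 12.5 Hz token rate, the sketch below estimates how many audio tokens the decoder sees for a given duration. It assumes FBank frames are produced every 10 ms (100 Hz), which the article does not state but which is consistent with 8× downsampling yielding 12.5 Hz. Rounding up is also an assumption, so treat the counts as approximate.

```python
import math

FBANK_FRAME_RATE_HZ = 100.0  # assumption: 10 ms frame shift (not stated in the article)
DOWNSAMPLE_FACTOR = 8        # from the article
TOKEN_RATE_HZ = FBANK_FRAME_RATE_HZ / DOWNSAMPLE_FACTOR  # 12.5 tokens per second


def audio_token_count(duration_seconds: float) -> int:
    """Approximate number of audio tokens for a clip of the given length."""
    return math.ceil(duration_seconds * TOKEN_RATE_HZ)


for label, seconds in [("8 s attention window", 8), ("20-minute segment", 20 * 60)]:
    print(f"{label}: ~{audio_token_count(seconds)} tokens")
```

Under these assumptions an 8 s attention window spans about 100 tokens and a full 20‑minute segment about 15,000.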
The training pipeline consists of four stages:
AuT pre‑training on ~40 million hours of pseudo‑labeled ASR data.
Omni multimodal pre‑training on 3 trillion tokens.
Supervised ASR fine‑tuning with multilingual data for style transfer.
ASR reinforcement learning (GSPO) to improve noise robustness and transcription stability.
Quick Start
Environment Installation
# Create virtual environment
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr
# Core installation (transformers backend)
pip install -U qwen-asr
# Optional vLLM backend (recommended for speed)
pip install -U "qwen-asr[vllm]"
# Install FlashAttention2 for further acceleration
pip install -U flash-attn --no-build-isolation
Basic Usage
import torch
from qwen_asr import Qwen3ASRModel
# Load the 1.7B model
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)
# Transcribe an audio file (language auto-detected)
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,
)
print(results[0].language)  # -> English
print(results[0].text)
Timestamp‑Enabled Transcription
# Load the ASR model together with the forced aligner
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs={"dtype": torch.bfloat16, "device_map": "cuda:0"},
)
# Transcribe a batch, passing one language per audio file
results = model.transcribe(
    audio=[".../asr_zh.wav", ".../asr_en.wav"],
    language=["Chinese", "English"],
    return_time_stamps=True,
)
for r in results:
    print(r.language, r.text, r.time_stamps[0])
vLLM Backend (Faster Inference)
import torch
from qwen_asr import Qwen3ASRModel

# Keep the entry point guarded, since vLLM may spawn worker processes
if __name__ == "__main__":
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs={"dtype": torch.bfloat16, "device_map": "cuda:0"},
    )
    results = model.transcribe(
        audio=["audio1.wav", "audio2.wav"],
        language=None,
        return_time_stamps=True,
    )
    for r in results:
        print(r.language, r.text)
vLLM Deployment
Deploy with a single command:
vllm serve Qwen/Qwen3-ASR-1.7B
Or use the official wrapper:
qwen-asr-serve Qwen/Qwen3-ASR-1.7B \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8000
OpenAI‑compatible API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": "https://.../asr_en.wav"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
The same endpoint also supports OpenAI’s audio.transcriptions.create API.
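A call through the transcription route might look like the sketch below. It assumes the server from the deployment step is listening on localhost:8000 and that the openai Python package is installed. `transcribe_file` and the file name sample.wav are hypothetical, and the response fields the server actually returns are not confirmed by the article.

```python
def transcribe_file(
    path: str,
    base_url: str = "http://localhost:8000/v1",
    model: str = "Qwen/Qwen3-ASR-1.7B",
) -> str:
    """Upload a local audio file to the transcription endpoint and return the text."""
    # Imported lazily so this module loads even where openai is not installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Example call (needs a running server and a real file):
# print(transcribe_file("sample.wav"))
```

Unlike the chat route, which takes a URL, this route uploads the audio bytes, which is convenient for local files.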
Web Demo
Two public demos are provided:
Gradio Demo (transformers backend)
Streaming Demo (vLLM backend with timestamp support)
Online links: HuggingFace Spaces and ModelScope studios.
Alibaba Cloud Bailian API
For users who prefer a managed service, Alibaba Cloud offers the qwen3-asr-flash-realtime API with pricing of 0.00033 CNY/s for mainland China and 0.00066 CNY/s internationally. The API supports multilingual high‑accuracy transcription, automatic language detection, non‑speech filtering, and emotion recognition.
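Per‑second prices are easier to compare once converted to an hourly figure. The sketch below uses only the two rates quoted above and assumes plain linear billing. Free quotas, minimum charges, and billing granularity are not covered by the article and are ignored here.

```python
# Prices quoted above, in CNY per second of audio.
PRICE_CNY_PER_SECOND = {
    "mainland_china": 0.00033,
    "international": 0.00066,
}


def transcription_cost(audio_seconds: float, region: str) -> float:
    """Estimated cost in CNY, assuming simple per-second billing."""
    return audio_seconds * PRICE_CNY_PER_SECOND[region]


for region in PRICE_CNY_PER_SECOND:
    hourly = transcription_cost(3600, region)
    print(f"{region}: {hourly:.3f} CNY per hour of audio")
```

At these rates one hour of audio costs roughly 1.19 CNY in mainland China and 2.38 CNY internationally.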
Comparison with Other ASR Solutions
According to the paper’s evaluation, Qwen3‑ASR outperforms both commercial APIs (e.g., GPT‑4o‑Transcribe, Gemini‑2.5‑Pro, Doubao‑ASR) and open‑source models (Whisper‑large‑v3, FunASR‑MLT‑Nano, GLM‑ASR‑Nano) on public benchmarks such as LibriSpeech and WenetSpeech.
English: Qwen3‑ASR‑1.7B achieves the best results on diverse real‑world data and remains near‑state‑of‑the‑art on standard academic tests.
Chinese: Shows a clear advantage on noisy and meeting‑style datasets like WenetSpeech.
Chinese Dialects: Maintains high accuracy on Cantonese and other dialects, especially for long‑audio scenarios.
Pros (According to the Author)
True multilingual support (52 languages and dialects, including 22 Chinese dialects).
Outstanding performance (RTF 0.064 and 2000× real‑time throughput at 128 concurrent requests).
High open‑source completeness (weights, inference framework, fine‑tuning scripts, Docker image).
Day‑0 vLLM support enables seamless integration into existing LLM services.
Ability to transcribe singing with background music.
Potential Improvements
Model size: even the smallest 0.6B model (≈0.9 billion parameters) may be too large for edge devices.
Timestamp prediction requires loading an additional ForcedAligner model, increasing deployment complexity.
Emotion recognition is only available in the cloud API; the open‑source version lacks this feature.
Resources
GitHub: https://github.com/QwenLM/Qwen3-ASR
HuggingFace collection: https://huggingface.co/collections/Qwen/qwen3-asr
ModelScope collection: https://www.modelscope.cn/collections/Qwen/Qwen3-ASR
Paper (arXiv): https://arxiv.org/abs/2601.21337
Online demos: https://huggingface.co/spaces/Qwen/Qwen3-ASR and https://modelscope.cn/studios/Qwen/Qwen3-ASR
Alibaba Cloud API docs: https://help.aliyun.com/zh/model-studio/qwen-real-time-speech-recognition
Conclusion
Qwen3‑ASR represents a major step forward for open‑source speech recognition, delivering extensive language coverage, high‑speed inference, and a full toolchain for deployment. It is a compelling choice for developers needing robust ASR, especially for Chinese dialects, and its open‑source nature invites further community‑driven optimization.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
