Qwen3-ASR: Open‑Source Speech Recognition Supporting 52 Languages and Dialects, Outperforming Whisper

The Qwen3‑ASR series, open‑sourced by Alibaba, comprises three models (1.7B, 0.6B, and a 0.6B forced aligner) that cover 52 languages and dialects, including 22 Chinese dialects. The models support streaming and offline inference, reach 2000 seconds of audio per second at an RTF of 0.064 under 128 concurrent requests, and handle singing over background music. This article also covers deployment, benchmarks, and comparisons with other ASR solutions.

Old Zhang's AI Learning

Jan 30, 2026

Overview

The Qwen team has released the Qwen3‑ASR family as open‑source speech‑recognition models. The series consists of three models: Qwen3‑ASR‑1.7B (≈2 B parameters, flagship accuracy), Qwen3‑ASR‑0.6B (≈0.9 B parameters, balanced speed and accuracy), and Qwen3‑ForcedAligner‑0.6B (≈0.9 B parameters, provides timestamp alignment). Together they form a complete ASR pipeline from transcription to timestamp annotation.

Core Features

1. All‑in‑One Multilingual Support

The models recognize 52 languages and dialects in total: 30 languages (e.g., English, Arabic, French, Spanish, Japanese, Korean, Russian) and 22 Chinese dialects, such as Cantonese (Hong Kong and Guangdong), Shanghainese, Sichuanese, and many other regional varieties.

2. Extreme Performance

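RTF (real‑time factor) is processing time divided by audio duration, so lower is faster. Under 128 concurrent requests the 0.6B model runs at an RTF of 0.064, meaning each request is transcribed at roughly 15.6 seconds of audio per second (1 / 0.064), and aggregate throughput reaches 2000 seconds of audio per second with a time to first token (TTFT) of 3210 ms. With a single request the RTF is 0.0094. Official benchmark data: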

Concurrency 1: RTF 0.0094, throughput 106 s of audio per second, TTFT 92 ms

Concurrency 8: RTF 0.0147, throughput 543 s of audio per second, TTFT 228 ms

Concurrency 32: RTF 0.0291, throughput 1099 s of audio per second, TTFT 820 ms

Concurrency 128: RTF 0.0640, throughput 2000 s of audio per second, TTFT 3210 ms
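
The throughput figures appear to follow from the other two columns: with N requests in flight, each running at the listed RTF, aggregate throughput is roughly N / RTF seconds of audio per wall‑clock second. That relationship is inferred from the numbers rather than stated in the benchmark, so treat the sketch below as a sanity check, not as the official measurement method.

```python
# Benchmark rows quoted above: (concurrency, RTF, reported throughput).
ROWS = [
    (1, 0.0094, 106),
    (8, 0.0147, 543),
    (32, 0.0291, 1099),
    (128, 0.0640, 2000),
]


def aggregate_throughput(concurrency: int, rtf: float) -> float:
    """Seconds of audio processed per wall-clock second.

    Assumes every concurrent request runs at the same real-time factor.
    """
    return concurrency / rtf


for concurrency, rtf, reported in ROWS:
    estimate = aggregate_throughput(concurrency, rtf)
    print(f"concurrency={concurrency:>3}  estimated={estimate:7.1f}  reported={reported}")
```

Each estimate lands within about one percent of the reported value, which suggests the RTF column is measured per request rather than for the whole batch.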

3. Singing and BGM Recognition

Both models can transcribe singing voice and songs with background music, a scenario that traditionally degrades ASR performance. The improvement is attributed to large‑scale training data and reinforcement‑learning fine‑tuning.

4. Unified Streaming + Offline Inference

A single model can run in streaming mode for low‑latency use cases (e.g., subtitles, voice assistants) or offline mode for long audio up to 20 minutes per segment.
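
For recordings longer than the 20‑minute segment limit, the audio has to be cut before offline transcription. The sketch below only plans the cut points; `plan_segments` is a hypothetical helper, not part of the qwen-asr package, and it cuts at fixed offsets, whereas a real pipeline would more likely cut at silences so that words are not split.

```python
MAX_SEGMENT_S = 20 * 60  # 20-minute per-segment limit quoted above


def plan_segments(total_s: float, max_s: float = MAX_SEGMENT_S) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover [0, total_s]
    with no segment longer than max_s."""
    total_s = float(total_s)
    if total_s <= 0:
        return []
    segments = []
    start = 0.0
    while start < total_s:
        end = min(start + max_s, total_s)
        segments.append((start, end))
        start = end
    return segments


# A 50-minute recording becomes two full segments and one 10-minute remainder.
for start, end in plan_segments(50 * 60):
    print(f"{start:7.1f} s -> {end:7.1f} s")
```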

Architecture Design

Qwen3‑ASR builds on the Qwen3‑Omni foundation model and introduces the Audio Transformer (AuT) encoder:

AuT encoder downsamples FBank features by 8×, producing 12.5 Hz audio tokens.

A dynamic FlashAttention window, adjustable from 1 s to 8 s, lets the same encoder serve both streaming and offline inference.

Qwen3 LM serves as a powerful language model decoder, enabling multilingual understanding.
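
The 12.5 Hz figure is consistent with 8× downsampling if the FBank features are computed with a 10 ms frame shift (100 frames per second). That frame shift is an assumption here, since the article does not state it. Under that assumption, the arithmetic for token counts looks like this:

```python
FBANK_FRAME_RATE_HZ = 100.0  # assumption: 10 ms frame shift, not stated in the article
DOWNSAMPLE_FACTOR = 8        # from the AuT encoder description

TOKEN_RATE_HZ = FBANK_FRAME_RATE_HZ / DOWNSAMPLE_FACTOR


def audio_tokens(duration_s: float) -> int:
    """Approximate number of audio tokens the encoder emits for a clip."""
    return int(duration_s * TOKEN_RATE_HZ)


print("token rate (Hz):", TOKEN_RATE_HZ)
print("tokens in an 8 s attention window:", audio_tokens(8))
print("tokens in a 20-minute segment:", audio_tokens(20 * 60))
```

The low token rate is what keeps a 20‑minute segment within a manageable sequence length for the language model decoder.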

The training pipeline consists of four stages:

AuT pre‑training on ~40 million hours of pseudo‑labeled ASR data.

Omni multimodal pre‑training on 3 trillion tokens.

Supervised ASR fine‑tuning with multilingual data for style transfer.

ASR reinforcement learning (GSPO) to improve noise robustness and transcription stability.

Quick Start

Environment Installation

# Create virtual environment
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr

# Core installation (transformers backend)
pip install -U qwen-asr

# Optional vLLM backend (recommended for speed)
# Quote the extra so shells such as zsh do not treat the brackets as a glob
pip install -U "qwen-asr[vllm]"

# Install FlashAttention2 for further acceleration
pip install -U flash-attn --no-build-isolation

Basic Usage

import torch
from qwen_asr import Qwen3ASRModel

# Load the 1.7B model
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# Transcribe an audio file (language auto‑detected)
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,
)
print(results[0].language)  # -> English
print(results[0].text)

Timestamp‑Enabled Transcription

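import torch
from qwen_asr import Qwen3ASRModel

# Load the ASR model together with the forced aligner used for timestamps
model = Qwen3ASRModel.from_pretrained(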
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs={"dtype": torch.bfloat16, "device_map": "cuda:0"},
)

results = model.transcribe(
    audio=[".../asr_zh.wav", ".../asr_en.wav"],
    language=["Chinese", "English"],
    return_time_stamps=True,
)
for r in results:
    print(r.language, r.text, r.time_stamps[0])
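
A common next step is turning timestamps into subtitles. The sketch below formats SRT cues from a list of spans. The `Span` dataclass is a hypothetical shape (text, start and end in seconds) introduced for illustration; the article does not show what the items in `r.time_stamps` look like, so map them onto this shape yourself.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical shape, not the qwen-asr return type.
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timecode, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(spans: list[Span]) -> str:
    """Render spans as numbered SRT cues separated by blank lines."""
    cues = []
    for index, span in enumerate(spans, start=1):
        cues.append(f"{index}\n{srt_time(span.start)} --> {srt_time(span.end)}\n{span.text}\n")
    return "\n".join(cues)


print(to_srt([Span("Hello there.", 0.0, 1.25), Span("Welcome back.", 1.5, 3.0)]))
```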

vLLM Backend (Faster Inference)

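import torch
from qwen_asr import Qwen3ASRModel

# Keep the entry point under a __main__ guard: vLLM may start worker
# processes that re-import this module.
if __name__ == "__main__":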
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs={"dtype": torch.bfloat16, "device_map": "cuda:0"},
    )
    results = model.transcribe(audio=["audio1.wav", "audio2.wav"], language=None, return_time_stamps=True)
    for r in results:
        print(r.language, r.text)

vLLM Deployment

Deploy with a single command:

vllm serve Qwen/Qwen3-ASR-1.7B

Or use the official wrapper:

qwen-asr-serve Qwen/Qwen3-ASR-1.7B \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8000

OpenAI‑compatible API

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[{"role": "user", "content": [{"type": "audio_url", "audio_url": {"url": "https://.../asr_en.wav"}}]}],
)
print(response.choices[0].message.content)

The same endpoint also supports OpenAI’s audio.transcriptions.create API.

Web Demo

Two public demos are provided:

Gradio Demo (transformers backend)

Streaming Demo (vLLM backend with timestamp support)

Online links: HuggingFace Spaces and ModelScope studios.

Alibaba Cloud Bailian API

For users who prefer a managed service, Alibaba Cloud offers the qwen3-asr-flash-realtime API with pricing of 0.00033 CNY/s for mainland China and 0.00066 CNY/s internationally. The API supports multilingual high‑accuracy transcription, automatic language detection, non‑speech filtering, and emotion recognition.
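
Per‑second prices are hard to compare at a glance, so here is the same pricing expressed per hour of audio. The rates come from the paragraph above; the region keys are labels chosen for this sketch, and the calculation assumes billing is strictly proportional to audio duration, with no minimum charge or rounding.

```python
# Per-second prices in CNY, as quoted above.
PRICE_CNY_PER_S = {"mainland": 0.00033, "international": 0.00066}


def cost_cny(duration_s: float, region: str = "mainland") -> float:
    """Estimated cost in CNY, assuming purely duration-based billing."""
    return duration_s * PRICE_CNY_PER_S[region]


for region in PRICE_CNY_PER_S:
    print(f"{region}: {cost_cny(3600, region):.3f} CNY per hour of audio")
```

Under these assumptions an hour of audio costs about 1.19 CNY in mainland China and about 2.38 CNY internationally.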

Comparison with Other ASR Solutions

According to the paper’s evaluation, Qwen3‑ASR outperforms both commercial APIs (e.g., GPT‑4o‑Transcribe, Gemini‑2.5‑Pro, Doubao‑ASR) and open‑source models (Whisper‑large‑v3, FunASR‑MLT‑Nano, GLM‑ASR‑Nano) on public benchmarks such as LibriSpeech and WenetSpeech.

English: Qwen3‑ASR‑1.7B achieves the best results on diverse real‑world data and remains near‑state‑of‑the‑art on standard academic tests.

Chinese: Shows a clear advantage on noisy and meeting‑style datasets like WenetSpeech.

Chinese dialects: Maintains high accuracy on Cantonese and other dialects, especially for long‑audio scenarios.

Pros (According to the Author)

True multilingual support (52 languages and dialects, including 22 Chinese dialects).

Outstanding performance (RTF 0.064, 2000× realtime throughput).

High open‑source completeness (weights, inference framework, fine‑tuning scripts, Docker image).

Day‑0 vLLM support enables seamless integration into existing LLM services.

Ability to transcribe singing with background music.

Potential Improvements

Model size: even the smallest 0.6B model (≈0.9 B parameters) may be too large for edge devices.

Timestamp prediction requires loading an additional ForcedAligner model, increasing deployment complexity.

Emotion recognition is only available in the cloud API; the open‑source version lacks this feature.

Resources

GitHub: https://github.com/QwenLM/Qwen3-ASR

HuggingFace collection: https://huggingface.co/collections/Qwen/qwen3-asr

ModelScope collection: https://www.modelscope.cn/collections/Qwen/Qwen3-ASR

Paper (arXiv): https://arxiv.org/abs/2601.21337

Online demos: https://huggingface.co/spaces/Qwen/Qwen3-ASR and https://modelscope.cn/studios/Qwen/Qwen3-ASR

Alibaba Cloud API docs: https://help.aliyun.com/zh/model-studio/qwen-real-time-speech-recognition

Conclusion

Qwen3‑ASR represents a major step forward for open‑source speech recognition, delivering extensive language coverage, high‑speed inference, and a full toolchain for deployment. It is a compelling choice for developers needing robust ASR, especially for Chinese dialects, and its open‑source nature invites further community‑driven optimization.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
