MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed

MiniCPM‑o 4.5 is presented as the world’s first end‑to‑end, full‑duplex multimodal model at the 9‑billion‑parameter scale. Built on the Omni‑Flow framework, it runs on a single consumer‑grade GPU with 12 GB of memory and delivers benchmark results that match or surpass Gemini 2.5 Flash. It ships with open‑source demos, APIs, and a Windows/macOS installer.

PaperAgent

Apr 28, 2026

MiniCPM‑o 4.5 – End‑to‑end 9 B full‑duplex multimodal model

MiniCPM‑o 4.5 implements the Omni‑Flow streaming multimodal framework, which aligns visual, audio, and textual streams on a shared millisecond‑level timeline. This lets the model perceive, reason, and respond continuously without an external voice‑activity‑detection (VAD) module.

Architecture (9 B parameters)

Vision encoder (0.4 B): SigLIP‑ViT.

Audio encoder (0.3 B): Whisper‑Medium.

LLM base (8 B): Qwen3‑8B.

Voice token decoder (0.3 B): lightweight Llama converting text tokens to speech units.

Vocoder: synthesizes final waveform.

The LLM generates only textual tokens; a dedicated voice decoder handles speech synthesis, preserving language and reasoning capacity.
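As a quick sanity check on the headline figure, the component sizes listed above can be summed. The sketch below only restates the numbers from this article; the dictionary keys are illustrative labels, not identifiers from the MiniCPM‑o codebase, and the vocoder is left out because no parameter count is given for it.

```python
# Parameter counts in billions, copied from the component list in this article.
# Keys are illustrative labels only, not names used by the model's code.
COMPONENTS_B = {
    "vision_encoder (SigLIP-ViT)": 0.4,
    "audio_encoder (Whisper-Medium)": 0.3,
    "llm_base (Qwen3-8B)": 8.0,
    "voice_token_decoder (lightweight Llama)": 0.3,
}


def total_params_b(components: dict) -> float:
    """Sum the component sizes, rounded to one decimal to hide float noise."""
    return round(sum(components.values()), 1)


def share_of_total(components: dict) -> dict:
    """Return each component's share of the total, in percent."""
    total = sum(components.values())
    return {name: round(100 * size / total, 1) for name, size in components.items()}


if __name__ == "__main__":
    print("total (B):", total_params_b(COMPONENTS_B))
    for name, pct in share_of_total(COMPONENTS_B).items():
        print(f"{name}: {pct}%")
```

The listed parts add up to the advertised 9 B, with the language model accounting for close to nine tenths of the total.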

TAIL – Time‑Aligned Interleaving voice generation

TAIL synchronizes each speech chunk with its corresponding text chunk, avoiding large pre‑read buffers. A lightweight “pre‑look” mechanism ensures cross‑word continuity, achieving low‑delay, natural‑sounding speech.
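The interleaving idea can be illustrated with a toy generator. This is a minimal sketch under loud assumptions, not the TAIL algorithm itself: the article does not specify chunk sizes or how the “pre‑look” works, so here it is modeled as holding back exactly one text chunk, and `fake_speech_units` is a hypothetical stand‑in for the real voice token decoder.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def fake_speech_units(text_chunk: str, lookahead: Optional[str]) -> List[str]:
    """Hypothetical stand-in for a speech-token decoder.

    Emits one placeholder unit per word, plus a "<link>" marker when a
    following chunk is known, to show where look-ahead context is available.
    """
    units = [f"<spk:{word}>" for word in text_chunk.split()]
    if lookahead is not None:
        units.append("<link>")
    return units


def interleave(text_chunks: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (text_chunk, speech_units) pairs while buffering at most one chunk."""
    previous: Optional[str] = None
    for chunk in text_chunks:
        if previous is not None:
            # The newly arrived chunk serves as look-ahead for the held-back one.
            yield previous, fake_speech_units(previous, lookahead=chunk)
        previous = chunk
    if previous is not None:
        # The final chunk has nothing after it to look at.
        yield previous, fake_speech_units(previous, lookahead=None)


if __name__ == "__main__":
    for text, speech in interleave(["hello there", "how are", "you"]):
        print(text, "->", speech)
```

The point of the sketch is the buffering bound: speech for a chunk can be emitted as soon as the next chunk exists, so the delay never grows beyond one chunk regardless of how long the reply is.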

Inference efficiency

The INT4‑quantized model runs in 11 GB of GPU memory and reaches 212 tokens/s, more than 40% faster than Qwen3‑Omni, with lower response latency.
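A back‑of‑the‑envelope calculation helps put these figures in context. The sketch below assumes every one of the 9 B parameters is stored at exactly 4 bits, which real quantization schemes only approximate (scales, zero points, and some layers are usually kept at higher precision), and it reads “more than 40% faster” as a throughput ratio. The article does not break down what occupies the rest of the 11 GB.

```python
def int4_weight_gb(n_params: float, bits_per_weight: int = 4) -> float:
    """Idealized weight storage in GB (10^9 bytes), ignoring quantization overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def implied_baseline_ceiling(tokens_per_second: float, speedup: float) -> float:
    """Upper bound on the baseline throughput if ours is more than `speedup` faster."""
    return tokens_per_second / (1.0 + speedup)


if __name__ == "__main__":
    print("idealized INT4 weights (GB):", int4_weight_gb(9e9))
    print("ms per token at 212 tok/s:", round(ms_per_token(212), 2))
    print("implied Qwen3-Omni ceiling (tok/s):",
          round(implied_baseline_ceiling(212, 0.40), 1))
```

Under these assumptions the weights alone would take about 4.5 GB, each token takes under 5 ms on average, and the stated speedup implies a Qwen3‑Omni throughput below roughly 151 tokens/s.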

Visual benchmark results

OpenCompass: 77.6 (MiniCPM‑o 4.5) vs 78.5 (Gemini 2.5 Flash) vs 75.7 (Qwen3‑Omni‑30B‑A3B).

MMBench EN v1.1: 87.6 vs 86.6 vs 84.9.

MathVista: 80.1 vs 75.3 vs 75.9.

HallusionBench: 63.2 vs 59.1 vs 59.7.
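To see where the “match or surpass” claim holds, the margins can be computed directly from the four scores above. The sketch uses only numbers quoted in this article; the tuple order (MiniCPM‑o 4.5, Gemini 2.5 Flash, Qwen3‑Omni‑30B‑A3B) follows the OpenCompass line and is assumed to apply to the other three.

```python
# (MiniCPM-o 4.5, Gemini 2.5 Flash, Qwen3-Omni-30B-A3B), as quoted in this article.
VISUAL_SCORES = {
    "OpenCompass": (77.6, 78.5, 75.7),
    "MMBench EN v1.1": (87.6, 86.6, 84.9),
    "MathVista": (80.1, 75.3, 75.9),
    "HallusionBench": (63.2, 59.1, 59.7),
}


def margins(scores: dict) -> dict:
    """Return MiniCPM-o's margin over (Gemini, Qwen3-Omni) for each benchmark."""
    return {
        name: (round(mini - gemini, 1), round(mini - qwen, 1))
        for name, (mini, gemini, qwen) in scores.items()
    }


if __name__ == "__main__":
    for name, (vs_gemini, vs_qwen) in margins(VISUAL_SCORES).items():
        print(f"{name}: {vs_gemini:+.1f} vs Gemini, {vs_qwen:+.1f} vs Qwen3-Omni")
```

On these figures MiniCPM‑o 4.5 trails Gemini 2.5 Flash by 0.9 points on OpenCompass and leads it on the other three benchmarks, while leading Qwen3‑Omni on all four.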

Full‑duplex multimodal benchmarks

Daily‑Omni: 80.2 (MiniCPM‑o 4.5) vs 79.3 (Gemini 2.5 Flash) vs 70.7 (Qwen3‑Omni).

Video‑Holmes: 64.29 vs 51.3 vs 50.4.

LiveSports‑3K‑CC win‑rate: 54.4 % (MiniCPM‑o 4.5); no results are reported for the competing models.

Speech quality

Character error rate (CER, lower is better): 0.86 (MiniCPM‑o 4.5) vs 1.45 (CosyVoice2) vs 1.41 (Qwen3‑Omni).

Word error rate (WER): 2.38 vs 2.57 vs 3.39.

Emotion score (Expresso): 29.8 vs 17.9 (CosyVoice2).
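Because error rates are “lower is better”, a relative reduction is often easier to read than the raw gap. The sketch below computes it from the CER and WER figures above; it assumes the WER line lists the systems in the same order as the CER line (MiniCPM‑o 4.5, CosyVoice2, Qwen3‑Omni), which the article does not state explicitly.

```python
# (MiniCPM-o 4.5, CosyVoice2, Qwen3-Omni); the order for WER is an assumption.
ERROR_RATES = {
    "CER": (0.86, 1.45, 1.41),
    "WER": (2.38, 2.57, 3.39),
}


def relative_reduction_pct(ours: float, baseline: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return round(100 * (baseline - ours) / baseline, 1)


def reductions(rates: dict) -> dict:
    """Relative error reduction versus (CosyVoice2, Qwen3-Omni) per metric."""
    return {
        metric: (relative_reduction_pct(ours, cosy), relative_reduction_pct(ours, qwen))
        for metric, (ours, cosy, qwen) in rates.items()
    }


if __name__ == "__main__":
    for metric, (vs_cosy, vs_qwen) in reductions(ERROR_RATES).items():
        print(f"{metric}: {vs_cosy}% lower than CosyVoice2, {vs_qwen}% lower than Qwen3-Omni")
```

Read this way, the CER advantage is large against both systems (around 40%), while the WER advantage over CosyVoice2 is modest (about 7%).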

Key components of Omni‑Flow

Omni‑Flow creates a shared timeline that slices visual, audio, and language streams into millisecond‑level slots. In each slot the model performs a perception‑reasoning‑response cycle, enabling natural interruptions and eliminating reliance on external VAD.
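The slot‑based cycle can be sketched as a small simulation. Everything here is hypothetical: the article says only that slots are “millisecond‑level”, so the 100 ms width is an arbitrary choice, the stream names are made up, and the hand‑written rules stand in for decisions that the real model would make itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

SLOT_MS = 100  # hypothetical slot width, not taken from the article


@dataclass(frozen=True)
class Event:
    t_ms: int    # timestamp on the shared timeline
    stream: str  # illustrative stream name, e.g. "user_audio" or "video"


def slice_into_slots(events: Iterable[Event], slot_ms: int = SLOT_MS) -> Dict[int, List[Event]]:
    """Group events from all streams by the time slot they fall into."""
    slots: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        slots[event.t_ms // slot_ms].append(event)
    return dict(slots)


def run_slots(events: Iterable[Event], n_slots: int, slot_ms: int = SLOT_MS) -> List[str]:
    """Toy perception-reasoning-response loop with one decision per slot."""
    slots = slice_into_slots(events, slot_ms)
    speaking = False
    heard_user = False
    actions = []
    for index in range(n_slots):
        streams = {event.stream for event in slots.get(index, [])}
        if "user_audio" in streams:
            # The user talks in this slot: stop speaking if we were mid-reply.
            actions.append("yield_turn" if speaking else "listen")
            speaking, heard_user = False, True
        elif heard_user or speaking:
            # The user is silent: start a reply or continue the current one.
            actions.append("speak")
            speaking, heard_user = True, False
        else:
            actions.append("idle")
    return actions


if __name__ == "__main__":
    demo = [Event(20, "user_audio"), Event(50, "video"),
            Event(130, "user_audio"), Event(410, "user_audio")]
    print(run_slots(demo, n_slots=6))
```

Because a decision is taken in every slot, the user speaking again in the fifth slot interrupts the reply immediately, with no separate detector deciding when a turn has ended.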

Open resources

Technical report PDF: https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf

Demo repository (includes local installer): https://github.com/OpenBMB/MiniCPM-o-Demo

Model download (Hugging Face): https://huggingface.co/openbmb/MiniCPM-o-4_5

Model download (ModelScope): https://www.modelscope.cn/models/OpenBMB/MiniCPM-o-4_5
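For readers who prefer to fetch the weights programmatically, one option is the `huggingface_hub` package. The sketch below derives the repository id from the Hugging Face URL above and wraps `snapshot_download`; the helper names are illustrative, and the model's own loading code and hardware requirements are documented in the repositories linked above, not here.

```python
from typing import Optional
from urllib.parse import urlparse

MODEL_URL = "https://huggingface.co/openbmb/MiniCPM-o-4_5"  # from this article


def repo_id_from_url(url: str) -> str:
    """Extract the "owner/name" repository id from a Hugging Face model URL."""
    return urlparse(url).path.strip("/")


def download_model(local_dir: Optional[str] = None) -> str:
    """Download every file in the model repository and return the local path.

    Requires `pip install huggingface_hub`. The import is kept inside the
    function so this module can be loaded without the package installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_url(MODEL_URL), local_dir=local_dir)


if __name__ == "__main__":
    print("repository id:", repo_id_from_url(MODEL_URL))
    # Uncomment to start the (multi-gigabyte) download:
    # print(download_model())
```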

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
