MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed

MiniCPM‑o 4.5 is presented as the first end‑to‑end full‑duplex multimodal model at the 9‑billion‑parameter scale. Built on the Omni‑Flow framework, it runs on a single consumer‑grade GPU with 12 GB of memory and posts benchmark results close to or ahead of Gemini 2.5 Flash. Open‑source demos, APIs, and a Windows/macOS installer are available.

PaperAgent

MiniCPM‑o 4.5 – End‑to‑end 9 B full‑duplex multimodal model

MiniCPM‑o 4.5 implements the Omni‑Flow streaming multimodal framework, which aligns visual, audio, and textual streams on a millisecond‑level timeline. This enables continuous perception, reasoning, and response without an external voice‑activity‑detection (VAD) module.

Architecture (9 B parameters)

Vision encoder (0.4 B): SigLIP‑ViT.

Audio encoder (0.3 B): Whisper‑Medium.

LLM base (8 B): Qwen3‑8B.

Voice token decoder (0.3 B): a lightweight Llama model that converts text tokens to speech units.

Vocoder: synthesizes final waveform.

The LLM generates only textual tokens; a dedicated voice decoder handles speech synthesis, preserving language and reasoning capacity.
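As a quick sanity check, the component sizes listed above can be added up to confirm the 9 B headline figure. This is only arithmetic on the numbers in this article; the dictionary keys are illustrative labels, and the vocoder is left out because no size is given for it.

```python
# Component sizes in billions of parameters, copied from the list above.
# The vocoder is omitted because the article gives no size for it.
components_b = {
    "vision_encoder (SigLIP-ViT)": 0.4,
    "audio_encoder (Whisper-Medium)": 0.3,
    "llm_base (Qwen3-8B)": 8.0,
    "voice_token_decoder (Llama)": 0.3,
}

total_b = round(sum(components_b.values()), 1)

# Share of the parameter budget held by each component, in percent.
shares = {name: round(100 * size / total_b, 1) for name, size in components_b.items()}

if __name__ == "__main__":
    print(f"total: {total_b} B")
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
```

The LLM base accounts for roughly 89% of the total, which is consistent with the design goal of keeping language and reasoning in the LLM and delegating speech to a small decoder.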

TAIL – Time‑Aligned Interleaving voice generation

TAIL synchronizes each speech chunk with its corresponding text chunk, avoiding large pre‑read buffers. A lightweight “pre‑look” mechanism ensures cross‑word continuity, achieving low‑delay, natural‑sounding speech.
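The interleaving idea can be illustrated with a toy sketch. It is not the TAIL implementation: the chunk size, the lookahead length, the `Event` type, and the `fake_speech_units` stand-in for the voice decoder are all assumptions made for illustration. It only shows the scheduling pattern described above, where each text chunk is followed at once by its own speech chunk and the decoder sees a small peek at the next chunk.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Event:
    kind: str                      # "text" or "speech"
    payload: List[str]
    lookahead: List[str] = field(default_factory=list)


def split_chunks(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def fake_speech_units(text_chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for the voice decoder: one unit per text token."""
    return [f"<unit:{tok}>" for tok in text_chunk]


def interleave(tokens: List[str], chunk_size: int = 2, lookahead: int = 1) -> Iterator[Event]:
    """Emit text and speech chunks alternately instead of buffering the whole reply."""
    parts = split_chunks(tokens, chunk_size)
    for i, part in enumerate(parts):
        # Small peek into the next chunk, standing in for the "pre-look" context.
        peek = parts[i + 1][:lookahead] if i + 1 < len(parts) else []
        yield Event("text", part)
        yield Event("speech", fake_speech_units(part), lookahead=peek)


if __name__ == "__main__":
    for event in interleave("the quick brown fox jumps".split()):
        print(event)
```

Because speech for a chunk is emitted as soon as that chunk's text exists, the first audio can start after one chunk rather than after the full sentence, which is the latency benefit the section describes.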

Inference efficiency

The INT4‑quantized model runs in 11 GB of GPU memory and reaches 212 tokens/s, more than 40 % faster than Qwen3‑Omni, with lower response latency.
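A back‑of‑the‑envelope calculation helps put these figures in context. The sketch below assumes, for simplicity, that all 9 B parameters are stored at 4 bits; the article does not say which components are quantized, so the weight estimate is a lower bound on the real footprint rather than a measured value.

```python
PARAMS_B = 9.0          # total parameters in billions (from the architecture section)
BITS_PER_WEIGHT = 4     # INT4; assumes every weight is quantized, which is not confirmed
REPORTED_MEMORY_GB = 11.0
TOKENS_PER_S = 212.0
SPEEDUP_OVER_QWEN3_OMNI = 1.40  # "more than 40% faster" taken as a lower bound

# 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to this.
weight_gb = PARAMS_B * BITS_PER_WEIGHT / 8

# Memory left for activations, KV cache, encoders' working buffers, and runtime overhead.
non_weight_gb = REPORTED_MEMORY_GB - weight_gb

# Average time per generated token at the reported throughput.
ms_per_token = 1000.0 / TOKENS_PER_S

# If 212 tokens/s is at least 40% faster, the baseline is at most this fast.
baseline_upper_bound = TOKENS_PER_S / SPEEDUP_OVER_QWEN3_OMNI

if __name__ == "__main__":
    print(f"weights at INT4: {weight_gb:.1f} GB")
    print(f"remaining budget: {non_weight_gb:.1f} GB")
    print(f"time per token: {ms_per_token:.2f} ms")
    print(f"implied baseline: at most {baseline_upper_bound:.1f} tokens/s")
```

Under that assumption the weights take about 4.5 GB, leaving about 6.5 GB of the reported 11 GB for everything else, and the implied Qwen3‑Omni throughput is at most roughly 151 tokens/s.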

Visual benchmark results

OpenCompass: 77.6 (MiniCPM‑o 4.5) vs 78.5 (Gemini 2.5 Flash) vs 75.7 (Qwen3‑Omni‑30B‑A3B).

MMBench EN v1.1: 87.6 vs 86.6 vs 84.9.

MathVista: 80.1 vs 75.3 vs 75.9.

HallusionBench: 63.2 vs 59.1 vs 59.7.

Full‑duplex multimodal benchmarks

Daily‑Omni: 80.2 (MiniCPM‑o 4.5) vs 79.3 (Gemini 2.5 Flash) vs 70.7 (Qwen3‑Omni).

Video‑Holmes: 64.29 vs 51.3 vs 50.4.

LiveSports‑3K‑CC win rate: 54.4 % (MiniCPM‑o 4.5); no results are reported for the competing models.

Speech quality

Character error rate (CER, lower is better): 0.86 (MiniCPM‑o 4.5) vs 1.45 (CosyVoice2) vs 1.41 (Qwen3‑Omni).

Word error rate (WER, lower is better): 2.38 vs 2.57 vs 3.39.

Emotion score (Expresso, higher is better): 29.8 (MiniCPM‑o 4.5) vs 17.9 (CosyVoice2).
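For readers unfamiliar with these metrics, CER and WER are both edit‑distance ratios: the number of substitutions, insertions, and deletions needed to turn the transcript of the generated speech into the reference text, divided by the reference length. The sketch below shows the standard definition only; the article does not state which ASR system or text normalization was used for the reported scores, so this is not a reproduction of that evaluation.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("abcd", "abed"))
```

Scores such as those above are conventionally reported as percentages, so a function result of 0.0238 would be written as 2.38.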

Key components of Omni‑Flow

Omni‑Flow creates a shared timeline that slices visual, audio, and language streams into millisecond‑level slots. In each slot the model performs a perception‑reasoning‑response cycle, enabling natural interruptions and eliminating reliance on external VAD.
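The slot‑based loop can be pictured with a small simulation. Everything here is hypothetical: the 100 ms slot length, the `Event` fields, and especially `toy_policy`, which stands in for the model's own per‑slot decision and is not how MiniCPM‑o 4.5 decides. The sketch only illustrates how bucketing all modalities onto one timeline lets a listen or speak decision be made in every slot, including an interruption mid‑reply.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    t_ms: int        # timestamp on the shared timeline
    modality: str    # "video", "audio", or "text"
    payload: str


def slice_into_slots(events: List[Event], slot_ms: int) -> Dict[int, List[Event]]:
    """Bucket events from all modalities into fixed-length time slots."""
    slots: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        slots[event.t_ms // slot_ms].append(event)
    return slots


def toy_policy(batch: List[Event], pending_reply: bool) -> str:
    """Placeholder for the model's per-slot decision (an assumption, not the real model)."""
    user_speaking = any(
        e.modality == "audio" and e.payload == "user_speech" for e in batch
    )
    if user_speaking:
        return "listen"          # yield the floor, even in the middle of a reply
    return "speak" if pending_reply else "idle"


def run(events: List[Event], slot_ms: int = 100) -> List[str]:
    """Run one perceive-decide-respond cycle per slot and return the action log."""
    slots = slice_into_slots(events, slot_ms)
    pending_reply = False
    actions: List[str] = []
    for idx in range(max(slots) + 1):
        action = toy_policy(slots.get(idx, []), pending_reply)
        if action == "listen":
            pending_reply = True  # something was said that deserves a reply
        actions.append(action)
    return actions


if __name__ == "__main__":
    timeline = [
        Event(20, "audio", "user_speech"),
        Event(130, "audio", "user_speech"),
        Event(250, "video", "frame"),
        Event(420, "audio", "user_speech"),  # user interrupts the reply
    ]
    print(run(timeline))
```

In this toy run the system listens for two slots, speaks for two, and returns to listening when the user talks again in the fifth slot, with no separate VAD stage gating the turns.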

Open resources

Technical report PDF: https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf

Demo repository (includes local installer): https://github.com/OpenBMB/MiniCPM-o-Demo

Model download (Hugging Face): https://huggingface.co/openbmb/MiniCPM-o-4_5

Model download (ModelScope): https://www.modelscope.cn/models/OpenBMB/MiniCPM-o-4_5
