MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed
MiniCPM‑o 4.5 is presented as the world’s first end‑to‑end full‑duplex multimodal model at the 9‑billion‑parameter scale. Built on the Omni‑Flow framework, it runs on a single consumer‑grade GPU with 12 GB of memory and reports benchmark results that are close to or above Gemini 2.5 Flash. Open‑source demos, APIs, and a Windows/macOS installer are available.
MiniCPM‑o 4.5 – End‑to‑end 9 B full‑duplex multimodal model
MiniCPM‑o 4.5 implements the Omni‑Flow streaming multimodal framework, which aligns visual, audio, and textual streams on a millisecond‑level timeline. This enables continuous perception, reasoning, and response without an external voice‑activity‑detection (VAD) module.
Architecture (9 B parameters)
Vision encoder (0.4 B): SigLIP‑ViT.
Audio encoder (0.3 B): Whisper‑Medium.
LLM base (8 B): Qwen3‑8B.
Voice token decoder (0.3 B): lightweight Llama converting text tokens to speech units.
Vocoder: synthesizes final waveform.
The LLM generates only text tokens, and a dedicated voice decoder handles speech synthesis, which preserves the LLM's language and reasoning capacity.
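As a quick sanity check, the component sizes listed above can be added up to confirm the 9 B total. The sketch below uses only the figures from this article; the dictionary keys and function names are illustrative, and the vocoder is left out because its size is not given.

```python
# Parameter counts in millions, copied from the component list above.
# The vocoder's size is not stated in the article, so it is omitted.
COMPONENTS_M = {
    "vision_encoder (SigLIP-ViT)": 400,
    "audio_encoder (Whisper-Medium)": 300,
    "llm_base (Qwen3-8B)": 8000,
    "voice_token_decoder": 300,
}


def total_params_m(components):
    """Sum the per-component parameter counts (in millions)."""
    return sum(components.values())


def shares(components):
    """Return each component's fraction of the total parameter budget."""
    total = total_params_m(components)
    return {name: count / total for name, count in components.items()}


if __name__ == "__main__":
    print(f"total: {total_params_m(COMPONENTS_M) / 1000:.1f} B")
    for name, frac in shares(COMPONENTS_M).items():
        print(f"{name}: {frac:.1%}")
```

The listed components account for the full 9 B, with the LLM base taking roughly eight ninths of the budget.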
TAIL – Time‑Aligned Interleaving voice generation
TAIL synchronizes each speech chunk with its corresponding text chunk, avoiding large pre‑read buffers. A lightweight “pre‑look” mechanism ensures cross‑word continuity, achieving low‑delay, natural‑sounding speech.
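The chunk‑by‑chunk interleaving described above can be illustrated with a toy generator. This is a minimal sketch under stated assumptions, not the model's actual implementation: the chunk size, the one‑token pre‑look, the `Step` record, and the `fake_speech_units` stand‑in for the voice token decoder are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Step:
    text: List[str]       # text chunk produced by the LLM
    lookahead: List[str]  # "pre-look" tokens from the next chunk (context only)
    speech: List[str]     # speech units aligned with this text chunk


def fake_speech_units(text_chunk: List[str], lookahead: List[str]) -> List[str]:
    """Hypothetical stand-in for the voice token decoder.

    Emits one placeholder unit per text token. The lookahead tokens are
    passed as context only and are not voiced in this chunk.
    """
    return [f"<unit:{tok}>" for tok in text_chunk]


def tail_stream(text_tokens: List[str], chunk_size: int = 2,
                prelook: int = 1) -> Iterator[Step]:
    """Yield one Step per text chunk, pairing it with its speech chunk."""
    for start in range(0, len(text_tokens), chunk_size):
        end = start + chunk_size
        text_chunk = text_tokens[start:end]
        lookahead = text_tokens[end:end + prelook]  # empty for the last chunk
        yield Step(text_chunk, lookahead,
                   fake_speech_units(text_chunk, lookahead))


if __name__ == "__main__":
    for step in tail_stream(["The", "cat", "sat", "on", "the", "mat"]):
        print(step)
```

In this toy version, speech for a chunk can start once `chunk_size + prelook` text tokens exist, rather than after the whole sentence has been generated, which is the buffering trade‑off the paragraph describes.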
Inference efficiency
The INT4‑quantized model runs in 11 GB of GPU memory and reaches 212 tokens/s, more than 40% faster than Qwen3‑Omni, with lower response latency.
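Some back‑of‑the‑envelope arithmetic helps put these figures in context. The sketch below assumes every one of the 9 B parameters is stored at 4 bits (the article does not say which parts are quantized) and reads "more than 40% faster" as a throughput ratio above 1.4; both readings are assumptions.

```python
PARAMS = 9e9              # total parameters, from the article
BITS_PER_WEIGHT = 4       # INT4; assumes all weights are quantized (not stated)
REPORTED_MEMORY_GB = 11   # reported GPU memory use, from the article
TOKENS_PER_S = 212        # reported throughput, from the article
MIN_SPEEDUP = 1.40        # ">40% faster", read here as a throughput ratio


def weight_memory_gb(params: float, bits: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


weights_gb = weight_memory_gb(PARAMS, BITS_PER_WEIGHT)
# Whatever is left covers activations, KV cache, quantization scales, etc.
headroom_gb = REPORTED_MEMORY_GB - weights_gb
ms_per_token = 1000 / TOKENS_PER_S
# Upper bound on the baseline throughput implied by the speedup claim.
baseline_upper_tps = TOKENS_PER_S / MIN_SPEEDUP

if __name__ == "__main__":
    print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
    print(f"time per token: {ms_per_token:.2f} ms")
    print(f"implied baseline: below {baseline_upper_tps:.1f} tokens/s")
```

Under these assumptions the weights take about 4.5 GB, leaving about 6.5 GB of the reported footprint for everything else, and each token takes a little under 5 ms on average.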
Visual benchmark results
OpenCompass: 77.6 (MiniCPM‑o 4.5) vs 78.5 (Gemini 2.5 Flash) vs 75.7 (Qwen3‑Omni‑30B‑A3B).
MMBench EN v1.1: 87.6 vs 86.6 vs 84.9.
MathVista: 80.1 vs 75.3 vs 75.9.
HallusionBench: 63.2 vs 59.1 vs 59.7.
Full‑duplex multimodal benchmarks
Daily‑Omni: 80.2 (MiniCPM‑o 4.5) vs 79.3 (Gemini 2.5 Flash) vs 70.7 (Qwen3‑Omni).
Video‑Holmes: 64.29 vs 51.3 vs 50.4.
LiveSports‑3K‑CC win‑rate: 54.4 % (MiniCPM‑o 4.5); competing models report no result.
Speech quality
Character error rate (CER, lower is better): 0.86 (MiniCPM‑o 4.5) vs 1.45 (CosyVoice2) vs 1.41 (Qwen3‑Omni).
Word error rate (WER, lower is better): 2.38 vs 2.57 vs 3.39.
Emotion score (Expresso): 29.8 (MiniCPM‑o 4.5) vs 17.9 (CosyVoice2).
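For readers unfamiliar with the two error metrics, the sketch below shows the standard textbook definition: edit distance between a reference and a hypothesis, divided by the reference length, counted over characters (CER) or words (WER). The article does not describe its evaluation pipeline, so this illustrates the metric itself rather than the benchmark's exact procedure; the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence, hyp_units: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate over individual characters."""
    return error_rate(list(ref), list(hyp))


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))
    print(cer("abc", "abd"))
```

Because the distance is normalized by the reference length only, a hypothesis with many insertions can produce a rate above 1.0.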
Key components of Omni‑Flow
Omni‑Flow creates a shared timeline that slices visual, audio, and language streams into millisecond‑level slots. In each slot the model performs a perception‑reasoning‑response cycle, enabling natural interruptions and eliminating reliance on external VAD.
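The shared timeline can be pictured as bucketing timestamped events from every modality into fixed‑length slots and making one decision per slot. The sketch below is a toy illustration only: the 100 ms slot length, the `Event` record, and the rule "user speech in a slot makes the model stop talking" are hypothetical, whereas the real model makes these decisions end to end.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    t_ms: int       # timestamp on the shared timeline
    modality: str   # "vision", "audio", or "text"
    payload: str    # placeholder for the actual frame / audio / token


def slice_into_slots(events: List[Event],
                     slot_ms: int = 100) -> Dict[int, List[Event]]:
    """Group events from all modalities by the slot their timestamp falls in."""
    slots = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.t_ms):
        slots[ev.t_ms // slot_ms].append(ev)
    return dict(sorted(slots.items()))


def run_duplex(slots: Dict[int, List[Event]],
               speaking: bool = True) -> List[Tuple[int, str]]:
    """Toy per-slot policy: user speech in a slot makes the model yield."""
    actions = []
    for index, events in slots.items():
        user_talking = any(e.modality == "audio" and e.payload == "user_speech"
                           for e in events)
        if user_talking:
            speaking = False  # interruption handled inside the loop, no VAD
        actions.append((index, "speak" if speaking else "listen"))
    return actions


if __name__ == "__main__":
    events = [
        Event(10, "vision", "frame0"),
        Event(30, "audio", "silence"),
        Event(120, "audio", "user_speech"),
        Event(250, "vision", "frame1"),
    ]
    print(run_duplex(slice_into_slots(events)))
```

The point of the sketch is the control flow: perception and the speak‑or‑listen decision happen in the same per‑slot loop, so an interruption is noticed at the next slot rather than by a separate detector.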
Open resources
Technical report PDF: https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf
Demo repository (includes local installer): https://github.com/OpenBMB/MiniCPM-o-Demo
Model download (Hugging Face): https://huggingface.co/openbmb/MiniCPM-o-4_5
Model download (ModelScope): https://www.modelscope.cn/models/OpenBMB/MiniCPM-o-4_5
