How Xiaomi’s MiMo V2.5 Achieves 99% API Price Cut with Full‑Stack Inference Optimizations

The MiMo‑V2.5 series combines Hybrid Sliding‑Window Attention (SWA), Mixture‑of‑Experts and multimodal support with a complete redesign of KVCache management, tiered caching, prefix‑tree logic and scheduling. Together these changes compress KVCache to about one‑seventh of that of a full‑attention model and deliver up to 40% faster Prefill and 30% lower TTFT, reducing inference cost enough to enable a 99% API price reduction.

Xiaomi Tech

Why Hybrid SWA?

Large‑model inference is dominated by KVCache memory: each generated token requires the entire history to be stored as key‑value pairs in GPU memory, so longer contexts increase cache size, reduce concurrent requests and raise per‑token cost. MiMo‑V2.5‑Pro breaks this constraint by using a hybrid architecture: only 10 of the 70 Transformer layers employ Full Attention, while the remaining 60 use Sliding‑Window Attention (window size 128 tokens). This reduces overall KVCache storage to roughly 1/7 of a pure Full‑Attention design.

Because the SWA layers also limit attention computation to the window, Prefill computation drops to about 1/7 of the original cost, and Decode latency, which correlates with KVCache reads, benefits proportionally in long‑sequence scenarios.
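The 1/7 figure follows directly from the layer split. The sketch below counts per-sequence KV slots for the hybrid stack and for an all-Full-Attention stack, using the layer counts and window size stated above. It assumes every layer stores the same number of bytes per token, which the text does not state, and the function names are illustrative.

```python
# Layer counts and window size come from the text; the equal per-layer
# KV size is an assumption made for this back-of-the-envelope estimate.
FULL_LAYERS = 10
SWA_LAYERS = 60
WINDOW = 128


def kv_slots_hybrid(context_len: int) -> int:
    """Token slots summed over layers for the hybrid (Full + SWA) stack."""
    full = FULL_LAYERS * context_len
    swa = SWA_LAYERS * min(context_len, WINDOW)  # SWA keeps at most WINDOW tokens
    return full + swa


def kv_slots_full(context_len: int) -> int:
    """Token slots if all 70 layers used Full Attention."""
    return (FULL_LAYERS + SWA_LAYERS) * context_len


def kv_ratio(context_len: int) -> float:
    """Hybrid cache size as a fraction of the all-Full-Attention cache size."""
    return kv_slots_hybrid(context_len) / kv_slots_full(context_len)


if __name__ == "__main__":
    for n in (128, 4_096, 65_536, 1_000_000):
        print(f"{n:>9} tokens: hybrid/full = {kv_ratio(n):.4f}")
```

For contexts no longer than the window the two designs store the same amount. As the context grows, the SWA term becomes negligible and the ratio approaches 10/70 = 1/7.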

KVCache System Reconstruction

To realize the theoretical gains, the KVCache manager was split into two independent pools:

Full KV Pool: grows on demand and retains long‑term entries.

SWA KV Pool: sized to the window, implemented as a circular buffer with O(W) capacity and window‑aware eviction.

This dual‑pool design yields an ~7× improvement in cache‑capacity efficiency, and SWA‑layer prefetch can overlap at layer‑wise granularity, making cache reads almost cost‑free.
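A minimal sketch of the two pool shapes described above, for a single layer of a single sequence. The class and method names are hypothetical and the real manager works on GPU tensors rather than Python lists. The point is only that the SWA side overwrites slots in place while the Full side keeps growing.

```python
class SwaKvRing:
    """Fixed-capacity ring buffer for one SWA layer (hypothetical sketch)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.slots = [None] * window  # O(W) storage regardless of sequence length
        self.next_pos = 0             # absolute position of the next token

    def append(self, kv) -> int:
        """Store KV for the next token, overwriting the entry that left the window."""
        pos = self.next_pos
        self.slots[pos % self.window] = kv
        self.next_pos += 1
        return pos

    def visible(self) -> list:
        """KV entries inside the current window, oldest first."""
        start = max(0, self.next_pos - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_pos)]


class FullKvPool:
    """Grow-on-demand store for one Full-Attention layer (hypothetical sketch)."""

    def __init__(self) -> None:
        self.entries = []

    def append(self, kv) -> int:
        """Store KV for the next token and return its index."""
        self.entries.append(kv)
        return len(self.entries) - 1
```

Eviction in the ring is implicit: writing position `p` reuses the slot that held position `p - W`, so no separate eviction pass is needed for SWA layers.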

Prefix‑Tree Refactor

Traditional prefix caches assume "identical token sequence ⇒ identical KV", which fails under SWA because the physical lifetime of SWA KV differs from the logical token sequence. The prefix tree was upgraded in three ways:

Match rule changed to “window‑safe length” (at least W tokens remain valid).

Eviction paths bound to request lifetimes, keeping the SWA pool constant at window scale.

Each node now holds both Full‑Attention indices and SWA mappings, enabling independent eviction.

Online hit rate averages 93%, exceeding 95% for high‑frequency users.
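One way to read the "window‑safe length" rule is sketched below: a cached prefix of length d is reusable only if the SWA entries for its trailing min(d, W) tokens are still resident, because the next token attends to exactly that window. This interpretation, the token-level trie, and all names are assumptions for illustration. The source does not give the actual data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    full_idx: int            # slot in the Full KV pool
    swa_idx: Optional[int]   # slot in the SWA pool, None once evicted
    children: Dict[int, "Node"] = field(default_factory=dict)


class PrefixTree:
    """Token-level trie whose nodes track Full and SWA residency separately."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.root = Node(full_idx=-1, swa_idx=None)
        self._next = 0

    def insert(self, tokens: List[int]) -> List[Node]:
        """Add a token sequence and return the nodes along its path."""
        node, path = self.root, []
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(full_idx=self._next, swa_idx=self._next)
                self._next += 1
            node = node.children[tok]
            path.append(node)
        return path

    def match(self, tokens: List[int]) -> int:
        """Longest prefix length whose trailing window still has SWA KV resident."""
        node, best, run = self.root, 0, 0
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            # run counts consecutive trailing nodes that still hold SWA KV
            run = run + 1 if node.swa_idx is not None else 0
            if run >= min(depth, self.window):
                best = depth
        return best
```

Under this rule, evicting SWA entries far behind the end of a prefix does not shorten the match, while evicting an entry inside the trailing window does. That is what allows the SWA pool to stay at window scale without invalidating long cached prefixes.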

GCache Three‑Level Tiered Cache

MiMo’s in‑house GCache spans GPU memory, CPU RAM and NVMe SSD. KVCache entries flow automatically based on hotness: hot data stays in GPU, colder data migrates to RAM or SSD, and is quickly restored when needed. GCache runs on GPU nodes, mixes local memory with attached SSD at zero extra storage cost, and achieves 170 GB/s read throughput with 280 µs latency via RDMA. Combined with the reduced SWA cache size, this multiplies the effective cache volume and dramatically raises hit rates.
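The hotness-driven flow between tiers can be illustrated with three LRU maps, where overflow from one tier is demoted to the next and a hit in a lower tier is promoted back to the top. This is a toy model under stated assumptions: GCache's real interface, its hotness metric, and its RDMA transport are not described in the source, and every name below is hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class TieredCache:
    """Three LRU tiers (GPU, RAM, SSD); an illustrative model, not GCache's API."""

    def __init__(self, gpu_cap: int, ram_cap: int, ssd_cap: int) -> None:
        self.caps = [gpu_cap, ram_cap, ssd_cap]
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]

    def _put(self, level: int, key: str, value: bytes) -> None:
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # most recently used entry sits at the end
        if len(tier) > self.caps[level]:
            old_key, old_val = tier.popitem(last=False)  # coldest entry
            if level + 1 < len(self.tiers):
                self._put(level + 1, old_key, old_val)   # demote one tier down
            # entries pushed out of the last tier are dropped

    def put(self, key: str, value: bytes) -> None:
        """Insert or update an entry in the top tier."""
        for tier in self.tiers[1:]:
            tier.pop(key, None)  # avoid stale copies in lower tiers
        self._put(0, key, value)

    def get(self, key: str) -> Optional[bytes]:
        """Return the entry if cached anywhere, promoting it to the top tier."""
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._put(0, key, value)
                return value
        return None

    def level_of(self, key: str) -> Optional[int]:
        """0 = GPU, 1 = RAM, 2 = SSD, None = not cached."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None
```

With capacities of one entry per tier, inserting three keys leaves the newest in the GPU tier and the oldest in the SSD tier, and a later read of the oldest key moves it back to the GPU tier while the others shift down.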

Scheduling and Prefill Optimizations

Freeing memory is not enough on its own: without matching scheduling, the saved resources sit idle. The system introduces:

KVCache‑affinity scheduling: routes requests to nodes that already contain the required prefix, improving L2 cache hit rate by ~25%.

Computation‑aware priority: prefers requests with fewer remaining tokens, with a penalty to avoid starvation, reducing TTFT P90 by 30%.

Length‑bucketed Prefill: three buckets (0‑64K, 64K‑256K, 256K‑1M) group similar‑length requests, preventing short requests from being blocked by long ones and boosting average Prefill throughput by ~40%.
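The three policies above can each be reduced to a small decision function. The sketch below is a hedged illustration, not the production scheduler: the bucket boundaries follow the text (with "K" taken as 1,000, which is an assumption), while the aging rate, the tie-breaking rule, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Upper bounds of the three Prefill buckets; "K" read as 1,000 (assumption).
BUCKET_EDGES = [64_000, 256_000, 1_000_000]


def bucket_of(prompt_len: int) -> int:
    """Index of the Prefill bucket a prompt falls into."""
    for i, upper in enumerate(BUCKET_EDGES):
        if prompt_len <= upper:
            return i
    raise ValueError("prompt longer than the largest bucket")


@dataclass
class Request:
    req_id: str
    remaining_tokens: int  # compute still needed, e.g. uncached prefill tokens
    waited_s: float        # time already spent in the queue


def priority(req: Request, aging_per_s: float = 100.0) -> float:
    """Lower is served first: least remaining work, minus a credit for waiting.

    The aging credit stands in for the starvation penalty; its rate is made up.
    """
    return req.remaining_tokens - aging_per_s * req.waited_s


def pick_node(prefix_hit_tokens: Dict[str, int], load: Dict[str, int]) -> str:
    """Prefer the node with the longest cached prefix; break ties by lower load."""
    return max(load, key=lambda n: (prefix_hit_tokens.get(n, 0), -load[n]))
```

For example, a 5,000-token request that has waited 60 s gets priority 5000 - 100 * 60 = -1000 and is served before a fresh 1,000-token request, which is the starvation guard in action.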

Decode Acceleration and Multimodal Parallelism

During Decode, KVCache still dominates memory usage. With full SWA support, effective KVCache capacity grows nearly 5×. Additional techniques include:

CUDA‑Graph memory tuning and pre‑allocation in the PD split, increasing per‑node concurrency.

Native support for three‑layer Multi‑Token Prediction (MTP), allowing the model to predict multiple tokens in parallel; in Prefill, the first 128 tokens see a 2.3× speed‑up, and tokens 128‑256 a 1.5× gain.
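The text does not say how the MTP predictions are consumed at serving time. A common pattern with multi-token heads is draft-and-verify: the extra heads propose several tokens, the main model scores them in one pass, and the longest agreeing prefix is kept. The sketch below shows only that acceptance step under greedy decoding. It is an assumption about usage, not a description of MiMo's implementation, and the function name is hypothetical.

```python
from typing import List, Sequence


def accept_drafts(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Greedy draft-and-verify acceptance.

    draft:    k tokens proposed by the MTP heads.
    verified: k + 1 tokens from the main model, where verified[i] is its
              choice given the context plus draft[:i].
    Returns the tokens emitted this step (between 1 and k + 1).
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must contain one more token than draft")
    out: List[int] = []
    for d, v in zip(draft, verified):
        if d != v:
            out.append(v)  # first disagreement: take the main model's token and stop
            return out
        out.append(d)
    out.append(verified[-1])  # every draft accepted: keep the bonus token too
    return out
```

With three draft tokens, one verification pass emits between one and four tokens, and output is identical to decoding the main model greedily one token at a time.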

For multimodal workloads, the encoder processes visual, audio and video streams in parallel. Batch‑wise image/audio fusion reduces a 1‑hour video’s end‑to‑end latency from 156 s to 23 s, and consistent‑hashing plus shared memory doubles encoder throughput.
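Consistent hashing is a natural fit here because it sends the same media item to the same encoder worker, so repeated inputs can reuse whatever that worker already holds, and adding a worker remaps only a fraction of the keys. The ring below is a generic sketch using the standard library. The worker names, the virtual-node count, and the choice of SHA-256 are illustrative assumptions, not details from the source.

```python
import bisect
import hashlib
from typing import List


def _h(key: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (generic sketch)."""

    def __init__(self, workers: List[str], vnodes: int = 64) -> None:
        # Each worker owns several points on the ring to even out the load.
        self._ring = sorted((_h(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def route(self, item_id: str) -> str:
        """Return the worker owning the first ring point after the item's hash."""
        i = bisect.bisect_right(self._keys, _h(item_id)) % len(self._ring)
        return self._ring[i][1]
```

Adding a fourth worker to a three-worker ring moves a key only if it now belongs to the new worker. No key is reshuffled between the existing three.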

Overall Impact

The combined optimizations (Hybrid SWA, MoE configuration, KVCache dual‑pool, prefix‑tree redesign, GCache tiering, affinity scheduling, length‑bucketed Prefill and MTP decoding) turn theoretical efficiency into production gains. Online inference now offers higher throughput, lower latency and reduced GPU memory pressure, enabling the MiMo‑V2.5 API price to be cut by up to 99% without sacrificing model capability.

Some of these improvements have been contributed back to the open‑source SGLang project via pull requests, and further open‑source plans are underway to lower engineering barriers for similar hybrid architectures.
