Optimizing Large Model Inference Architecture for the Agent Era: Engineering Practices and Challenges
The article analyzes the architectural challenges of large‑model inference in the Agent era—such as memory‑intensive MLA structures, MoE communication overhead, exploding KV‑Cache size, and tool‑call accuracy—and presents a series of engineering solutions including hierarchical KV‑Cache pooling, sequence parallelism, offloading strategies, and chip‑level adaptations to achieve higher throughput and lower token costs.
In early 2025 DeepSeek released DeepSeek‑R1, an open‑source large model that uses Multi‑head Latent Attention (MLA). MLA compresses the KV‑Cache and drastically reduces GPU memory usage, but it requires custom communication operators and KV‑Cache handling. The model also adopts a Mixture‑of‑Experts (MoE) architecture, which activates only a subset of experts per token and so cuts per‑token compute, but adds expert‑parallel communication overhead.
Later in 2025 the industry shifted toward Agent‑centric applications, where input contexts grew from the traditional 4K–128K range to 40K, 60K, and even 100K tokens. Transformer attention cost grows quadratically with sequence length (a 10× longer sequence multiplies attention compute by roughly 100×), and KV‑Cache memory grows linearly with it, so longer contexts limit both concurrency and decode throughput.
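The scaling above can be checked with a little arithmetic. The sketch below is illustrative only: the layer count, head count, and head dimension are hypothetical values for a plain multi‑head attention model, not figures from the article, and MLA would shrink the KV‑Cache number considerably.

```python
def attention_flops(seq_len: int, n_layers: int, n_heads: int, d_head: int) -> int:
    """Approximate prefill FLOPs for the two attention matmuls (QK^T and scores @ V).

    Each matmul costs about 2 * seq_len^2 * d_head per head. The causal mask,
    which roughly halves the work, is ignored here for simplicity.
    """
    return 2 * 2 * seq_len * seq_len * d_head * n_heads * n_layers


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, d_head: int,
                   bytes_per_elem: int = 2) -> int:
    """KV-Cache size for one request: keys plus values, every layer, every token."""
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem


# Hypothetical model shape, chosen only to make the numbers concrete.
LAYERS, HEADS, D_HEAD = 32, 32, 128

short_ctx, long_ctx = 10_000, 100_000
flops_ratio = attention_flops(long_ctx, LAYERS, HEADS, D_HEAD) / attention_flops(
    short_ctx, LAYERS, HEADS, D_HEAD)
cache_ratio = kv_cache_bytes(long_ctx, LAYERS, HEADS, D_HEAD) / kv_cache_bytes(
    short_ctx, LAYERS, HEADS, D_HEAD)

print(f"attention compute grows {flops_ratio:.0f}x")  # quadratic in length
print(f"KV-Cache memory grows {cache_ratio:.0f}x")    # linear in length
print(f"KV-Cache at {long_ctx} tokens: "
      f"{kv_cache_bytes(long_ctx, LAYERS, HEADS, D_HEAD) / 2**30:.1f} GiB per request")
```

Under these assumed dimensions a single 100K‑token request already needs tens of GiB of cache at 16‑bit precision, which is why concurrency drops so sharply as contexts lengthen.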
To address these issues, Baidu Baige implemented several system‑level optimizations:
Hierarchical KV‑Cache pooling : a two‑level cache keeps frequently accessed tokens in high‑speed memory and offloads the rest to host memory or SSD, achieving a KV‑Cache hit rate close to 90% and reducing the cost of reprocessing repeated tokens.
Sequence parallelism (CP) : with N GPUs, a long sequence is split into 2N slices. Counting from zero, GPU i receives slice i from the first half and slice 2N−1−i from the second half, so the first half is assigned in order and the second half in reverse order to balance load. Index data is synchronized across GPUs with a single AllGather, eliminating costly multi‑round send‑recv.
All‑GPU load balancing : under causal attention, later slices attend to more preceding tokens and are therefore heavier. Pairing each heavy late slice with a light early slice on the same GPU balances overall GPU utilization.
Offload and Overlap strategies : the top‑K tokens (e.g., 2K) are kept in HBM for fast reuse, while the rest are streamed from host memory or SSD. Overlap lets computation start before the entire token batch has been loaded.
DSA‑based decoder : the decoder computes only a limited token window (1K–2K) per step, avoiding full‑sequence attention and enabling KV‑Cache offload to host memory, which multiplies concurrency.
Tool‑call accuracy : constrained decoding and optimized sampling improve the success rate of tool invocation, reducing token waste and task failures.
Chip adaptation : a vLLM‑Kunlun device‑plugin abstracts hardware specifics, allowing models to run on Kunlun chips without modifying the main vLLM codebase. This reduces adaptation cost and brings GPU‑level precision and performance to domestic chips.
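The slice assignment described under sequence parallelism above can be sketched in a few lines. This is a minimal illustration rather than Baige's implementation: the function names are hypothetical, and the cost model simply assumes that under causal attention slice k attends to slices 0 through k.

```python
def zigzag_assignment(num_gpus: int) -> list[tuple[int, int]]:
    """Return, for each GPU rank, the two slice indices it owns (0-indexed).

    The sequence is cut into 2 * num_gpus slices. Rank i takes slice i from the
    first half (in order) and slice 2N-1-i from the second half (in reverse).
    """
    total = 2 * num_gpus
    return [(rank, total - 1 - rank) for rank in range(num_gpus)]


def causal_cost(slice_idx: int) -> int:
    """Assumed relative cost of one slice, in slice-sized attention blocks.

    Slice k attends to slices 0..k, so it touches k + 1 blocks. The diagonal
    block is counted as a full block to keep the arithmetic simple.
    """
    return slice_idx + 1


def zigzag_cost(num_gpus: int) -> list[int]:
    """Per-GPU cost under the zigzag assignment."""
    return [causal_cost(a) + causal_cost(b) for a, b in zigzag_assignment(num_gpus)]


def contiguous_cost(num_gpus: int) -> list[int]:
    """Per-GPU cost for the naive split where rank r owns slices 2r and 2r+1."""
    return [causal_cost(2 * r) + causal_cost(2 * r + 1) for r in range(num_gpus)]


if __name__ == "__main__":
    print(zigzag_assignment(4))  # [(0, 7), (1, 6), (2, 5), (3, 4)]
    print(zigzag_cost(4))        # [9, 9, 9, 9]
    print(contiguous_cost(4))    # [3, 7, 11, 15]
```

With four GPUs the naive contiguous split leaves the last GPU doing five times the work of the first, while the zigzag pairing gives every GPU the same total.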
Additional fine‑grained optimizations include eliminating scheduling overhead via asynchronous launch, using UVA operators for small‑block I/O, and overlapping token loading with computation in the decode stage. Together, these improvements lowered inference latency, increased throughput several‑fold, and cut token consumption substantially.
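The overlap idea, starting computation before all tokens are loaded, follows a familiar prefetch pattern. The sketch below imitates it on the CPU with a background thread. `load_chunk` and `compute` are hypothetical stand‑ins for the KV block transfer and the attention kernel; a real system would more likely use asynchronous device copies on separate streams.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_chunk(chunk_id: int) -> list[int]:
    """Hypothetical stand-in for fetching one KV block from host memory or SSD."""
    time.sleep(0.01)  # simulated I/O latency
    return [chunk_id] * 4


def compute(chunk: list[int]) -> int:
    """Hypothetical stand-in for the work done on one loaded block."""
    time.sleep(0.01)  # simulated compute time
    return sum(chunk)


def run_serial(num_chunks: int) -> list[int]:
    """Baseline: load a chunk, then compute on it, one after another."""
    return [compute(load_chunk(i)) for i in range(num_chunks)]


def run_overlapped(num_chunks: int) -> list[int]:
    """Compute on chunk i while chunk i + 1 is loading in the background."""
    results: list[int] = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # wait only for the block needed now
            if i + 1 < num_chunks:
                pending = pool.submit(load_chunk, i + 1)  # start the next load early
            results.append(compute(chunk))  # runs while the next load is in flight
    return results


if __name__ == "__main__":
    for fn in (run_serial, run_overlapped):
        start = time.perf_counter()
        out = fn(8)
        print(fn.__name__, out, f"{time.perf_counter() - start:.3f}s")
```

Both functions return the same results. The overlapped version hides most of the load latency behind compute, so its wall‑clock time approaches that of the slower of the two stages rather than their sum.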
Beyond system optimizations, Baige contributed PRs to open‑source projects (vLLM‑Kunlun, SGLang) to share the adaptations and enable broader community use of the techniques.
Overall, the presented engineering practices demonstrate how to scale inference services for ever‑longer contexts and Agent workloads while keeping memory, compute, and token costs under control.
Baidu Intelligent Cloud Tech Hub
