Tagged articles

16 articles

May 27, 2026 · Artificial Intelligence

Optimizing Large Model Inference Architecture for the Agent Era: Engineering Practices and Challenges

The article analyzes the architectural challenges of large‑model inference in the Agent era, including memory‑intensive MLA structures, MoE communication overhead, exploding KV‑Cache size, and tool‑call accuracy. It presents engineering solutions such as hierarchical KV‑Cache pooling, sequence parallelism, offloading strategies, and chip‑level adaptations that raise throughput and lower per‑token cost.

AI Infra · Agent · DeepSeek

0 likes · 15 min read

Old Zhang's AI Learning

May 24, 2026 · Artificial Intelligence

LM Studio Adds MTP Support, Boosting Qwen3.6‑35B to ~130 Tokens/s

LM Studio 0.4.14+ now implements Multi‑Token Prediction (MTP) speculative decoding, which removes the need for a separate draft model and roughly doubles token throughput (for example, Qwen3.6‑35B reaches about 130 tokens/s on an RTX 3090). The article also provides a six‑step activation guide and a list of known pitfalls.

LM Studio · MTP · Qwen3.6

0 likes · 6 min read

Old Zhang's AI Learning

May 14, 2026 · Artificial Intelligence

Boost Qwen3.6 with MTP: 1.5× Faster Local Deployment for Claude Code

The article explains how to enable Multi‑Token Prediction (MTP) in Qwen3.6 using a specific llama.cpp PR for up to 1.5× faster local inference. It covers compilation steps, recommended parameters, memory requirements, and integration of the accelerated model with Claude Code, along with common pitfalls to avoid.

Claude Code · LLM acceleration · MTP

0 likes · 11 min read

Old Zhang's AI Learning

May 12, 2026 · Artificial Intelligence

How Unsloth’s MTP Boosts Qwen3.6 Inference Speed on Consumer GPUs

Unsloth adds MTP to the Qwen3.6‑27B and 35B‑A3B models, delivering 1.5‑2× faster decoding on consumer‑grade GPUs with a draft acceptance rate of about 80%. The article covers installation steps, usage parameters, benchmark results, and guidance on suitable scenarios.

GGUF · GPU · Local Inference

0 likes · 9 min read

Lao Guo's Learning Space

May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large‑language‑model inference is bottlenecked by memory bandwidth, then details Google’s Gemma 4 MTP technique, which uses a small draft model with speculative decoding and a shared KV‑Cache to predict several tokens in parallel, achieving up to threefold speed gains without any loss in output quality. It also provides step‑by‑step local deployment instructions.

Gemma 4 · Inference Optimization · KV Cache

0 likes · 11 min read

Old Zhang's AI Learning

Apr 18, 2026 · Artificial Intelligence

NVIDIA Nemotron 3 Super: 7× Faster Than Qwen3.5 – Inside Hybrid Mamba‑Attention, LatentMoE, and MTP

NVIDIA’s Nemotron 3 Super, a 120.6 B‑parameter flagship model supporting 1 M‑token context, combines Hybrid Mamba‑Attention, LatentMoE, and Multi‑Token Prediction to achieve up to 7.5× higher inference throughput than Qwen3.5 while matching or surpassing its accuracy across a range of benchmarks.

Hybrid Mamba-Attention · Large Language Model · LatentMoE

0 likes · 11 min read

AI Insight Log

Dec 18, 2025 · Artificial Intelligence

Xiaomi’s New MiMo‑V2‑Flash LLM Rivals DeepSeek‑V3.2 and Near‑GPT‑5 High

Xiaomi’s MiMo‑V2‑Flash, a 309B‑parameter MoE LLM with only 15B active parameters, uses Hybrid SWA, Multi‑Token Prediction, and Multi‑Teacher On‑Policy Distillation to cut KV‑cache size sixfold and boost inference speed 2.6×. Its performance is comparable to DeepSeek‑V3.2 and Kimi‑K2 and close to GPT‑5 High, including a 73.4% SWE‑Bench code‑agent score.

Hybrid SWA · Large Language Model · MOPD

0 likes · 7 min read

Xiaomi Tech

Dec 17, 2025 · Artificial Intelligence

Xiaomi MiMo-V2-Flash Open‑Source: Ultra‑Efficient Inference and Agent‑Ready Model

Xiaomi's MiMo-V2-Flash, a 309B MoE model with hybrid attention and Multi‑Token Prediction acceleration, delivers top‑2 global agent benchmark scores, up to 2× faster inference, and only 2.5% of the cost of comparable closed‑source models, while being fully open‑source.

Efficient Inference · Hybrid Attention · MOPD

0 likes · 7 min read

Baidu Intelligent Cloud Tech Hub

Oct 28, 2025 · Artificial Intelligence

How Baidu’s New MTP Inference Code Doubles DeepSeek‑V3.2 Throughput

Baidu Baige and the SGLang community have open‑sourced a production‑tested MTP inference engine that more than doubles DeepSeek‑V3.2 decoding speed while remaining stable, thanks to a DSA‑optimized architecture that predicts multiple tokens in a single forward pass.

AI · DSA · DeepSeek

0 likes · 4 min read

Data Party THU

Sep 21, 2025 · Artificial Intelligence

Building a Mini‑DeepSeek‑V3: Transformer Block and MTP Implementation on Limited Compute

This article walks through the design and implementation of a Mini‑DeepSeek‑V3 language model, detailing how to assemble the core Transformer block, integrate Multi‑Token Prediction (MTP) modules, construct the overall architecture, and compute the combined loss, all on modest GPU resources with a single‑card or DDP training setup.

AI · DeepSeek · MTP

0 likes · 12 min read

Tencent Technical Engineering

Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations, including prefill/decode (PD) separation, multi‑layer MTP, expert parallelism (EP) and data parallelism (DP), hardware‑aware kernels, and load‑balancing strategies. Together they boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI Performance · DeepSeek · GPU utilization

0 likes · 13 min read

Baidu Tech Salon

Mar 13, 2025 · Artificial Intelligence

How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations

PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.

Docker · Inference Optimization · MLA

0 likes · 13 min read

AI Algorithm Path

Feb 9, 2025 · Artificial Intelligence

Understanding Multi-Token Prediction in DeepSeek‑R1 Architecture

This article dissects the Multi‑Token Prediction (MTP) technique used in DeepSeek‑R1, contrasting it with traditional next‑token prediction, detailing Meta’s MTP design, DeepSeek’s adapted architecture, loss weighting, and why MTP is applied only during training to boost efficiency and model capability.

DeepSeek · MTP · Model architecture

0 likes · 9 min read

Baobao Algorithm Notes

Jan 15, 2025 · Artificial Intelligence

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques, from early blockwise parallel decoding to Meta's and DeepSeek's implementations. It explains their architectures, their training and inference workflows, and the speed‑ups they offer for large language models.

DeepSeek · Inference Acceleration · LLM

0 likes · 20 min read

OPPO Kernel Craftsman

Feb 14, 2020 · Fundamentals

Comprehensive Guide to USB Protocol and Linux USB Driver Architecture

This guide thoroughly explains USB technology and its Linux implementation, covering fundamentals, transmission modes, descriptor structures, enumeration flow, gadget driver architecture with MTP details, and host driver mechanisms such as URBs, mouse and storage drivers, plus references for further study.

Linux driver development · MTP · NRZI encoding

0 likes · 12 min read

Ctrip Technology

Feb 7, 2018 · Mobile Development

Ctrip's Mobile Tech Platform (MTP) and Mobile Continuous Delivery (MCD): Design, Implementation, and Outcomes

In 2017 Ctrip reorganized its wireless engineering around a lifecycle‑driven, platform‑based approach, introducing the Mobile Tech Platform (MTP) and Mobile Continuous Delivery (MCD) platforms. These unified component services, development frameworks, and automated build‑release pipelines for more than 20 apps, improving both efficiency and quality.

Continuous Delivery · Ctrip · MCD

0 likes · 9 min read
