LM Studio Adds MTP Support, Boosting Qwen3.6‑35B to ~130 Tokens/s
LM Studio 0.4.14+ now implements Multi‑Token Prediction (MTP) speculative decoding, eliminating the need for a separate draft model and delivering roughly double the token throughput—e.g., Qwen3.6‑35B reaches about 130 tokens/s on RTX 3090—while providing a six‑step activation guide and a list of known pitfalls.
What Is MTP?
Multi‑Token Prediction (MTP) extends traditional language‑model decoding, which predicts one token at a time, by allowing the model to predict several tokens in a single forward pass. It is a form of speculative decoding that integrates a draft capability directly into the main model, avoiding an extra lightweight draft model.
Key Benefits
No extra draft model needed: the MTP head is trained together with the main model, providing built‑in draft ability.
Natural alignment: because predictions come from the same model, verification pass rates are high and speed‑up ratios are stable.
How to Enable MTP in LM Studio (Six Steps)
Upgrade LM Studio to version 0.4.14 Build 4 .
Open LM Studio and turn on Developer Mode .
In the model settings, ensure the llama.cpp engine version is 2.15.0 or newer.
Download a GGUF model that includes an MTP head, for example:
unsloth/Qwen3.6-35B-A3B-MTP-GGUF
unsloth/Qwen3.6-27B-MTP-GGUF
When loading the model, check the MTP option to enable it.
Enjoy roughly a 2× speed increase on supported hardware.
Measured Performance Gains
RTX 3090, Qwen3.6‑27B MTP: ~20.69 tok/s without MTP, ~42 tok/s with MTP → ~2.0× speedup.
High‑end configuration, Qwen3.6‑35B‑A3B MTP: ~130 tok/s (MTP enabled).
Known Issues (Pitfalls to Avoid)
Build 2 produced blank output characters – fixed in Build 3.
Non‑MTP speculative decoding crashes when MTP is enabled – fixed in Build 4.
Small models (≈4B) may experience reverse acceleration – recommend using larger models.
Gemma 4 MTP is currently unavailable – known bug.
MTP is disabled by default; beginners may miss the setting – must enable Dev Mode and select the MTP loading flag.
llama.cpp engine must be version 2.15.0 or higher – beta channel users may need to upgrade manually.
Comparison with Native llama.cpp
Native llama.cpp can be invoked with the ubatch option and other tuning parameters, offering potentially greater optimization space than LM Studio. Choose based on scenario:
Quick start, no hassle: LM Studio 0.4.14 + MTP – one‑click activation, ~2× speed boost.
Maximum performance: Use the native llama.cpp CLI, manually tune ubatch, n_gpu_layers, KV cache, etc.
Template fixes: Combine with froggeric patches (Qwen‑Fixed‑Chat‑Templates) for stable, higher‑performance inference on LM Studio.
Images in the original article illustrate the upgrade process and settings screenshots.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
