LM Studio Adds MTP Support, Boosting Qwen3.6‑35B to ~130 Tokens/s

LM Studio 0.4.14 and later implement Multi‑Token Prediction (MTP) speculative decoding, which removes the need for a separate draft model and roughly doubles token throughput; for example, Qwen3.6‑35B reaches about 130 tokens/s on an RTX 3090. This article gives a six‑step activation guide and a list of known pitfalls.

Old Zhang's AI Learning

What Is MTP?

Traditional language‑model decoding predicts one token at a time. Multi‑Token Prediction (MTP) extends this by letting the model predict several tokens in a single forward pass. It is a form of speculative decoding that builds the draft capability directly into the main model, so no separate lightweight draft model is needed.
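To make the draft-then-verify idea concrete, here is a minimal toy sketch in Python. It is not LM Studio's or llama.cpp's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the MTP head and the main model, tokens are plain integers, and decoding is greedy. A real engine verifies all drafted positions in one batched forward pass, whereas this sketch loops for readability.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it produces."""
    # 1) Draft k tokens with the cheap predictor.
    drafted: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2) Verify the drafts against the main model, position by position.
    produced: List[Token] = []
    ctx = list(context)
    for token in drafted:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: keep the main model's token and stop.
            produced.append(expected)
            return produced
        produced.append(token)
        ctx.append(token)

    # All drafts accepted: the main model contributes one extra token.
    produced.append(target_next(ctx))
    return produced


# Toy models (assumptions for illustration only).
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=3))  # [1, 2, 3, 4]
print(speculative_step([2], draft_next, target_next, k=3))  # [3, 4]
```

Starting from `[0]`, all three drafts are accepted and the round yields four tokens. Starting from `[2]`, the draft diverges on its second token, so the round yields only two. Either way the output matches what the main model alone would have generated, which is why the technique changes speed but not results in this greedy setting.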

Key Benefits

No extra draft model needed: the MTP head is trained together with the main model, providing built‑in draft ability.

Natural alignment: because the draft predictions come from the same model, a high share of drafted tokens passes verification and speed‑up ratios are stable.
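Why a high acceptance rate matters can be illustrated with a common simplified model of speculative decoding, sketched below. It assumes every drafted token is accepted independently with the same probability `alpha`, that `k` tokens are drafted per round, and that each draft step costs a fraction `draft_cost` of a main-model pass. All three names and the example values are illustrative assumptions, not figures measured in LM Studio.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes each drafted token is accepted independently with
    probability alpha and k tokens are drafted per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the cost of a round.

    A round costs one main-model pass plus k draft steps, each
    costing draft_cost relative to a main-model pass.
    """
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * draft_cost)


print(f"{expected_tokens_per_pass(0.8, 3):.3f}")   # 2.952
print(f"{estimated_speedup(0.8, 3, 0.05):.2f}")    # 2.57
print(f"{estimated_speedup(0.3, 3, 0.20):.2f}")    # 0.89
```

Under these assumptions, a well-aligned draft with cheap draft steps gives an estimate well above 2×, while a low acceptance rate combined with a relatively expensive draft step pushes the estimate below 1, meaning a slowdown.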

How to Enable MTP in LM Studio (Six Steps)

Upgrade LM Studio to version 0.4.14 Build 4.

Open LM Studio and turn on Developer Mode.

In the model settings, ensure the llama.cpp engine version is 2.15.0 or newer.

Download a GGUF model that includes an MTP head, for example:

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

unsloth/Qwen3.6-27B-MTP-GGUF

When loading the model, check the MTP option to enable it.

Enjoy roughly a 2× speed increase on supported hardware.

Measured Performance Gains

RTX 3090, Qwen3.6‑27B MTP: ~20.69 tok/s without MTP, ~42 tok/s with MTP → ~2.0× speedup.

High‑end configuration, Qwen3.6‑35B‑A3B MTP: ~130 tok/s (MTP enabled).
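The speedup figure can be checked with a few lines of arithmetic. The sketch below uses only the two RTX 3090 throughput numbers quoted above for Qwen3.6‑27B; the helper names are made up for illustration.

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of throughput with MTP to throughput without it."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return mtp_tps / baseline_tps


def seconds_for(n_tokens: int, tps: float) -> float:
    """Time needed to generate n_tokens at a given tokens/s rate."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return n_tokens / tps


baseline, with_mtp = 20.69, 42.0  # tok/s, from the RTX 3090 measurement
print(f"speedup: {speedup(baseline, with_mtp):.2f}x")                   # speedup: 2.03x
print(f"1000 tokens without MTP: {seconds_for(1000, baseline):.1f} s")  # 48.3 s
print(f"1000 tokens with MTP: {seconds_for(1000, with_mtp):.1f} s")     # 23.8 s
```

In practical terms, a 1000-token answer that took about 48 seconds without MTP takes about 24 seconds with it at these rates.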

Known Issues (Pitfalls to Avoid)

Blank output characters in Build 2: fixed in Build 3.

Non‑MTP speculative decoding crashes when MTP is enabled: fixed in Build 4.

Small models (≈4B) may run slower with MTP enabled than without it: use larger models.

Gemma 4 MTP is currently unavailable: known bug.

MTP is disabled by default, so beginners may miss the setting: turn on Developer Mode and check the MTP option when loading the model.

The llama.cpp engine must be version 2.15.0 or newer: beta channel users may need to upgrade manually.
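Because several of these pitfalls come down to version numbers, a small check can help when scripting a setup. The sketch below compares dotted version strings numerically. The function names are hypothetical, the version strings must be supplied by hand (it does not query LM Studio), and build numbers such as "Build 4" are not parsed.

```python
from typing import Tuple

MIN_LM_STUDIO = "0.4.14"  # minimum app version named in this article
MIN_ENGINE = "2.15.0"     # minimum llama.cpp engine version named in this article


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Turn '2.15' or '2.15.0' into a tuple of ints, padded with zeros."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.15.0", MIN_ENGINE))     # True
print(meets_minimum("2.9.1", MIN_ENGINE))      # False
print(meets_minimum("0.4.14", MIN_LM_STUDIO))  # True
```

Comparing integer tuples rather than raw strings matters here: as plain text, "2.9.1" sorts after "2.15.0", which would wrongly pass the check.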

Comparison with Native llama.cpp

Native llama.cpp exposes the ubatch option and other tuning parameters directly, which potentially leaves more room for optimization than LM Studio. Choose based on your scenario:

Quick start, no hassle: LM Studio 0.4.14 + MTP – one‑click activation, ~2× speed boost.

Maximum performance: Use the native llama.cpp CLI, manually tune ubatch, n_gpu_layers, KV cache, etc.

Template fixes: Combine with froggeric patches (Qwen‑Fixed‑Chat‑Templates) for stable, higher‑performance inference on LM Studio.
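As a rough illustration of the manual-tuning route, the sketch below assembles a `llama-server` command line in Python. The flags shown (`-m`, `--n-gpu-layers`, `--ubatch-size`, `--ctx-size`, `--cache-type-k`, `--cache-type-v`) are standard llama.cpp options, but the model path and every value are placeholder assumptions to tune for your own hardware. No MTP-specific flag is included because the article does not name one; check `llama-server --help` for your build.

```python
import shlex
from typing import List


def build_llama_server_args(model_path: str,
                            n_gpu_layers: int = 99,
                            ubatch_size: int = 512,
                            ctx_size: int = 8192,
                            kv_cache_type: str = "q8_0") -> List[str]:
    """Assemble an argument list for llama-server (all values are placeholders)."""
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # layers offloaded to the GPU
        "--ubatch-size", str(ubatch_size),    # physical batch size ("ubatch")
        "--ctx-size", str(ctx_size),          # context window in tokens
        "--cache-type-k", kv_cache_type,      # KV cache data type for keys
        "--cache-type-v", kv_cache_type,      # KV cache data type for values
    ]


# Hypothetical local path; substitute the GGUF file you actually downloaded.
args = build_llama_server_args("models/your-mtp-model.gguf", ubatch_size=1024)
print(shlex.join(args))
# To launch for real: subprocess.run(args, check=True)
```

Building the arguments as a list keeps each tuning knob explicit, which makes it easy to sweep values such as the ubatch size and compare tokens/s between runs.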

Images in the original article illustrate the upgrade process and settings screenshots.
