Qwen3.6-27B Open‑Source: How a 27B Dense Model Outperforms the 397B Giant
The newly released Qwen3.6-27B dense multimodal model, at just 27 B parameters, surpasses the 397 B flagship on most coding benchmarks, offers up to 1 M tokens of context, supports FP8 quantization, and can be deployed locally via vLLM, SGLang, or Transformers with modest hardware.
Model Overview
Qwen3.6-27B is a 27 B dense multimodal model that surpasses the previous open‑source flagship Qwen3.5‑397B‑A17B on most coding benchmarks.
SWE‑bench Verified: 77.2 (vs 76.2 for 3.5‑397B)
SWE‑bench Pro: 53.5 (vs 50.9)
Terminal‑Bench 2.0: 59.3 (vs 52.5)
SkillsBench Avg5: 48.2 (vs 30.0)
GPQA Diamond: 87.8
AIME 2026: 94.1
Compared with the closed‑source Claude 4.5 Opus, the gap on coding benchmarks is 1‑5 points and Terminal‑Bench scores are identical (59.3).
Key Advantages
Agentic coding: On real-world coding tasks, especially front-end and repository-level work, the model is reported to outperform Claude.
Thinking preservation: Multi‑turn conversations keep reasoning context, avoiding repeated “thinking” in iterative coding.
Architecture
Parameters: 27 B dense (no MoE)
Layers: 64, hidden dimension 5120
Native context length: 262 144 tokens, extendable to 1 010 000 tokens
Hidden layout:
16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
Multimodal vision encoder supports images, video, and documents
Supports MTP (Multi‑Token Prediction) for inference speedup
Gated DeltaNet + Gated Attention mix is more memory‑friendly than pure attention for long contexts.
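The layer layout above can be checked with a few lines of arithmetic. The sketch below expands the 16 × (3 + 1) pattern into a flat list and counts each block type. The names build_layout, "deltanet", and "attention" are illustrative labels, not identifiers from the model code. It also assumes, as is typical for linear-attention layers, that only the Gated Attention layers keep a KV cache that grows with context length.

```python
from collections import Counter


def build_layout(groups: int = 16, deltanet_per_group: int = 3,
                 attention_per_group: int = 1) -> list[str]:
    """Expand the repeating group pattern into a flat per-layer list."""
    group = ["deltanet"] * deltanet_per_group + ["attention"] * attention_per_group
    return group * groups


layout = build_layout()
counts = Counter(layout)

# Assumption: only full-attention layers hold a cache that grows with context.
kv_fraction = counts["attention"] / len(layout)

print(len(layout))           # total number of layers
print(dict(counts))          # layers per block type
print(f"{kv_fraction:.0%}")  # share of layers with a growing KV cache
```

Under these assumptions the pattern yields 64 layers, matching the stated layer count, with one quarter of them being full-attention layers.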
FP8 Quantized Version
The FP8 weight file is ~30 GB. Using the Qwen/Qwen3.6-27B-FP8 checkpoint halves memory usage while performance loss is reported as negligible.
Why 27 B Is a Sweet Spot
Easy deployment: Dense architecture works directly with vLLM or SGLang without expert parallelism.
Moderate hardware requirements: BF16 weights need ~54 GB VRAM (e.g., 2 × A100 40 GB, 1 × H100 80 GB, or 4 × RTX 4090). FP8 weights need ~27 GB (single 48 GB L40S/A6000). KV cache and activations come on top of these figures.
No capability compromise: Benchmarks show it outperforms the 397 B model on most coding tasks.
Fully open weights: Available on Hugging Face and ModelScope for unrestricted commercial use.
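The VRAM figures above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate of weight memory only. The helpers weight_memory_gb and min_gpus are hypothetical names, 1 GB is taken as 10^9 bytes, and KV cache, activations, and framework overhead are deliberately ignored, so real deployments need headroom beyond these numbers.

```python
import math

PARAMS = 27e9  # 27 B parameters, as stated in the article

# Bytes stored per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(weights_gb: float, gpu_vram_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the weights."""
    return math.ceil(weights_gb / gpu_vram_gb)


bf16_gb = weight_memory_gb(PARAMS, "bf16")
fp8_gb = weight_memory_gb(PARAMS, "fp8")

print(bf16_gb, fp8_gb)
print(min_gpus(bf16_gb, 40), min_gpus(bf16_gb, 80), min_gpus(fp8_gb, 48))
```

By this estimate the BF16 weights alone would fit on three 24 GB cards. Tensor-parallel sizes are usually powers of two and the KV cache needs room as well, which is consistent with the 4 × RTX 4090 suggestion.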
Local Deployment Options
Officially supported routes: vLLM, SGLang, and Hugging Face Transformers. KTransformers also supports CPU‑GPU heterogeneous inference.
vLLM Deployment (recommended)
uv pip install vllm --torch-backend=auto
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3
With tool‑call (required for coding agents):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Enable MTP (speculative decoding):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Text‑only mode (drops vision encoder):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--language-model-only
OOM tip: If you run out of memory, do not reduce the context length below 128 K, because the model’s thinking ability degrades sharply.
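Once the server is running with --enable-auto-tool-choice, a client can offer tools through the OpenAI-style tools field. The sketch below builds such a request with only the standard library. The read_file tool, its schema, and the helper names are hypothetical examples, and the URL assumes the server from the commands above is listening on localhost:8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the vllm serve commands above

# Hypothetical tool definition in the OpenAI function-calling format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repository.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative file path"},
            },
            "required": ["path"],
        },
    },
}


def build_payload(prompt: str) -> dict:
    """Chat request that lets the model decide whether to call the tool."""
    return {
        "model": "Qwen/Qwen3.6-27B",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_payload("Open README.md and summarize it.")
print(json.dumps(payload, indent=2))
```

Calling send(payload) against a running server returns the parsed response. When the model chooses to call the tool, the call is expected under choices[0].message.tool_calls, and the client is responsible for executing it and sending the result back in a follow-up message.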
SGLang Deployment
uv pip install "sglang[all]"
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3
With tool use:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
Enable speculative MTP:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
Transformers Lightweight Deployment (testing only)
pip install "transformers[serving]"
transformers serve Qwen/Qwen3.6-27B --port 8000 --continuous-batching
This option is suitable for experiments; production should use vLLM or SGLang.
FP8 Quantized Model
Replace the model name with Qwen/Qwen3.6-27B-FP8 and keep the same launch parameters. Example with reduced tensor parallel size:
vllm serve Qwen/Qwen3.6-27B-FP8 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--reasoning-parser qwen3
Sampling Parameters (official recommendations)
General thinking mode: temperature=1.0, top_p=0.95, top_k=20, presence_penalty=0.0
Precise coding (e.g., WebDev): temperature=0.6, top_p=0.95, top_k=20
Non‑thinking mode: temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5
OpenAI‑Compatible API Usage
from openai import OpenAI

# Point the client at the local OpenAI-compatible server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Type \"I love Qwen3.6\" backwards"}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    # top_k is not a standard OpenAI parameter, so it is passed via extra_body.
    extra_body={"top_k": 20},
)
print(resp)
When thinking mode is enabled, responses include <think>...</think> blocks; switch to non‑thinking parameters to suppress them.
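The recommended sampling settings from the Sampling Parameters section can be kept in one place and merged into each request. This is a minimal sketch: the preset names and the request_kwargs helper are hypothetical, and top_k is routed through extra_body in the same way as in the request above.

```python
# Values copied from the recommendations listed earlier; preset names are illustrative.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "precise_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def request_kwargs(preset: str) -> dict:
    """Split a preset into standard arguments and extra_body fields."""
    params = dict(SAMPLING_PRESETS[preset])  # copy so the preset stays unchanged
    top_k = params.pop("top_k")
    return {**params, "extra_body": {"top_k": top_k}}


print(request_kwargs("precise_coding"))
```

With the client from the example above, a call would then look like client.chat.completions.create(model="Qwen/Qwen3.6-27B", messages=messages, **request_kwargs("precise_coding")).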
Multimodal Request Example
# Reuses the client created in the previous example.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://your-image-url.jpg"}},
        {"type": "text", "text": "How many circles are in this image?"}
    ]
}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
For video input, replace the image_url type field with video_url.
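A small helper can build either kind of message. The sketch below assumes the video part mirrors the image part, that is {"type": "video_url", "video_url": {"url": ...}}. That shape is an assumption based on the note above and should be checked against the serving framework's documentation. The function name and URLs are placeholders.

```python
def build_media_message(kind: str, url: str, question: str) -> dict:
    """Build a user message with one media part followed by a text question."""
    if kind not in ("image", "video"):
        raise ValueError(f"unsupported media kind: {kind}")
    part_type = f"{kind}_url"  # "image_url" or "video_url" (the latter is assumed)
    return {
        "role": "user",
        "content": [
            {"type": part_type, part_type: {"url": url}},
            {"type": "text", "text": question},
        ],
    }


image_msg = build_media_message(
    "image", "https://your-image-url.jpg", "How many circles are in this image?"
)
video_msg = build_media_message(
    "video", "https://your-video-url.mp4", "Summarize this clip."
)
print(video_msg["content"][0])
```

Either message can be passed as messages=[image_msg] or messages=[video_msg] in the request shown above.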
Pros and Cons
Pros:
27 B dense size makes deployment straightforward.
Agentic coding ability surpasses the 397 B MoE model.
Native 262 K context, extendable to 1 M tokens.
Multimodal + text capabilities in a single model.
FP8 quantized version halves memory requirements.
Full‑stack support: vLLM, SGLang, Transformers, KTransformers.
Cons:
Very hard reasoning tasks (e.g., HLE) still favor the 397 B model or Claude 4.5 Opus.
Default thinking mode adds latency; latency‑sensitive production may need to disable it.
Context length should not be reduced below 128 K, otherwise thinking degrades.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.