Qwen3.6-27B Open‑Source: How a 27B Dense Model Outperforms the 397B Giant

The newly released Qwen3.6-27B, a dense multimodal model with just 27 B parameters, surpasses the 397 B flagship on most coding benchmarks, offers up to 1 M tokens of context, supports FP8 quantization, and can be deployed locally via vLLM, SGLang, or Transformers on modest hardware.

Old Zhang's AI Learning

Apr 22, 2026

Model Overview

Qwen3.6-27B is a 27 B dense multimodal model that surpasses the previous open‑source flagship Qwen3.5‑397B‑A17B on most coding benchmarks.

SWE‑bench Verified: 77.2 (vs 76.2 for 3.5‑397B)

SWE‑bench Pro: 53.5 (vs 50.9)

Terminal‑Bench 2.0: 59.3 (vs 52.5)

SkillsBench Avg5: 48.2 (vs 30.0)

GPQA Diamond: 87.8

AIME 2026: 94.1

Compared with the closed‑source Claude 4.5 Opus, the gap on coding benchmarks is 1‑5 points and Terminal‑Bench scores are identical (59.3).
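The margins over the 397 B model are easier to compare once computed. The following is a small sketch that uses only the four paired scores listed above; the variable and function names are illustrative.

```python
# Scores copied from the list above: (Qwen3.6-27B, Qwen3.5-397B-A17B).
SCORES = {
    "SWE-bench Verified": (77.2, 76.2),
    "SWE-bench Pro": (53.5, 50.9),
    "Terminal-Bench 2.0": (59.3, 52.5),
    "SkillsBench Avg5": (48.2, 30.0),
}


def margins(scores):
    """Return {benchmark: 27B score minus 397B score}, rounded to one decimal."""
    return {name: round(small - large, 1) for name, (small, large) in scores.items()}


if __name__ == "__main__":
    for name, delta in margins(SCORES).items():
        print(f"{name}: +{delta:.1f}")
```

The margin ranges from 1.0 point on SWE-bench Verified to 18.2 points on SkillsBench Avg5.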

Key Advantages

Agentic coding: On real-world coding tasks, especially front-end and repository-level work, the model is reported to outperform Claude.

Thinking preservation: Multi‑turn conversations keep reasoning context, avoiding repeated “thinking” in iterative coding.

Architecture

Parameters: 27 B dense (no MoE)

Layers: 64, hidden dimension 5120

Native context length: 262 144 tokens, extendable to 1 010 000 tokens

Hidden layout:

16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))

Multimodal vision encoder supports images, video, and documents

Supports MTP (Multi‑Token Prediction) for inference speedup

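The Gated DeltaNet + Gated Attention mix is more memory-friendly than pure attention for long contexts: only the 16 Gated Attention layers keep a KV cache that grows with context length, while the Gated DeltaNet layers carry a fixed-size state.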
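The hidden layout above can be expanded into a flat layer list to check that it adds up to the stated 64 layers. This is a minimal sketch; the function name and the string labels are illustrative and do not come from the model's configuration files.

```python
def build_layer_pattern(num_groups=16, linear_per_group=3):
    """Expand 16 x (3 x Gated DeltaNet -> 1 x Gated Attention) into a flat list.

    Each entry stands for one mixer layer followed by its FFN.
    """
    pattern = []
    for _ in range(num_groups):
        pattern.extend(["gated_deltanet"] * linear_per_group)
        pattern.append("gated_attention")
    return pattern


layers = build_layer_pattern()
# Indices of the layers that use full (gated) attention.
full_attention_layers = [i for i, kind in enumerate(layers) if kind == "gated_attention"]

if __name__ == "__main__":
    print(len(layers))                # total layer count
    print(full_attention_layers[:4])  # first few full-attention positions
```

The expansion yields 64 layers, of which 16 are Gated Attention layers and 48 are Gated DeltaNet layers.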

FP8 Quantized Version

The FP8 weight file is ~30 GB. Using the Qwen/Qwen3.6-27B-FP8 checkpoint halves memory usage while performance loss is reported as negligible.

Why 27 B Is a Sweet Spot

Easy deployment: Dense architecture works directly with vLLM or SGLang without expert parallelism.

Moderate hardware requirements: BF16 weights need ~54 GB of VRAM (e.g., 2 × A100 40 GB, 1 × H100 80 GB, or 4 × RTX 4090); FP8 weights need ~27 GB (a single 48 GB L40S or A6000). KV cache and activations need additional headroom on top of these figures.

No capability compromise: On most coding benchmarks it outperforms the 397 B model.

Fully open weights: Available on Hugging Face and ModelScope for unrestricted commercial use.
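The VRAM figures above follow from parameter count times bytes per parameter. A rough sketch of that arithmetic, assuming exactly 27 billion parameters and counting weights only (no KV cache, activations, or framework overhead); the helper names are illustrative.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params, dtype, gpu_gb, num_gpus=1):
    """True if the weights alone fit in the combined VRAM of the given GPUs."""
    return weight_memory_gb(num_params, dtype) <= gpu_gb * num_gpus


if __name__ == "__main__":
    params = 27e9  # assumption: the nominal 27 B parameter count
    print(weight_memory_gb(params, "bf16"))
    print(weight_memory_gb(params, "fp8"))
    print(fits(params, "bf16", gpu_gb=40, num_gpus=2))
```

A passing `fits` check is a lower bound only; long contexts need substantially more memory than the weights.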

Local Deployment Options

Officially supported routes: vLLM, SGLang, and Hugging Face Transformers. KTransformers also supports CPU‑GPU heterogeneous inference.

vLLM Deployment (recommended)

uv pip install vllm --torch-backend=auto

vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

With tool‑call (required for coding agents):

vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
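Once the server runs with tool calling enabled, a client can attach tool definitions to a chat request in the standard OpenAI format. The sketch below is illustrative: the `read_file` tool and the helper names are hypothetical, and `demo()` assumes a server from the command above is listening on localhost:8000.

```python
import json

# Hypothetical tool for illustration; it is not part of Qwen or vLLM.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Relative file path"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt, tools, model="Qwen/Qwen3.6-27B"):
    """Build keyword arguments for chat.completions.create with tools attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }


def parse_tool_calls(message):
    """Extract (name, arguments dict) pairs from a response message, if any."""
    calls = getattr(message, "tool_calls", None) or []
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]


def demo():
    """Send one request to a running local server and print any tool calls."""
    # Imported here so the helpers above work without the openai package installed.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        **build_tool_request("Show me setup.py", [READ_FILE_TOOL])
    )
    print(parse_tool_calls(resp.choices[0].message))
```

Call `demo()` only after the server is up; the application is responsible for executing the returned tool calls and sending the results back as `tool` messages.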

Enable MTP (speculative decoding):

vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Text‑only mode (drops vision encoder):

vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --language-model-only

OOM tip: If you run out of memory, lower --max-model-len, but do not reduce the context below 128 K tokens; below that, the model's thinking ability degrades sharply.
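The four launch variants differ only in a few trailing flags, so they can be assembled programmatically. The sketch below uses only the flags shown in this section; the function name is illustrative, rejecting contexts below 128 K is a design choice based on the tip above, and combining several optional flags in one launch is an assumption the article does not confirm.

```python
import shlex

MIN_RECOMMENDED_CONTEXT = 131072  # 128 K tokens, the floor named in the OOM tip


def build_vllm_command(model="Qwen/Qwen3.6-27B", tp=8, max_model_len=262144,
                       tools=False, mtp=False, text_only=False, port=8000):
    """Assemble a `vllm serve` argument list from the flags shown in this section."""
    if max_model_len < MIN_RECOMMENDED_CONTEXT:
        raise ValueError(
            f"max_model_len={max_model_len} is below the recommended "
            f"minimum of {MIN_RECOMMENDED_CONTEXT} tokens"
        )
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tp),
        "--max-model-len", str(max_model_len),
        "--reasoning-parser", "qwen3",
    ]
    if tools:
        cmd += ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"]
    if mtp:
        cmd += ["--speculative-config",
                '{"method":"qwen3_next_mtp","num_speculative_tokens":2}']
    if text_only:
        cmd.append("--language-model-only")
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line with tool calling enabled.
    print(shlex.join(build_vllm_command(tools=True)))
```

The returned list can be passed to `subprocess.run` directly, or printed with `shlex.join` for copying into a shell.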

SGLang Deployment

uv pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3

With tool use:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder

Enable speculative MTP:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Transformers Lightweight Deployment (testing only)

pip install "transformers[serving]"
transformers serve Qwen/Qwen3.6-27B --port 8000 --continuous-batching

This option is suitable for experiments; production should use vLLM or SGLang.

FP8 Quantized Model

Replace the model name with Qwen/Qwen3.6-27B-FP8 and keep the same launch parameters. Example with reduced tensor parallel size:

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --reasoning-parser qwen3

Sampling Parameters (official recommendations)

General thinking mode: temperature=1.0, top_p=0.95, top_k=20, presence_penalty=0.0

Precise coding (e.g., WebDev): temperature=0.6, top_p=0.95, top_k=20

Non-thinking mode: temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5
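These recommendations can be kept as named presets so that each request picks one by name. A minimal sketch, with illustrative preset and function names; `top_k` is moved into `extra_body` because it is not a standard OpenAI parameter, the same way this article's request example passes it.

```python
# Values copied from the recommendations above.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def sampling_kwargs(preset):
    """Split a preset into standard OpenAI arguments plus extra_body for top_k."""
    params = dict(SAMPLING_PRESETS[preset])  # copy so the preset stays untouched
    top_k = params.pop("top_k")
    params["extra_body"] = {"top_k": top_k}
    return params


if __name__ == "__main__":
    print(sampling_kwargs("thinking_coding"))
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("thinking_coding"))`.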

OpenAI‑Compatible API Usage

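from openai import OpenAI

# api_key is a placeholder for a local server started without authentication.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Type \"I love Qwen3.6\" backwards"}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    # Recommended sampling parameters for general thinking mode.
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    # top_k is not a standard OpenAI parameter, so it is passed through extra_body.
    extra_body={"top_k": 20},
)
print(resp)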

When thinking mode is enabled, responses include <think>...</think> blocks; to suppress them, switch to non-thinking mode and use the non-thinking sampling parameters.
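If the raw response text still contains the <think>...</think> blocks, they can be separated from the final answer on the client side. The helper below is a hypothetical sketch, not part of any library; a server started with a reasoning parser may already return the reasoning in a separate field, in which case this step is unnecessary.

```python
import re

# Non-greedy match so that several <think> blocks are captured separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from raw model text containing <think> blocks."""
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>Reverse the characters one by one.</think>\n6.3newQ evol I"
    print(split_thinking(raw))
```

As for switching thinking off, earlier Qwen3-series model cards do this through the chat template, by passing `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` in the request; whether the same switch applies to this model is an assumption to verify against its model card.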

Multimodal Request Example

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://your-image-url.jpg"}},
        {"type": "text", "text": "How many circles are in this image?"}
    ]
}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)

For video input, set the type field to video_url and pass the video under a video_url key in place of image_url.
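Image and video requests share the same message shape, so one helper can build both. This is a sketch with a hypothetical function name; it assumes the video part mirrors the image part, with a nested `video_url` object holding the URL, and the example URL is a placeholder.

```python
def build_media_message(kind, url, prompt):
    """Build one user message with a media part followed by a text part.

    kind is "image" or "video"; the part type becomes image_url or video_url.
    """
    if kind not in ("image", "video"):
        raise ValueError(f"unsupported media kind: {kind!r}")
    part_type = f"{kind}_url"
    return {
        "role": "user",
        "content": [
            {"type": part_type, part_type: {"url": url}},
            {"type": "text", "text": prompt},
        ],
    }


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(build_media_message("video", "https://example.com/clip.mp4", "Describe this clip."))
```

The returned dictionary goes into the `messages` list of a chat completion request in place of the hand-written message.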

Pros and Cons

Pros:

27 B dense size enables friendly deployment.

Agentic coding ability surpasses the 397 B MoE model.

Native 262 K context, extendable to 1 M tokens.

Multimodal + text capabilities in a single model.

FP8 quantized version halves memory requirements.

Full‑stack support: vLLM, SGLang, Transformers, KTransformers.

Cons:

Very hard reasoning tasks (e.g., HLE) still favor the 397 B model or Claude 4.5 Opus.

Default thinking mode adds latency; latency‑sensitive production may need to disable it.

Context length should not be reduced below 128 K, otherwise thinking degrades.
