11 min read

Qwopus 3.6‑27B‑v2: Trace‑Inversion Distillation Cuts Token Use by 36% and Boosts Accuracy

The Qwopus 3.6‑27B‑v2 model reconstructs full step‑by‑step reasoning from compressed Claude outputs using a Trace‑Inverter, creates two high‑quality SFT datasets, and achieves 35.9% token savings, a 2.57‑point accuracy gain on MMLU‑Pro, 75.25% success on SWE‑bench, while running on a single consumer‑grade RTX 5090.

Old Zhang's AI Learning

May 23, 2026

Qwopus 3.6‑27B‑v2: Trace‑Inversion Distillation Cuts Token Use by 36% and Boosts Accuracy

Overview

Qwopus 3.6‑27B‑v2 is a distilled version of Alibaba's Qwen 3.6‑27B dense model that adds inference‑enhancing SFT. The core idea is a Trace‑Inverter‑4B (based on Qwen‑3‑4B‑Instruct) that reverses the compressed “reasoning bubble” produced by commercial closed‑source models (Claude, GPT) back into a complete, learnable chain‑of‑thought (CoT). The recovered CoT is wrapped in a <think> tag and combined with the original prompt/response to form SFT samples.

Data Generation

Two datasets were produced:

Three‑Stage SFT Curriculum

Phase 1: Format Inception      (< 4096 tokens, solidify format)</code>
<code>Phase 2: Complexity Expansion (4096‑8192 tokens, medium‑complexity reasoning)</code>
<code>Phase 3: Long‑Context SFT   (8192‑32K tokens, long context + 10% replay)

This progressive increase in context length and task complexity avoids long‑context failures.

Model Features

27B dense transformer with native 32K‑128K long‑context support.

Vision capability via mmproj.gguf and tool‑use/function‑calling.

Strict <think> tag format for downstream RL.

Cross‑source SFT alignment + multi‑teacher distillation to close capability gaps.

Training used the Unsloth framework.

Core Innovation: Trace‑Inversion

Traditional distillation copies the compressed output (the “bubble”) directly, causing logical gaps and poor generalisation. Trace‑Inversion first feeds the compressed output and answer into the Trace‑Inverter‑4B, which reconstructs a continuous CoT chain, then embeds it in <think> for SFT. The student model learns the full derivation instead of jump‑step conclusions.

The author calls this “Negentropy Reconstruction”, i.e., restoring the lost intermediate steps.

Performance Highlights

Token efficiency: average answer tokens drop from 1,433.3 to 918.7 (‑35.9%).

System‑level token overhead reduced by 14.2% (2,511.0 → 2,155.8).

Correct answers per 10 k tokens rise from 3.98 to 4.64 (+16.6%).

Thought‑chain length shrinks from 5,169.4 characters to 2,370.0 (‑54.1%).

MMLU‑Pro (350 questions, 7 categories)

Qwen 3.6‑27B: 297/350 correct (84.86%).

Qwopus 3.6‑27B‑v2: 306/350 correct (87.43%), a +2.57‑point gain.

Largest gains in Business, Physics, Chemistry; slight drops in Math and Health.

SWE‑bench

152/202 problems solved → 75.25% success rate.

Deployment & Throughput

GGUF quantizations from IQ4_XS to Q8_0 are provided, plus mmproj.gguf for vision.

On RTX 5090 (single card) with Q5_K_M quantization:

Despite higher throughput, the author recommends the dense 27B for complex agents, long context, and code tasks because its per‑token reasoning depth is stronger.

Running with llama.cpp

./llama-server \
    -m Qwopus3.6-27B-v2-Q5_K_M.gguf \
    --mmproj mmproj.gguf \
    -c 32768 \
    --jinja \
    --temp 1.0

For agent tasks, keep --temp 1.0; using a low temperature (greedy) can cause infinite loops in <think> blocks.

MTP Acceleration

The author open‑sourced Multi‑Token Prediction (MTP) heads for Qwen series.

Qwopus 3.6‑27B‑v2‑MTP runs 1.66× faster than the official Qwen 3.6 inference.

Training Data Sources

Two public datasets hosted on the author’s Hugging Face page, together providing 14,000 Trace‑Inversion samples.

The author emphasizes quality over quantity for this modest‑size dataset.

Author Reflections

Pros: novel trace‑inversion idea, high token efficiency, strong SWE‑bench results, complete ecosystem (full‑range GGUF, MTP, vision support).

Cons: evaluation limited to MMLU‑Pro subset, no third‑party reproducibility, experimental release without full safety assessment, unknown reconstruction accuracy of the Trace‑Inverter, modest regressions in Math and Health categories.

Target Audience

Researchers and developers who want a local 27B reasoning model.

Developers running agent or code tasks that need long context and tool use.

Anyone interested in distillation techniques and trace‑inversion research.

Users with consumer‑grade GPUs such as RTX 5090/4090 or Mac Studio.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Qwen model distillation SWE-bench token efficiency GGUF MMLU Trace Inversion

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.