Qwopus 3.6‑27B‑v2: Trace‑Inversion Distillation Cuts Token Use by 36% and Boosts Accuracy
The Qwopus 3.6‑27B‑v2 model reconstructs full step‑by‑step reasoning from compressed Claude outputs using a Trace‑Inverter, creates two high‑quality SFT datasets, and achieves 35.9% token savings, a 2.57‑point accuracy gain on MMLU‑Pro, 75.25% success on SWE‑bench, while running on a single consumer‑grade RTX 5090.
Overview
Qwopus 3.6‑27B‑v2 is a distilled version of Alibaba's Qwen 3.6‑27B dense model that adds inference‑enhancing SFT. The core idea is a Trace‑Inverter‑4B (based on Qwen‑3‑4B‑Instruct) that reverses the compressed “reasoning bubble” produced by commercial closed‑source models (Claude, GPT) back into a complete, learnable chain‑of‑thought (CoT). The recovered CoT is wrapped in a <think> tag and combined with the original prompt/response to form SFT samples.
Data Generation
Two datasets were produced:
Three‑Stage SFT Curriculum
Phase 1: Format Inception (< 4096 tokens, solidify format)</code>
<code>Phase 2: Complexity Expansion (4096‑8192 tokens, medium‑complexity reasoning)</code>
<code>Phase 3: Long‑Context SFT (8192‑32K tokens, long context + 10% replay)This progressive increase in context length and task complexity avoids long‑context failures.
Model Features
27B dense transformer with native 32K‑128K long‑context support.
Vision capability via mmproj.gguf and tool‑use/function‑calling.
Strict <think> tag format for downstream RL.
Cross‑source SFT alignment + multi‑teacher distillation to close capability gaps.
Training used the Unsloth framework.
Core Innovation: Trace‑Inversion
Traditional distillation copies the compressed output (the “bubble”) directly, causing logical gaps and poor generalisation. Trace‑Inversion first feeds the compressed output and answer into the Trace‑Inverter‑4B, which reconstructs a continuous CoT chain, then embeds it in <think> for SFT. The student model learns the full derivation instead of jump‑step conclusions.
The author calls this “Negentropy Reconstruction”, i.e., restoring the lost intermediate steps.
Performance Highlights
Token efficiency: average answer tokens drop from 1,433.3 to 918.7 (‑35.9%).
System‑level token overhead reduced by 14.2% (2,511.0 → 2,155.8).
Correct answers per 10 k tokens rise from 3.98 to 4.64 (+16.6%).
Thought‑chain length shrinks from 5,169.4 characters to 2,370.0 (‑54.1%).
MMLU‑Pro (350 questions, 7 categories)
Qwen 3.6‑27B: 297/350 correct (84.86%).
Qwopus 3.6‑27B‑v2: 306/350 correct (87.43%), a +2.57‑point gain.
Largest gains in Business, Physics, Chemistry; slight drops in Math and Health.
SWE‑bench
152/202 problems solved → 75.25% success rate.
Deployment & Throughput
GGUF quantizations from IQ4_XS to Q8_0 are provided, plus mmproj.gguf for vision.
On RTX 5090 (single card) with Q5_K_M quantization:
Despite higher throughput, the author recommends the dense 27B for complex agents, long context, and code tasks because its per‑token reasoning depth is stronger.
Running with llama.cpp
./llama-server \
-m Qwopus3.6-27B-v2-Q5_K_M.gguf \
--mmproj mmproj.gguf \
-c 32768 \
--jinja \
--temp 1.0For agent tasks, keep --temp 1.0; using a low temperature (greedy) can cause infinite loops in <think> blocks.
MTP Acceleration
The author open‑sourced Multi‑Token Prediction (MTP) heads for Qwen series.
Qwopus 3.6‑27B‑v2‑MTP runs 1.66× faster than the official Qwen 3.6 inference.
Training Data Sources
Two public datasets hosted on the author’s Hugging Face page, together providing 14,000 Trace‑Inversion samples.
The author emphasizes quality over quantity for this modest‑size dataset.
Author Reflections
Pros: novel trace‑inversion idea, high token efficiency, strong SWE‑bench results, complete ecosystem (full‑range GGUF, MTP, vision support).
Cons: evaluation limited to MMLU‑Pro subset, no third‑party reproducibility, experimental release without full safety assessment, unknown reconstruction accuracy of the Trace‑Inverter, modest regressions in Math and Health categories.
Target Audience
Researchers and developers who want a local 27B reasoning model.
Developers running agent or code tasks that need long context and tool use.
Anyone interested in distillation techniques and trace‑inversion research.
Users with consumer‑grade GPUs such as RTX 5090/4090 or Mac Studio.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
