Distilling Claude Opus: Qwen 9B Coding Model Runs on Consumer GPUs – Real‑World Benchmarks
The Qwopus3.5‑9B‑Coder model, fine‑tuned for agentic coding, tool calling and logical reasoning, offers three formats (Safetensors, GGUF, GGUF+MTP), runs on a 16 GB Mac mini via LM‑Studio, achieves up to 35% throughput gain with MTP, scores 85 on HermesAgent‑20, 100 on ToolCall‑15, and 53.89% on SWE‑bench, matching Claude Opus 4.6 in a 31‑tool adversarial test while highlighting its training tricks and current limitations.
Qwopus3.5‑9B‑Coder Overview
Jackrong released the Qwopus3.5‑9B‑Coder series, a 9‑billion‑parameter model specially optimized for Agentic Coding, Tool Calling, and logical reasoning. Three distribution formats are provided to suit different use cases.
Available Versions
Qwopus3.5‑9B‑Coder – Safetensors – intended as a research/fine‑tuning base.
Qwopus3.5‑9B‑Coder‑GGUF – GGUF quantized – for local deployment with Ollama or LM Studio.
Qwopus3.5‑9B‑Coder‑MTP‑GGUF – GGUF + MTP – designed for extreme speed in local deployment.
Local Deployment Experience
Using a 16 GB Mac Mini and LM‑Studio 0.4.16, the author selected the MTP‑GGUF version. The model loads in 7.14 GB of RAM, runs at 17 tokens/s (API calls around 12 tokens/s), and was evaluated on 25 real‑world programming tasks covering nine frequent development categories.
# Official Qwen3.5 sampling parameters balancing reasoning and creativity
Temperature: 1.0
Top-p: 0.95Enabling MTP/Speculative Decoding in the settings is required.
Core Innovation: Trace Inversion
Commercial LLMs compress their reasoning chains, exposing only a condensed "thinking bubble". Trace Inversion reconstructs the full chain.
Train a proxy model (Trace‑Inverter‑4B) : Use open‑source GLM‑5.1 and DS‑V4 full‑chain data, compress them with Qwen‑3‑235B, and teach the small model to recover the full chain from the bubble.
Reverse‑engineer Claude‑4.7‑Max : Combine Claude’s compressed output with the final answer, then use Trace‑Inverter‑4B to rebuild the complete chain‑of‑thought.
Merge training data : Insert the recovered chain inside <think> tags and concatenate with the original Q&A pairs.
Second Innovation: Real Agent Trace Training
Approximately 10 000 high‑quality multi‑turn Tool Calling dialogues from GLM‑5.1 are used.
Each record contains a <think> reasoning segment and the actual tool execution result.
Scenarios cover terminal operations, code debugging, browser automation, file manipulation, etc.
Three‑Stage Curriculum Learning
The training follows a progressive curriculum: first stabilise the format, then increase task complexity, and finally reinforce long context handling while replaying short samples to prevent capability drift.
Benchmark Results
HermesAgent‑20 (complex agent tasks) – overall score 85, beating the original Qwen3.5‑9B (71) by 14 points.
ToolCall‑15 (tool calling stability) – perfect score 100, matching the original Qwen3.5‑9B.
BugFind‑15 (code bug fixing) – score 79, higher than comparable models.
SWE‑bench Verified (repository‑level coding) – 53.89%, surpassing Google Gemma‑4‑31B‑it (52%) and outperforming many 9B models, though still below Claude Opus 4.5 (80.9%).
Tool Calling vs Claude Opus
In a community adversarial test with 31 tools, Qwopus3.5‑9B‑Coder achieved 100% tool recall and correctly selected 27 out of 28 tools (96%), exactly matching Claude Opus 4.6.
Running the Model
With LM Studio, the model can be downloaded directly. For llama‑cpp, enable YaRN/RoPE scaling for contexts larger than 32 K:
./llama-server \
-m model.gguf \
--ctx-size 131072 \
--rope-scaling yarn \
--rope-scale 4 \
--yarn-orig-ctx 32768To enable vision and tool calling, place the mmproj.gguf file from the GGUF repository alongside the model file.
Conclusion
Tool Calling capability is outstanding, matching Claude Opus 4.6.
Agentic tasks receive a far‑higher overall score than peer 9B models.
MTP version delivers ~35% throughput improvement.
Only 16 GB of RAM is required for local execution.
Training methodology (Trace Inversion, real agent traces, three‑stage curriculum) is valuable for future research.
The model is vertically fine‑tuned, so some general‑purpose abilities may have regressed.
It lacks comprehensive general‑domain evaluation.
SWE‑bench score (53%) still trails top commercial models (≈81%).
MTP version can be unstable on short‑text edge cases.
It is an experimental community release intended for research exploration only.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
