Distilling Claude Opus: Qwen 9B Coding Model Runs on Consumer GPUs – Real‑World Benchmarks

The Qwopus3.5‑9B‑Coder model, fine‑tuned for agentic coding, tool calling and logical reasoning, offers three formats (Safetensors, GGUF, GGUF+MTP), runs on a 16 GB Mac mini via LM‑Studio, achieves up to 35% throughput gain with MTP, scores 85 on HermesAgent‑20, 100 on ToolCall‑15, and 53.89% on SWE‑bench, matching Claude Opus 4.6 in a 31‑tool adversarial test while highlighting its training tricks and current limitations.

Old Zhang's AI Learning
Old Zhang's AI Learning
Old Zhang's AI Learning
Distilling Claude Opus: Qwen 9B Coding Model Runs on Consumer GPUs – Real‑World Benchmarks

Qwopus3.5‑9B‑Coder Overview

Jackrong released the Qwopus3.5‑9B‑Coder series, a 9‑billion‑parameter model specially optimized for Agentic Coding, Tool Calling, and logical reasoning. Three distribution formats are provided to suit different use cases.

Available Versions

Qwopus3.5‑9B‑Coder – Safetensors – intended as a research/fine‑tuning base.

Qwopus3.5‑9B‑Coder‑GGUF – GGUF quantized – for local deployment with Ollama or LM Studio.

Qwopus3.5‑9B‑Coder‑MTP‑GGUF – GGUF + MTP – designed for extreme speed in local deployment.

Local Deployment Experience

Using a 16 GB Mac Mini and LM‑Studio 0.4.16, the author selected the MTP‑GGUF version. The model loads in 7.14 GB of RAM, runs at 17 tokens/s (API calls around 12 tokens/s), and was evaluated on 25 real‑world programming tasks covering nine frequent development categories.

# Official Qwen3.5 sampling parameters balancing reasoning and creativity
Temperature: 1.0
Top-p: 0.95

Enabling MTP/Speculative Decoding in the settings is required.

MTP vs Base inference mechanism comparison
MTP vs Base inference mechanism comparison

Core Innovation: Trace Inversion

Commercial LLMs compress their reasoning chains, exposing only a condensed "thinking bubble". Trace Inversion reconstructs the full chain.

Train a proxy model (Trace‑Inverter‑4B) : Use open‑source GLM‑5.1 and DS‑V4 full‑chain data, compress them with Qwen‑3‑235B, and teach the small model to recover the full chain from the bubble.

Reverse‑engineer Claude‑4.7‑Max : Combine Claude’s compressed output with the final answer, then use Trace‑Inverter‑4B to rebuild the complete chain‑of‑thought.

Merge training data : Insert the recovered chain inside <think> tags and concatenate with the original Q&A pairs.

Second Innovation: Real Agent Trace Training

Approximately 10 000 high‑quality multi‑turn Tool Calling dialogues from GLM‑5.1 are used.

Each record contains a <think> reasoning segment and the actual tool execution result.

Scenarios cover terminal operations, code debugging, browser automation, file manipulation, etc.

Three‑Stage Curriculum Learning

The training follows a progressive curriculum: first stabilise the format, then increase task complexity, and finally reinforce long context handling while replaying short samples to prevent capability drift.

Training pipeline screenshot
Training pipeline screenshot

Benchmark Results

HermesAgent‑20 (complex agent tasks) – overall score 85, beating the original Qwen3.5‑9B (71) by 14 points.

ToolCall‑15 (tool calling stability) – perfect score 100, matching the original Qwen3.5‑9B.

BugFind‑15 (code bug fixing) – score 79, higher than comparable models.

SWE‑bench Verified (repository‑level coding) – 53.89%, surpassing Google Gemma‑4‑31B‑it (52%) and outperforming many 9B models, though still below Claude Opus 4.5 (80.9%).

Capability test report screenshot
Capability test report screenshot

Tool Calling vs Claude Opus

In a community adversarial test with 31 tools, Qwopus3.5‑9B‑Coder achieved 100% tool recall and correctly selected 27 out of 28 tools (96%), exactly matching Claude Opus 4.6.

Running the Model

With LM Studio, the model can be downloaded directly. For llama‑cpp, enable YaRN/RoPE scaling for contexts larger than 32 K:

./llama-server \
  -m model.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768

To enable vision and tool calling, place the mmproj.gguf file from the GGUF repository alongside the model file.

Conclusion

Tool Calling capability is outstanding, matching Claude Opus 4.6.

Agentic tasks receive a far‑higher overall score than peer 9B models.

MTP version delivers ~35% throughput improvement.

Only 16 GB of RAM is required for local execution.

Training methodology (Trace Inversion, real agent traces, three‑stage curriculum) is valuable for future research.

The model is vertically fine‑tuned, so some general‑purpose abilities may have regressed.

It lacks comprehensive general‑domain evaluation.

SWE‑bench score (53%) still trails top commercial models (≈81%).

MTP version can be unstable on short‑text edge cases.

It is an experimental community release intended for research exploration only.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Qwentool callingAgentic CodingLLM BenchmarkTrace InversionQwopus
Old Zhang's AI Learning
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.