Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy
The new Claude‑Opus‑4.6 distilled Qwen3.5‑v2 keeps code‑generation accuracy while cutting reasoning length by 24% and boosting per‑token correctness by 31.6%, offering a noticeable speed and cost advantage for local LLM deployment despite a 7.2% drop on MMLU‑Pro.
What’s new in v2?
The upgrade targets efficiency rather than raw accuracy: the reasoning chain is 24% shorter and per‑token correctness rises by 31.6%, while HumanEval pass@1 is virtually unchanged at 96.91% (v1 scored 96.95%). The only notable regression is a 7.2% drop on MMLU‑Pro.
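The two headline numbers appear to be closely linked. The article does not define "per‑token correctness", so the sketch below assumes it means accuracy divided by the number of reasoning tokens. Under that assumption, a 24% shorter chain at unchanged accuracy implies a gain of 1/0.76 − 1, or about 31.6%.

```python
# Assumption: "per-token correctness" = accuracy / reasoning tokens.
# The article does not define the metric; this is only a consistency check.

def per_token_gain(acc_old: float, acc_new: float, length_reduction: float) -> float:
    """Relative change in accuracy-per-token when the chain shrinks by length_reduction."""
    length_ratio = 1.0 - length_reduction  # new length / old length
    return (acc_new / acc_old) / length_ratio - 1.0

# Accuracy held constant, chain 24% shorter (the 24% figure is from the article).
gain_same_accuracy = per_token_gain(acc_old=1.0, acc_new=1.0, length_reduction=0.24)
print(f"{gain_same_accuracy:.1%}")  # 31.6%

# Same check using the reported HumanEval pass@1 scores for v1 and v2.
gain_humaneval = per_token_gain(acc_old=96.95, acc_new=96.91, length_reduction=0.24)
print(f"{gain_humaneval:.1%}")
```

If the author measured the metric differently (for example, averaged over several benchmarks), the exact figure would not follow from this formula.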
How the improvement was achieved
Jackrong fine‑tuned Qwen3.5‑27B using Unsloth + LoRA SFT with a Response‑Only Training regime, which computes the loss only on the assistant’s output, starting at the thinking segment, and ignores the prompt tokens. The key ingredient is a curated set of ≈14,000 Claude‑4.6 Opus‑style general‑reasoning samples (math, logic, text) that deliberately excludes code questions.
This design teaches the model a more efficient “thinking scaffold”. The resulting reasoning pattern looks like:
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency.
Compared with v1’s verbose chain‑of‑thought, v2 behaves like an experienced engineer who outlines first and then proceeds, yielding structured, concise output.
Training details
Base model: Qwen3.5‑27B
Framework: Unsloth + LoRA SFT
Method: Response‑Only Training (masking on "<|im_start|>assistant\n<think>")
Data volume: ~14k high‑quality reasoning trajectories
Datasets used:
Opus‑4.6‑Reasoning‑3000x‑filtered
claude‑opus‑4.6‑10000x
claude‑4.5‑opus‑high‑reasoning‑250x
Qwen3.5‑reasoning‑700x
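To make the Response‑Only Training idea concrete, here is a minimal sketch of label masking. It is not the author's training code: the token ids are invented for illustration, the helper name `mask_labels` is hypothetical, and only the first occurrence of the marker is handled (a multi‑turn sample would need more care).

```python
# Minimal sketch of response-only label masking (not the author's actual code).
# Token ids are made up for illustration; a real run would use the model's tokenizer.

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def mask_labels(input_ids: list[int], response_marker: list[int]) -> list[int]:
    """Return labels where everything up to and including the marker is ignored."""
    m = len(response_marker)
    for start in range(len(input_ids) - m + 1):
        if input_ids[start:start + m] == response_marker:
            cut = start + m
            return [IGNORE_INDEX] * cut + input_ids[cut:]
    # No marker found: nothing to supervise in this sample.
    return [IGNORE_INDEX] * len(input_ids)

# Hypothetical ids: 1-3 stand for the user turn, 90 and 91 for the
# "<|im_start|>assistant\n<think>" marker, 7-9 for the assistant's output.
tokens = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_labels(tokens, response_marker=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 7, 8, 9]
```

Only the positions after the marker keep their token ids as labels, so the gradient comes from the assistant's output and not from the prompt.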
Base Model (Qwen3.5‑27B)
│
▼
Qwen3.5‑27B fine‑tuned with Unsloth
│
▼
Supervised Fine‑Tuning (SFT) + LoRA
(Response‑Only Training masked on "<|im_start|>assistant\n<think>")
│
▼
Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2
Trade‑offs
The speed gains come at the cost of general knowledge reasoning: MMLU‑Pro accuracy falls by 7.2%, which the author attributes to the SFT data focusing on generic reasoning rather than long‑context or multi‑step tasks.
Running the model locally
Deployment requirements are unchanged: a single 4‑bit Qwen3.5‑27B can run on one RTX 4090. The GGUF checkpoint is available on HuggingFace (Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2‑GGUF) and works with LM Studio, llama.cpp, or Ollama.
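A quick back‑of‑the‑envelope check shows why a 4‑bit 27B model fits on one card. The sketch assumes 27 billion parameters (taken from the model name) and the RTX 4090's 24 GB of VRAM; the 4.5 bits‑per‑weight figure is an illustrative allowance for quantization overhead, not a measured value for this GGUF file.

```python
# Back-of-the-envelope weight memory for a quantized model.
# Assumptions: 27e9 parameters (from the model name), 24 GB of VRAM on an RTX 4090,
# and an illustrative 4.5 bits per weight to allow for quantization overhead.

RTX_4090_VRAM_GB = 24.0

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

ideal = weight_gb(27e9, 4.0)   # exactly 4 bits per weight
padded = weight_gb(27e9, 4.5)  # hypothetical overhead for scales and higher-precision layers
print(f"{ideal:.1f} GB, {padded:.1f} GB")  # 13.5 GB, 15.2 GB
print(f"headroom: {RTX_4090_VRAM_GB - padded:.1f} GB")
```

Whatever remains after the weights has to cover the KV cache and runtime buffers, so the usable context length depends on the actual quantization variant chosen.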
In the author’s tests, v1 reached ~46 tokens/s on a 4090. Because v2’s reasoning chain is 24% shorter, it generates fewer tokens per answer, so responses complete sooner on the same hardware.
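As a rough illustration of what that means for wall‑clock time, the sketch below assumes decode speed stays at the ~46 tokens/s reported for v1 and that the 24% reduction applies to the whole generated sequence. The 2,000‑token response length is a hypothetical value chosen only for the example.

```python
# Rough latency estimate, assuming decode speed stays at the ~46 tokens/s
# reported for v1 and that the 24% reduction applies to the whole output.

def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to decode n_tokens at a constant generation speed."""
    return n_tokens / tokens_per_second

V1_TOKENS = 2000  # hypothetical response length, for illustration only
v2_tokens = round(V1_TOKENS * (1 - 0.24))

t1 = generation_seconds(V1_TOKENS, 46.0)
t2 = generation_seconds(v2_tokens, 46.0)
print(f"v1: {t1:.1f}s  v2: {t2:.1f}s  saved: {1 - t2 / t1:.0%}")
```

At a fixed tokens/s rate the time saved is proportional to the tokens saved, so the saving stays at about 24% regardless of the response length picked.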
Bottom line
For local deployment scenarios where inference speed is the bottleneck, v2 delivers the same coding performance (HumanEval 96.91%) while using fewer tokens, cutting cost and latency, albeit with a modest loss in broad knowledge tasks.
Code accuracy unchanged: HumanEval 96.91%
Reasoning chain shortened 24% → faster generation
Per‑token correctness +31.6%
General knowledge (MMLU‑Pro) down 7.2%
Use the model when you prioritize fast, reliable code generation or logical reasoning over broad conversational ability.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
