Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy

The new Claude‑Opus‑4.6 distilled Qwen3.5‑v2 keeps code‑generation accuracy while cutting reasoning length by 24% and boosting per‑token correctness by 31.6%, offering a noticeable speed and cost advantage for local LLM deployment despite a 7.2% drop on MMLU‑Pro.

Old Zhang's AI Learning

What’s new in v2?

The upgrade targets speed and efficiency rather than raw accuracy: the reasoning chain is 24% shorter and correctness per generated token rises by 31.6%, while HumanEval pass@1 is essentially unchanged at 96.91% (v1: 96.95%). The only notable regression is a 7.2% drop on MMLU‑Pro.
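The 24% and 31.6% figures are closely related. The sketch below assumes that "per‑token correctness" means accuracy divided by reasoning length; the article does not define the metric, so this is one plausible reading rather than the author's stated formula. Under that assumption, holding accuracy constant while shortening the chain by 24% gives the reported gain almost exactly.

```python
# Assumption: "per-token correctness" = accuracy / reasoning length.
# The function name and this definition are illustrative, not from the article.
def per_token_gain(accuracy_ratio: float, length_ratio: float) -> float:
    """Relative change in accuracy-per-token, given v2/v1 ratios."""
    return accuracy_ratio / length_ratio - 1.0

# Accuracy held constant (ratio 1.0), reasoning chain 24% shorter (ratio 0.76).
gain_flat = per_token_gain(1.0, 0.76)

# Same calculation using the reported HumanEval scores (96.91 vs 96.95).
gain_humaneval = per_token_gain(96.91 / 96.95, 0.76)

print(f"{gain_flat:.1%}")       # 31.6%
print(f"{gain_humaneval:.1%}")  # 31.5%
```

If the metric is defined this way, the per‑token gain is largely a restatement of the shorter chain rather than an independent improvement.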

How the improvement was achieved

Jackrong fine‑tuned Qwen3.5‑27B using Unsloth with LoRA SFT under a Response‑Only Training regime: the prompt is masked out and the loss is computed only on the assistant’s output, starting at the <think> segment. The key ingredient is a curated set of ≈14,000 Claude‑4.6‑Opus‑style general‑reasoning samples (math, logic, text) that deliberately excludes code questions.

This design teaches the model a more efficient “thinking scaffold”. The resulting reasoning pattern looks like:

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency.

Compared with v1’s verbose chain‑of‑thought, v2 behaves like an experienced engineer who outlines first and then proceeds, yielding a structured, concise output.

Training details

Base model: Qwen3.5‑27B

Framework: Unsloth + LoRA SFT

Method: Response‑Only Training (masking on "<|im_start|>assistant\n<think>")

Data volume: ~14k high‑quality reasoning trajectories

Datasets used:

Opus‑4.6‑Reasoning‑3000x‑filtered

claude‑opus‑4.6‑10000x

claude‑4.5‑opus‑high‑reasoning‑250x

Qwen3.5‑reasoning‑700x
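As a quick sanity check on the "~14k" figure, the sketch below reads the "NNNx" suffix in each dataset name as a nominal sample count. That interpretation is an assumption based on the names alone; the actual row counts after filtering may differ.

```python
import re

# Dataset names as listed above (written with plain ASCII hyphens).
dataset_names = [
    "Opus-4.6-Reasoning-3000x-filtered",
    "claude-opus-4.6-10000x",
    "claude-4.5-opus-high-reasoning-250x",
    "Qwen3.5-reasoning-700x",
]

def nominal_size(name: str) -> int:
    """Extract the number in front of the 'x' suffix, e.g. '3000x' -> 3000."""
    match = re.search(r"(\d+)x", name)
    if match is None:
        raise ValueError(f"no size suffix in {name!r}")
    return int(match.group(1))

nominal_sizes = {name: nominal_size(name) for name in dataset_names}
total = sum(nominal_sizes.values())
print(total)  # 13950
```

The nominal total of 13,950 is consistent with the stated volume of roughly 14k trajectories.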

Base Model (Qwen3.5‑27B)
  │
  ▼
Qwen3.5‑27B fine‑tuned with Unsloth
  │
  ▼
Supervised Fine‑Tuning (SFT) + LoRA
(Response‑Only Training masked on "<|im_start|>assistant\n<think>")
  │
  ▼
Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2
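The core idea of response‑only training can be illustrated without any training framework. The following is a minimal sketch of the general label‑masking technique, not the author's or Unsloth's actual code: every token up to and including the response marker gets the label -100, which PyTorch's cross‑entropy loss ignores by default, so only the assistant's output contributes to the loss. The function name and the token IDs are hypothetical, and the sketch handles a single‑turn sample only.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, response_marker_ids):
    """Return labels in which every token up to and including the first
    occurrence of the response marker is replaced by IGNORE_INDEX."""
    labels = list(input_ids)
    m = len(response_marker_ids)
    for start in range(len(input_ids) - m + 1):
        if input_ids[start:start + m] == response_marker_ids:
            end = start + m
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Marker not found: nothing in this sample is supervised.
    return [IGNORE_INDEX] * len(input_ids)

# Hypothetical token IDs: 11-13 are the user prompt, 901-902 stand in for
# the tokenized marker "<|im_start|>assistant\n<think>", 21-23 are the response.
input_ids = [11, 12, 13, 901, 902, 21, 22, 23]
marker = [901, 902]

labels = mask_prompt_labels(input_ids, marker)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```

Because the prompt tokens carry no loss, the model is never trained to reproduce the questions, only the reasoning and answer that follow the marker.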

Trade‑offs

The speed gains come at the cost of general knowledge reasoning: MMLU‑Pro accuracy falls by 7.2%, which the author attributes to the SFT data focusing on generic reasoning rather than long‑context or multi‑step tasks.

Running the model locally

Deployment requirements are unchanged: a 4‑bit quantized Qwen3.5‑27B fits on a single RTX 4090. The GGUF checkpoint is available on Hugging Face (Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2‑GGUF) and works with LM Studio, llama.cpp, or Ollama.
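A back‑of‑the‑envelope estimate shows why a 4‑bit 27B model fits on a 24 GB card. The sketch below counts weights only. The 4.8 bits‑per‑weight value is an assumed illustrative figure for quantization formats that store scaling factors alongside the weights; it is not taken from the article, and real usage also includes the KV cache and runtime buffers.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

RTX_4090_VRAM_GB = 24  # memory size of the card

ideal_4bit = weight_gb(27, 4.0)     # exactly 4 bits per weight
with_overhead = weight_gb(27, 4.8)  # assumed effective bits per weight

print(f"{ideal_4bit:.1f} GB")     # 13.5 GB
print(f"{with_overhead:.1f} GB")  # 16.2 GB
print(f"{RTX_4090_VRAM_GB - with_overhead:.1f} GB left for KV cache and buffers")
```

Either way the weights leave several gigabytes of headroom, though long contexts will eat into that margin.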

In the author’s tests, v1 reached ~46 tokens/s on a 4090. Because v2’s reasoning chain is 24% shorter, it needs fewer tokens to reach an answer, so end‑to‑end response time drops on the same hardware even if decoding speed per token stays the same.
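To put the 24% figure in wall‑clock terms, the sketch below assumes that v2 decodes at the same ~46 tokens/s as v1 (same base model and quantization) and uses a hypothetical 1,000‑token reasoning chain for v1. Both are illustrative assumptions, not measurements from the article.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode num_tokens at a constant decoding rate."""
    return num_tokens / tokens_per_second

TOKENS_PER_SECOND = 46.0    # v1 figure reported for an RTX 4090; assumed for v2 too
V1_REASONING_TOKENS = 1000  # hypothetical chain length, for illustration only
V2_REASONING_TOKENS = round(V1_REASONING_TOKENS * (1 - 0.24))  # 24% shorter

v1_seconds = generation_seconds(V1_REASONING_TOKENS, TOKENS_PER_SECOND)
v2_seconds = generation_seconds(V2_REASONING_TOKENS, TOKENS_PER_SECOND)

print(f"v1: {v1_seconds:.1f} s")  # v1: 21.7 s
print(f"v2: {v2_seconds:.1f} s")  # v2: 16.5 s
print(f"saved: {v1_seconds - v2_seconds:.1f} s per response")
```

At a fixed decoding rate the time saved scales linearly with chain length, so longer reasoning tasks benefit the most in absolute terms.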

Bottom line

For local deployment scenarios where inference speed is the bottleneck, v2 delivers the same coding performance (HumanEval 96.91%) while using fewer tokens, cutting cost and latency, albeit with a modest loss in broad knowledge tasks.

Code accuracy unchanged: HumanEval 96.91%

Reasoning chain shortened 24% → faster generation

Per‑token correctness +31.6%

General knowledge (MMLU‑Pro) down 7.2%

Use the model when you prioritize fast, reliable code generation or logical reasoning over broad conversational ability.
