Qwen3.6-35B-A3B NVFP4: A Stable, Highly Compressed Quantized Model
NVIDIA's NVFP4 quantization cuts Qwen3.6-35B-A3B's memory footprint roughly threefold with almost no accuracy loss, offers plug‑and‑play deployment via vLLM, and outperforms other 4‑bit formats on Hopper/Blackwell GPUs, making it a practical choice for production AI workloads.
Introduction
Qwen3.6-35B-A3B is the first open‑source MoE model of the Qwen3.6 series, released by Alibaba in April. NVIDIA has now released a quantized version (NVFP4) that compresses the model from 36 GB to about 19 GB while keeping the model's capabilities essentially unchanged, enabling inference on a single consumer‑grade GPU.
Why This Quantized Version Matters
Quantizing MoE models is challenging because, despite few active parameters per token, the total parameter size remains large and still strains GPU memory. NVIDIA’s NVFP4 approach quantizes only the linear‑layer weights and activations inside the transformer blocks, leaving other parts untouched, which the author describes as a "steady" strategy.
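To make the idea of 4‑bit block quantization concrete, here is a minimal sketch of "fake quantization" (quantize, then dequantize) for one row of weights. It is not NVIDIA's implementation. The 16‑element block size and the E2M1 value grid reflect the commonly documented NVFP4 layout, and keeping each block scale in full precision (rather than FP8) is a simplifying assumption. All function names are hypothetical.

```python
# Illustrative fake-quantization of one weight row in an NVFP4-like layout.
# Assumptions (not from the article): 16-element blocks, an E2M1 value grid,
# and one scale per block kept in full precision for simplicity.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit E2M1 value can hold
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def fake_quantize(row):
    """Quantize and immediately dequantize a row, block by block."""
    restored = []
    for start in range(0, len(row), BLOCK_SIZE):
        scale, codes = quantize_block(row[start:start + BLOCK_SIZE])
        restored.extend(code * scale for code in codes)
    return restored


if __name__ == "__main__":
    row = [((i * 37) % 101 - 50) / 100.0 for i in range(32)]
    restored = fake_quantize(row)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"max abs error: {worst:.4f}")
```

Because every block gets its own scale, one outlier only degrades the precision of its 15 neighbours rather than the whole tensor, which is one reason small block sizes tend to preserve accuracy.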
Core Advantages
Extreme compression: Going from 16‑bit BF16 to 4‑bit NVFP4 reduces disk and VRAM demand by 3.06×.
Excellent accuracy retention: MMLU Pro drops only 0.6 points (85.6→85.0), GPQA Diamond drops 0.1 points (84.9→84.8), and AIME 2025 drops 0.4 points.
Plug‑and‑play: Deployable with a single vLLM command, with no extra quantization toolchain required.
Multimodal support: Retains Qwen3.6’s text, image, and video understanding with a 262K context window.
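A back‑of‑the‑envelope calculation shows why the reported ratio is 3.06× rather than a clean 4×. The sketch below assumes each quantized weight costs 4 bits plus a share of one 8‑bit scale per 16‑element block, and that everything left unquantized stays in BF16. Those storage details and the function names are assumptions for illustration, not figures from the article.

```python
# Rough compression arithmetic under stated assumptions (not official figures).
BF16_BITS = 16.0
FP4_BITS = 4.0
SCALE_BITS = 8.0   # assumed: one FP8 scale stored per block
BLOCK_SIZE = 16    # assumed block size


def effective_bits_per_quantized_weight():
    """4-bit payload plus the per-block scale amortized over the block."""
    return FP4_BITS + SCALE_BITS / BLOCK_SIZE


def overall_ratio(quantized_fraction):
    """BF16 size divided by mixed-precision size for a given quantized share."""
    q_bits = effective_bits_per_quantized_weight()
    avg_bits = quantized_fraction * q_bits + (1.0 - quantized_fraction) * BF16_BITS
    return BF16_BITS / avg_bits


def implied_quantized_fraction(ratio):
    """Invert overall_ratio: which share of weights yields the observed ratio?"""
    q_bits = effective_bits_per_quantized_weight()
    avg_bits = BF16_BITS / ratio
    return (BF16_BITS - avg_bits) / (BF16_BITS - q_bits)


if __name__ == "__main__":
    print(f"best case (everything quantized): {overall_ratio(1.0):.2f}x")
    print(f"share implied by 3.06x: {implied_quantized_fraction(3.06):.1%}")
```

Under these assumptions, a 3.06× ratio corresponds to roughly 94% of the parameters being stored in 4‑bit form, which is consistent with quantizing only the linear layers inside the transformer blocks.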
Benchmark Results
Across nine benchmarks the NVFP4 version shows near‑identical performance to BF16, with a few metrics even improving slightly. Selected scores:
MMLU Pro: BF16 85.6 → NVFP4 85.0
GPQA Diamond: BF16 84.9 → NVFP4 84.8
τ²‑Bench: BF16 95.5 → NVFP4 94.7
SciCode: BF16 40.8 → NVFP4 40.6
AIME 2025: BF16 89.2 → NVFP4 88.8
AA‑LCR: unchanged at 62.0
IFBench: BF16 62.3 → NVFP4 62.8 (slight rise)
MMMU PRO: BF16 74.1 → NVFP4 74.5 (slight rise)
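The scores listed above can be summarized with a few lines of Python. This sketch only restates the numbers from the list; the function names are illustrative.

```python
# (BF16, NVFP4) score pairs copied from the list above.
SCORES = {
    "MMLU Pro": (85.6, 85.0),
    "GPQA Diamond": (84.9, 84.8),
    "tau2-Bench": (95.5, 94.7),
    "SciCode": (40.8, 40.6),
    "AIME 2025": (89.2, 88.8),
    "AA-LCR": (62.0, 62.0),
    "IFBench": (62.3, 62.8),
    "MMMU PRO": (74.1, 74.5),
}


def summarize(scores):
    """Return (name, delta, recovery_percent) for each benchmark."""
    rows = []
    for name, (bf16, nvfp4) in scores.items():
        rows.append((name, nvfp4 - bf16, 100.0 * nvfp4 / bf16))
    return rows


def mean_delta(rows):
    """Average point change across benchmarks (negative means a drop)."""
    return sum(delta for _, delta, _ in rows) / len(rows)


if __name__ == "__main__":
    rows = summarize(SCORES)
    for name, delta, recovery in rows:
        print(f"{name:<14} {delta:+.1f} pts  ({recovery:.2f}% of BF16)")
    print(f"mean change: {mean_delta(rows):+.2f} pts")
```

On these eight scores the average change is a drop of 0.15 points, and the largest single drop is 0.8 points on τ²‑Bench.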
Base Model Strengths
Compared with Qwen3.5, Qwen3.6 improves two key areas:
Agentic Coding: Significant gains in front‑end workflow and repository‑level code reasoning.
Thinking Preservation: Keeps reasoning context from previous messages, reducing redundant computation.
Architecturally, the model features 256 experts (8+1 active per token), mixed attention (Gated DeltaNet + Gated Attention), native 262K context (extendable to 1 M tokens via YaRN), and Multi‑Token Prediction (MTP) for faster inference.
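The "8+1 active per token" figure can be illustrated with a toy router: pick the 8 highest‑scoring experts out of 256 and always add one shared expert. The sketch below uses scalars instead of tensors, and the softmax over the selected logits is an assumption about the weighting scheme, not a description of Qwen's actual implementation.

```python
import math

NUM_EXPERTS = 256  # routed experts, per the article
TOP_K = 8          # routed experts active per token, per the article


def route(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k logits; weights sum to 1."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert output plus the weighted outputs of the selected experts."""
    y = shared_expert(x)
    for index, weight in route(router_logits):
        y += weight * routed_experts[index](x)
    return y


if __name__ == "__main__":
    logits = [math.sin(i) for i in range(NUM_EXPERTS)]
    experts = [(lambda v, s=i: v * (1.0 + s / NUM_EXPERTS)) for i in range(NUM_EXPERTS)]
    print(moe_forward(1.0, logits, experts, shared_expert=lambda v: v))
```

Only 9 of the 257 expert functions run per token, which is why compute per token stays low even though all expert weights must still sit in memory.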
Installation and Deployment
NVIDIA recommends using vLLM for deployment. A basic command is:
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --quantization modelopt --max-model-len 262144 --reasoning-parser qwen3
Hardware requirements include Hopper (H100/H200) or Blackwell (B200/GB200) GPUs; the official test environment is NVIDIA GB300. For DGX Spark workstations, an optimized launch script with additional environment variables and flags is provided (see source).
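Loading a model of this size can take a while, so it may help to confirm the server is ready before sending requests. The sketch below polls the OpenAI‑compatible model listing endpoint that vLLM serves and checks for the model ID. The helper names are hypothetical, and the URL assumes the port from the command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes --port 8000 as in the command above
MODEL_ID = "nvidia/Qwen3.6-35B-A3B-NVFP4"


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style model listing response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url=BASE_URL, timeout=5.0):
    """GET <base_url>/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(base_url=BASE_URL, model_id=MODEL_ID):
    """True if the server answers and lists the expected model."""
    try:
        return model_id in served_model_ids(fetch_models(base_url))
    except (OSError, ValueError):  # connection refused, timeout, or bad JSON
        return False


if __name__ == "__main__":
    print("ready" if is_ready() else "not ready yet")
```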
Usage
After deployment, the model can be accessed via the standard OpenAI API:
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat_response = client.chat.completions.create(
    model="nvidia/Qwen3.6-35B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k is not a standard OpenAI field, so it goes in extra_body
)
print(chat_response.choices[0].message.content)
Recommended sampling parameters:
Thinking mode (general): temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5
Thinking mode (precise coding): temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0
Direct answer mode: temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5
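One way to avoid retyping these values is to keep them in a small lookup table and expand them into request arguments. This is a sketch: the preset names and the helper are hypothetical, and it only sets sampling values; how thinking mode itself is switched on or off is not covered here.

```python
# Sampling presets copied from the recommendations above; key names are illustrative.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "direct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def request_kwargs(mode):
    """Build keyword arguments for client.chat.completions.create()."""
    preset = dict(SAMPLING_PRESETS[mode])  # copy so the table is not mutated
    top_k = preset.pop("top_k")            # not a standard OpenAI field
    return {**preset, "extra_body": {"top_k": top_k}}


if __name__ == "__main__":
    print(request_kwargs("thinking_coding"))
```

With the client from the usage example, a call would then look like client.chat.completions.create(model=..., messages=..., **request_kwargs("thinking_coding")).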
Comparison with Other Quantization Formats
Several 4‑bit formats exist for Qwen3.6‑35B‑A3B on HuggingFace:
NVFP4 (NVIDIA): safetensors, best for vLLM on Hopper/Blackwell GPUs.
GGUF (Unsloth/Bartowski): GGUF, suited for llama.cpp or Ollama on macOS.
AWQ 4‑bit (cyankiwi/QuantTrio): safetensors, works with vLLM or Transformers.
MLX 4‑bit (mlx‑community): MLX format, targets Apple Silicon.
Three Providers Side‑by‑Side (ModelScope)
Key differences among NVIDIA, Unsloth, and Red Hat AI:
Quantization tool: NVIDIA Model Optimizer v0.44.0; Unsloth uses the same optimizer in a custom flow; Red Hat uses vllm‑project/llm‑compressor.
Calibration data: NVIDIA uses cnn_dailymail + Nemotron‑Post‑Training‑v2; Unsloth uses UltraChat (16K sequences); Red Hat uses UltraChat (256 samples, max_len 4096).
File organization: NVIDIA splits into three safetensors (≈23 GB total) with visual/video preprocessing; Unsloth provides a single 22.99 GB safetensors file; Red Hat separates main weights (22.5 GB), MTP (1.69 GB), and vision (893 MB).
Total size: NVIDIA ≈23.4 GB, Unsloth ≈23.0 GB, Red Hat ≈25.0 GB.
Benchmark coverage: NVIDIA reports eight metrics; Unsloth only GSM8K + short MMLU‑Pro; Red Hat only GSM8K Platinum (100.69% recovery).
Test hardware: NVIDIA uses GB300; others do not specify.
Upload dates: NVIDIA 2026‑05‑28, Unsloth 2026‑04‑30, Red Hat 2026‑04‑15.
Recommended launch command: NVIDIA – full vLLM config with DGX Spark optimizations; Unsloth – simple vLLM command with max‑model‑len 4096; Red Hat – vLLM command specifying moe_backend flashinfer_cutlass.
How to Choose
Enterprise‑grade GPUs (H100/H200/B200/GB200) → NVIDIA official (full benchmark, GB300 validation).
Desktop‑level GPUs or DGX Spark, text‑only workloads → Unsloth (single‑file loading, fastest start‑up).
Heavy reliance on vLLM toolchain, need independent MTP/vision weights → Red Hat AI (clean file structure).
All three versions incur only acceptable accuracy loss; if unsure, the NVIDIA official build is a safe default.
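The rules of thumb above can be written down as a small helper. This is only a sketch of the article's guidance: the function name, the argument values, and the order in which the rules are checked are assumptions.

```python
def recommend_build(gpu_class, text_only=False, needs_separate_mtp_or_vision=False):
    """Map a deployment profile to one of the three NVFP4 builds.

    gpu_class is one of "enterprise", "desktop", or "dgx_spark"
    (illustrative labels, not official categories).
    """
    if needs_separate_mtp_or_vision:
        return "Red Hat AI"   # ships MTP and vision weights as separate files
    if gpu_class == "enterprise":
        return "NVIDIA"       # fullest benchmark coverage, validated on GB300
    if gpu_class in ("desktop", "dgx_spark") and text_only:
        return "Unsloth"      # single-file loading
    return "NVIDIA"           # the article's suggested default when unsure


if __name__ == "__main__":
    print(recommend_build("dgx_spark", text_only=True))
```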
Conclusion
The NVIDIA Qwen3.6-35B-A3B‑NVFP4 quantization offers a three‑fold size reduction with virtually no accuracy degradation and a one‑line vLLM deployment, making it one of the best choices for running Qwen3.6 MoE models on Hopper/Blackwell GPUs in production environments.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
