May 31, 2026 · Artificial Intelligence

Qwen3.6-35B-A3B NVFP4: A Stable, Highly Compressed Quantized Model

NVIDIA's NVFP4 quantization reduces Qwen3.6-35B-A3B's memory footprint threefold with almost no accuracy loss, offers plug‑and‑play deployment via vLLM, and outperforms other 4‑bit formats on Hopper/Blackwell GPUs, making it a practical choice for production AI workloads.

MoE · NVFP4 · Qwen3.6-35B-A3B

0 likes · 13 min read

Old Zhang's AI Learning

May 11, 2026 · Artificial Intelligence

Open‑Source Qwen3.6‑35B‑A3B Runs at 162 tok/s on a Single RTX 5090

The article introduces the open‑source Qwen3.6‑35B‑A3B model, explains its MoE architecture and three‑stage LoRA fine‑tuning, reports benchmark results in which it reaches 161.9 tok/s on an RTX 5090 (2.6× faster than a dense 27B counterpart), and discusses deployment tips, the quantized GGUF release, and known compatibility pitfalls.

GGUF quantization · Large Language Model · LoRA fine-tuning

0 likes · 7 min read
