AMD Paper Finds FP4 Training Instability Is Not Due to Randomness, 9‑10% Speedup

The authors show that FP4 training instability stems from structural microscaling errors in the weight‑gradient path, not from insufficient randomness. A deterministic Hadamard rotation restores convergence and delivers a 9‑10% end‑to‑end speedup on native FP4 hardware (AMD MI355X), at a cost of only 8‑9% token overhead.
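To make the idea of a Hadamard rotation concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the block size of 4, the helper names `hadamard` and `rotate`, and the example values are all assumptions chosen for illustration. It only shows the general property that such rotations are valued for in low‑precision work: an orthonormal Hadamard transform spreads a single outlier evenly across a block while preserving the vector's norm, so a shared per‑block scale is no longer dominated by one value.

```python
import math

def hadamard(n):
    """Build an n x n Hadamard matrix (Sylvester construction).

    n must be a power of two; entries are +1 / -1.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = [[1]]
    while len(H) < n:
        # Double the size: [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rotate(x):
    """Apply the orthonormal Hadamard rotation H / sqrt(n) to vector x."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, x)) for row in hadamard(n)]

# Hypothetical block with one large outlier (illustrative values only).
block = [8.0, 0.0, 0.0, 0.0]
rotated = rotate(block)

# The rotation is deterministic and its own inverse, so it can be undone exactly.
restored = rotate(rotated)

print("largest magnitude before:", max(abs(v) for v in block))
print("largest magnitude after: ", max(abs(v) for v in rotated))
```

Because the Sylvester matrix is symmetric and orthogonal up to the 1/sqrt(n) factor, applying `rotate` twice returns the original block, which is why no random seed or extra state is needed.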

Deterministic Hadamard · FP4 · MXFP4

0 likes · 10 min read

Data Party THU

Sep 4, 2025 · Artificial Intelligence

How MXFP4 Quantization Lets a 120‑Billion‑Parameter LLM Run on a Single 80 GB GPU

This article analyzes the memory bottleneck of massive language models, models their memory requirements mathematically, evaluates the limits of traditional sharding, and details how GPT‑OSS’s MXFP4 quantization, combined with Mixture‑of‑Experts, reduces memory, bandwidth, and compute demands enough to fit a 120‑billion‑parameter model onto an 80 GB GPU with minimal accuracy loss.
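The memory claim can be checked with back‑of‑the‑envelope arithmetic. The sketch below is an illustration, not the article's own model: it takes a parameter count of 120 billion (GPT‑OSS's larger model is published as gpt‑oss‑120b), uses the MXFP4 layout of 4‑bit elements sharing one 8‑bit scale per 32‑element block, and assumes, as a simplification, that every weight is stored in MXFP4. It ignores the KV cache, activations, and any layers kept at higher precision, so the real footprint is somewhat larger. The function name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params, bits_per_element, block_size=None, scale_bits=0):
    """Bytes needed to store num_params weights.

    If block_size is given, each block of that many elements carries
    one extra shared scale of scale_bits bits.
    """
    bits = num_params * bits_per_element
    if block_size:
        num_blocks = -(-num_params // block_size)  # ceiling division
        bits += num_blocks * scale_bits
    return bits / 8

PARAMS = 120_000_000_000  # assumed parameter count

# BF16 baseline: 16 bits per weight, no shared scales.
bf16_gb = weight_bytes(PARAMS, 16) / 1e9

# MXFP4: 4-bit elements plus one 8-bit scale per 32 elements (4.25 bits each).
mxfp4_gb = weight_bytes(PARAMS, 4, block_size=32, scale_bits=8) / 1e9

print(f"BF16 weights:  {bf16_gb:.2f} GB")
print(f"MXFP4 weights: {mxfp4_gb:.2f} GB")
```

Under these assumptions the weights alone drop from 240 GB to 63.75 GB, which leaves some headroom on an 80 GB device.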

FP4 · LLM · MXFP4

0 likes · 11 min read
