AMD Paper Finds FP4 Training Instability Is Not Caused by a Lack of Randomness, Reports 9‑10% Speedup
The authors demonstrate that FP4 training instability stems from structural micro‑scaling errors in the weight‑gradient path rather than insufficient randomness, and show that a deterministic Hadamard rotation restores convergence, delivering a 9‑10% end‑to‑end speedup on native FP4 hardware (AMD MI355X) while incurring only 8‑9% token overhead.
Background
Training large language models is extremely costly, and reducing numerical precision is a proven way to cut expenses. After the success of FP8 (e.g., DeepSeek‑V3 achieving a $5.6M training cost), the community has been probing the limits of lower precision, asking whether moving from FP8 to FP4 can further reduce costs.
Both NVIDIA Blackwell and AMD MI350 series provide native FP4 support, with the former claiming up to 4500 TOPS (sparse) on the B200. However, end‑to‑end FP4 training of LLMs has been notoriously unstable, and the root cause was unclear.
MXFP4 Format
The paper builds on MXFP4, a micro‑scaling quantization format. Instead of a single scale for an entire tensor, MXFP4 partitions a tensor into small blocks (e.g., 32 elements) and assigns each block a shared exponent (E8M0). Each element is stored as a 4‑bit floating‑point value, and the original value is reconstructed approximately as x_i ≈ 2^e · q_i, where e is the block's shared exponent and q_i is the element's 4‑bit value.
Because each block has its own dynamic range, MXFP4 avoids the “global outlier hijacking” problem of naïve quantization, improving representation quality at 4‑bit precision.
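The block-wise scheme can be illustrated with a small NumPy sketch. It is not the paper's kernel or any vendor API: the function names are hypothetical, the element grid assumes the E2M1 4-bit layout, the scale is chosen as a power of two from the block maximum, and ties in rounding go to the smaller magnitude for simplicity. Clamping of the E8M0 exponent range is omitted.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x, block=32):
    """Quantize a 1-D array whose length is a multiple of `block`.

    Returns (exponents, elements) such that x ≈ 2**exponent * element per block.
    Hypothetical helper for illustration only.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Shared power-of-two exponent; the 2 is the largest E2M1 exponent (6 = 1.5 * 2**2).
    exp = np.floor(np.log2(safe)) - 2
    scaled = blocks / 2.0 ** exp
    # Snap each magnitude to the nearest grid point (values above 6 saturate at 6).
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    elems = np.sign(scaled) * E2M1_GRID[idx]
    return exp, elems

def mxfp4_dequantize(exp, elems):
    """Reconstruct approximate values: shared scale times 4-bit element."""
    return (2.0 ** exp * elems).reshape(-1)

# Two blocks: the second contains one large outlier.
x = np.full(64, 0.01)
x[63] = 1000.0
exp, elems = mxfp4_quantize(x)
x_hat = mxfp4_dequantize(exp, elems)
```

In this toy input the outlier only affects the scale of its own block: the small values in the first block stay nonzero after reconstruction, while the small values sharing a block with the outlier are flushed to zero.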
Diagnostic Experiments
The authors designed a stepwise controlled experiment on a Transformer linear layer, isolating its three matrix‑multiply operations:
Fprop: compute Y = XWᵀ (forward activation)
Dgrad: compute ∇X = ∇Y·W (activation gradient)
Wgrad: compute ∇W = (∇Y)ᵀ·X (weight gradient)
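The three products above can be written out in NumPy to make the shapes explicit. The dimensions and variable names below are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 8, 4, 3               # illustrative sizes only

X = rng.standard_normal((tokens, d_in))     # layer input
W = rng.standard_normal((d_out, d_in))      # weight matrix
dY = rng.standard_normal((tokens, d_out))   # upstream gradient, i.e. ∇Y

Y = X @ W.T      # Fprop: shape (tokens, d_out), contracts over d_in
dX = dY @ W      # Dgrad: shape (tokens, d_in),  contracts over d_out
dW = dY.T @ X    # Wgrad: shape (d_out, d_in),  contracts over tokens
```

Note that Wgrad is the only one of the three whose contraction runs over the token dimension rather than a feature dimension.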
Starting from an FP8 baseline, they replaced each operation with MXFP4 on an AMD Instinct MI355X GPU (native FP4 tensor cores, no software emulation) while keeping all other factors constant. The training task follows the MLPerf C4 benchmark, pre‑training Llama 3.1‑8B to a validation perplexity of 3.3.
Replacing Fprop and Dgrad with MXFP4 caused only modest token overhead, but swapping Wgrad to MXFP4 caused the overhead to jump to 26‑27% and dramatically degraded convergence. The authors conclude that Wgrad is the bottleneck for FP4 training stability.
Randomness Strategies vs. Deterministic Hadamard
Prevailing intuition held that FP4 quantization error behaves like noise, so injecting randomness (stochastic rounding or randomized Hadamard rotation) should smooth the error distribution. The authors evaluated both:
Stochastic Rounding: randomizes rounding to make the expected error zero.
Randomized Hadamard: applies a random sign‑flipping Hadamard transform before quantization.
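As a rough illustration of the first strategy, the sketch below rounds a value to one of its two neighbouring grid points with probability proportional to proximity, so the result is unbiased in expectation. The grid assumes E2M1 magnitudes and the function name is hypothetical; block scaling is left out to keep the example short.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # assumed E2M1 magnitudes

def stochastic_round(x, rng):
    """Round each value to a neighbouring grid point so that E[result] = x.

    Values beyond the largest grid magnitude saturate. Hypothetical helper.
    """
    x = np.asarray(x, dtype=np.float64)
    mag = np.clip(np.abs(x), 0.0, GRID[-1])
    # Index of the grid point just above each magnitude (at least 1, at most the last).
    hi_idx = np.clip(np.searchsorted(GRID, mag, side="right"), 1, len(GRID) - 1)
    lo, hi = GRID[hi_idx - 1], GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)           # closer to hi means more likely to round up
    round_up = rng.random(mag.shape) < p_up
    return np.sign(x) * np.where(round_up, hi, lo)

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(200_000, 2.5), rng)
```

Each individual sample here is either 2 or 3, but the sample mean stays close to 2.5. That zero-mean property is what the randomness buys; it does not make any single rounding more accurate.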
When Wgrad was quantized, both random strategies failed to stabilize training and even prevented convergence, introducing additional effective quantization error along the critical gradient path.
In contrast, a deterministic Hadamard rotation (fixed transform H16) reduced the total token overhead back to 8‑9% and produced a training trajectory that closely matched the FP8 baseline.
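A fixed Hadamard rotation can be sketched as follows. The construction of the 16×16 matrix is standard (Sylvester), but where exactly the rotation is applied is an assumption here: this example rotates both Wgrad operands along the token (contraction) axis in groups of 16, which may differ from the paper's placement. The helper names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix, normalized so that H @ H.T = I.

    n must be a power of two.
    """
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_blocks(A, H):
    """Apply H to consecutive groups of rows of A (rows = contraction axis)."""
    n = H.shape[0]
    rows, cols = A.shape
    return (H @ A.reshape(rows // n, n, cols)).reshape(rows, cols)

H16 = hadamard(16)                            # fixed transform, no random signs

rng = np.random.default_rng(0)
tokens, d_in, d_out = 32, 4, 3                # illustrative sizes only
X = rng.standard_normal((tokens, d_in))
dY = rng.standard_normal((tokens, d_out))

dW_plain = dY.T @ X
dW_rotated = rotate_blocks(dY, H16).T @ rotate_blocks(X, H16)

# A single spike in a block is spread evenly across all 16 positions.
spike = np.zeros((16, 1))
spike[0, 0] = 16.0
spread = rotate_blocks(spike, H16)
```

Because H16 is orthogonal, the rotated product equals the plain Wgrad in exact arithmetic, so the rotation changes only what the quantizer sees. In a quantized pipeline the rotated operands would be quantized before the multiply. The randomized variant described earlier would additionally multiply by a random diagonal of ±1 signs; the deterministic version uses the same H16 at every step.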
The key insight is that the instability originates from structural micro‑scaling error that accumulates along the sensitive weight‑gradient path. Randomness injects varying error patterns that amplify this accumulation, whereas a deterministic transform keeps the error pattern consistent, avoiding amplification.
End‑to‑End Efficiency
Combining deterministic Hadamard with full‑process MXFP4 yields the following results on the MI355X:
Training‑step throughput improves by 20%, and after accounting for the 8‑9% extra token cost, the overall end‑to‑end speedup remains at 9‑10%. This is notable given the precision drop from 8‑bit to 4‑bit.
Limitations
The authors stress that their results are validated only for the MLPerf C4 benchmark with Llama 3.1‑8B. They do not claim universal applicability across all models, datasets, or training pipelines; stability strategies may need re‑evaluation for different settings.
Broader Implications
The paper provides three layers of significance:
It offers a causal diagnosis: instability is driven by structural micro‑scaling error in the Wgrad path, not by lack of randomness.
It demonstrates that FP4 can be used for training, not just inference, potentially doubling usable training compute on hardware that already supports FP4.
It leverages the OCP Microscaling standard (MXFP4), backed by seven major companies, ensuring cross‑vendor portability.
Overall, the work marks a pivotal step from FP8 to FP4 in large‑model training, showing that each precision reduction can substantially improve training economics when the underlying error mechanisms are properly addressed.
