AMD Paper Finds FP4 Training Instability Is Not Due to Randomness, 9‑10% Speedup

The authors demonstrate that FP4 training instability stems from structural micro‑scaling errors in the weight‑gradient path rather than insufficient randomness, and show that a deterministic Hadamard rotation restores convergence, delivering a 9‑10% end‑to‑end speedup on native FP4 hardware (AMD MI355X) while incurring only 8‑9% token overhead.

Machine Heart

Background

Training large language models is extremely costly, and reducing numerical precision is a proven way to cut expenses. After the success of FP8 (e.g., DeepSeek‑V3 achieving a $5.6M training cost), the community has been probing the limits of lower precision, asking whether moving from FP8 to FP4 can further reduce costs.

Both NVIDIA Blackwell and AMD MI350 series provide native FP4 support, with the former claiming up to 4500 TOPS (sparse) on the B200. However, end‑to‑end FP4 training of LLMs has been notoriously unstable, and the root cause was unclear.

MXFP4 Format

The paper builds on MXFP4, a micro‑scaling quantization format. Instead of a single scale for an entire tensor, MXFP4 partitions a tensor into small blocks (e.g., 32 elements) and assigns each block a shared exponent (E8M0). Each element is stored as a 4‑bit floating‑point value, reconstructed by:

xᵢ = 2ᵉ · qᵢ

where e is the block's shared exponent and qᵢ is the element's 4‑bit value.

Because each block has its own dynamic range, MXFP4 avoids the “global outlier hijacking” problem of naïve quantization, improving representation quality at 4‑bit precision.
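The block-scaling idea can be sketched in a few lines of plain Python. This is an illustration only, not the paper's kernel or AMD's hardware path: the names `FP4_GRID`, `quantize_block`, and `quantize` are hypothetical, the element grid assumes the E2M1 layout of FP4, ties round toward the smaller magnitude rather than to even, and the shared exponent is not clamped to the E8M0 range.

```python
import math

# Magnitudes a 4-bit E2M1 element can represent (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements sharing one power-of-two scale


def quantize_block(block):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    # Subtracting 2 (the exponent of the largest FP4 magnitudes, 4 and 6) puts
    # the scaled block maximum in [4, 8); anything above 6 saturates below.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - 2
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])        # saturate at 6.0
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, v))        # x_i = 2^e * q_i
    return shared_exp, out


def quantize(values, block=BLOCK):
    """Quantize a flat list block by block and return the dequantized values."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_block(values[i:i + block])[1])
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# One block of small values followed by one block containing an outlier.
small = [0.01 * (1 + i % 7) for i in range(32)]
large = [100.0] + [1.0] * 31
tensor = small + large

per_block = quantize(tensor, block=32)            # one scale per 32 elements
per_tensor = quantize(tensor, block=len(tensor))  # one scale for everything

err_block = mean_abs_error(small, per_block[:32])
err_tensor = mean_abs_error(small, per_tensor[:32])
print(f"small-block error, per-block scale:  {err_block:.5f}")
print(f"small-block error, per-tensor scale: {err_tensor:.5f}")
```

With a single scale, the outlier of 100 sets the range for the whole tensor and every small value rounds to zero; with one scale per block, the small block keeps its own range.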

Diagnostic Experiments

The authors designed a step‑wise control experiment on a Transformer linear layer, isolating three matrix‑multiply operations:

Fprop: compute Y = XWᵀ (forward activation)

Dgrad: compute ∇X = ∇Y·W (activation gradient)

Wgrad: compute ∇W = (∇Y)ᵀ·X (weight gradient)
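To make the shapes concrete, here is a minimal sketch of the three products using nested lists. The helper names (`matmul`, `linear_fprop`, `linear_dgrad`, `linear_wgrad`) and the toy sizes are assumptions chosen for illustration; a real training stack would run these as GPU GEMMs.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def linear_fprop(x, w):
    # Y = X W^T : (tokens, d_in) x (d_in, d_out) -> (tokens, d_out)
    return matmul(x, transpose(w))


def linear_dgrad(grad_y, w):
    # dX = dY W : (tokens, d_out) x (d_out, d_in) -> (tokens, d_in)
    return matmul(grad_y, w)


def linear_wgrad(grad_y, x):
    # dW = dY^T X : (d_out, tokens) x (tokens, d_in) -> (d_out, d_in)
    # The sum runs over the token dimension, unlike Fprop and Dgrad.
    return matmul(transpose(grad_y), x)


# Toy sizes: 3 tokens, d_in = 2, d_out = 4.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
grad_y = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

y = linear_fprop(x, w)
grad_x = linear_dgrad(grad_y, w)
grad_w = linear_wgrad(grad_y, x)
print(len(y), len(y[0]))            # shape of Y
print(len(grad_x), len(grad_x[0]))  # shape of dX
print(len(grad_w), len(grad_w[0]))  # shape of dW
```

Wgrad is the only one of the three whose inner sum runs over tokens, so it is the product where quantization error from every token in the batch is added into the same weight update.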

Starting from an FP8 baseline, they replaced each operation with MXFP4 on an AMD Instinct MI355X GPU (native FP4 tensor cores, no software emulation) while keeping all other factors constant. The training task follows the MLPerf C4 benchmark, pre‑training Llama 3.1‑8B to a validation log perplexity of 3.3.

Replacing Fprop and Dgrad with MXFP4 added only modest token overhead, that is, extra training tokens needed to reach the target relative to the FP8 baseline. Switching Wgrad to MXFP4 pushed the overhead to 26‑27% and sharply degraded convergence. The authors conclude that Wgrad is the bottleneck for FP4 training stability.

Randomness Strategies vs. Deterministic Hadamard

Prevailing intuition held that FP4 quantization error behaves like noise, so injecting randomness (stochastic rounding or randomized Hadamard rotation) should smooth the error distribution. The authors evaluated both:

Stochastic Rounding: randomizes rounding to make the expected error zero.

Randomized Hadamard: applies a random sign‑flipping Hadamard transform before quantization.

When Wgrad was quantized, both random strategies failed to stabilize training and even prevented convergence, introducing additional effective quantization error along the critical gradient path.
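A small sketch shows the trade-off behind stochastic rounding. It is a toy illustration, not the paper's analysis: `round_nearest` and `round_stochastic` are hypothetical helpers, the grid assumes E2M1 magnitudes, and only non-negative inputs are handled.

```python
import random

# Non-negative magnitudes of a 4-bit E2M1 element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Deterministic rounding to the closest grid point."""
    return min(FP4_GRID, key=lambda g: abs(g - x))


def round_stochastic(x, rng):
    """Round up or down at random so that the expected result equals x."""
    x = min(max(x, 0.0), FP4_GRID[-1])
    lo = max(g for g in FP4_GRID if g <= x)
    hi = min(g for g in FP4_GRID if g >= x)
    if hi == lo:
        return lo
    p_up = (x - lo) / (hi - lo)  # closer to hi -> more likely to round up
    return hi if rng.random() < p_up else lo


rng = random.Random(0)
x = 2.25  # lies between the grid points 2.0 and 3.0
samples = [round_stochastic(x, rng) for _ in range(20000)]

mean_sr = sum(samples) / len(samples)
mse_sr = sum((s - x) ** 2 for s in samples) / len(samples)
mse_rn = (round_nearest(x) - x) ** 2

print(f"stochastic mean: {mean_sr:.3f}")
print(f"mean squared error, stochastic: {mse_sr:.4f}")
print(f"mean squared error, nearest:    {mse_rn:.4f}")
```

The average of the stochastic samples sits close to 2.25, so the rounding is unbiased, but each individual sample is further from the true value on average than the round-to-nearest result. That extra per-sample error is one way to read the "additional effective quantization error" described above.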

In contrast, a deterministic Hadamard rotation (fixed transform H16) reduced the total token overhead back to 8‑9% and produced a training trajectory that closely matched the FP8 baseline.

The key insight is that the instability originates from structural micro‑scaling error that accumulates along the sensitive weight‑gradient path. Randomness injects varying error patterns that amplify this accumulation, whereas a deterministic transform keeps the error pattern consistent, avoiding amplification.
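The fixed rotation can be sketched as follows. The sketch assumes H16 is the 16×16 Sylvester Hadamard matrix scaled to be orthonormal, and that it is applied along the contraction dimension of both operands; the article does not spell out either detail, and `hadamard`, `rotate`, and `dot` are hypothetical helper names.

```python
def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = n ** -0.5
    return [[v * s for v in row] for row in h]


def rotate(h, vec):
    """Apply the matrix h to a vector."""
    return [sum(hv * v for hv, v in zip(row, vec)) for row in h]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


H16 = hadamard(16)

# A 16-element slice dominated by a single outlier, and an ordinary slice.
a = [16.0] + [0.0] * 15
b = [float(i + 1) for i in range(16)]

ra, rb = rotate(H16, a), rotate(H16, b)

print("max |a| before rotation:", max(abs(v) for v in a))
print("max |a| after rotation: ", max(abs(v) for v in ra))
print("dot product before:", dot(a, b))
print("dot product after: ", round(dot(ra, rb), 9))
```

Because the matrix is orthonormal, rotating both operands leaves their exact dot product unchanged, while the outlier's magnitude is spread evenly across the 16 positions. Under the assumptions above, the values handed to the block quantizer have a flatter range, and the same fixed matrix is reused at every step. A randomized variant would additionally flip signs with a fresh random pattern before rotating.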

End‑to‑End Efficiency

Combining deterministic Hadamard with full‑process MXFP4 yields the following results on the MI355X:

Training‑step throughput improves by 20%, and after accounting for the 8‑9% extra token cost, the overall end‑to‑end speedup remains at 9‑10%. This is notable given the precision drop from 8‑bit to 4‑bit.
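As a rough back-of-the-envelope check, time to reach the target scales with tokens processed divided by throughput. The sketch below applies that relation to the rounded figures quoted in this article; `end_to_end_speedup` is a hypothetical helper, and the assumption that the 20% throughput gain would also hold without the Hadamard rotation is mine, not the paper's.

```python
def end_to_end_speedup(throughput_gain, token_overhead):
    """Ratio of baseline time-to-target to FP4 time-to-target.

    throughput_gain: fractional step-throughput improvement (0.20 for 20%).
    token_overhead:  fractional extra tokens needed to reach the target.
    """
    return (1.0 + throughput_gain) / (1.0 + token_overhead)


# Deterministic Hadamard: 8-9% token overhead.
with_hadamard = [end_to_end_speedup(0.20, o) for o in (0.08, 0.09)]

# Wgrad in MXFP4 without mitigation: 26-27% token overhead
# (assumes the same 20% throughput gain, which the article does not state).
without_fix = [end_to_end_speedup(0.20, o) for o in (0.26, 0.27)]

for s in with_hadamard:
    print(f"with deterministic Hadamard: {100 * (s - 1):+.1f}%")
for s in without_fix:
    print(f"without mitigation:          {100 * (s - 1):+.1f}%")
```

With these rounded inputs the estimate lands near the reported 9‑10% range; the exact figure depends on the unrounded throughput gain. The same relation shows why the 26‑27% overhead matters: at that level the extra tokens would outweigh a 20% throughput gain.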

Limitations

The authors stress that their results are validated only for the MLPerf C4 benchmark with Llama 3.1‑8B. They do not claim universal applicability across all models, datasets, or training pipelines; stability strategies may need re‑evaluation for different settings.

Broader Implications

The paper provides three layers of significance:

It offers a causal diagnosis: instability is driven by structural micro‑scaling error in the Wgrad path, not by lack of randomness.

It demonstrates that FP4 can be used for training, not just inference, potentially doubling usable training compute on hardware that already supports FP4.

It leverages the OCP Microscaling standard (MXFP4), backed by seven major companies, which supports cross‑vendor portability.

Overall, the work marks a pivotal step from FP8 to FP4 in large‑model training, showing that each precision reduction can substantially improve training economics when the underlying error mechanisms are properly addressed.
