Solving AdamW & Muon Instability: Pion Optimizer Updates Large Models on an Iso‑Spectral Manifold

The Pion optimizer leverages iso‑spectral manifold updates to preserve the spectral norm of weight matrices, eliminating additive‑update instability and enabling stable, efficient training of billion‑parameter LLMs across pre‑training, fine‑tuning, and reinforcement‑learning stages, outperforming AdamW and Muon.

Machine Heart

When large language models scale to hundreds of billions of parameters, two fundamental challenges arise: maintaining training stability and transferring hyper‑parameters across model sizes. Traditional optimizers such as AdamW and Muon rely on additive updates, which cause the spectral norm of weight matrices to grow unchecked, leading to exploding logits, drifting activation norms, loss spikes, and eventual training collapse.

Additive‑Update Dilemma

Both AdamW and Muon focus on rapid loss reduction but ignore the geometry of the weight matrix. Over time, additive accumulation changes both the magnitude and direction of parameters, inflating singular‑value spectra and unbalancing feature‑space scales. This undermines the Maximal Update Parameterization (μP) scaling laws and triggers instability.
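Why additive steps can move the spectrum follows from a standard perturbation bound (Weyl's inequality for singular values). The bound is not taken from the article; it is added here only to make the reasoning concrete, with W_t denoting the weights at step t and ΔW_t the additive update.

```latex
% Per-step drift of each singular value is bounded by the update's spectral norm
\left| \sigma_i(W_t + \Delta W_t) - \sigma_i(W_t) \right| \;\le\; \| \Delta W_t \|_2

% Summed over T steps, the spectral norm is only bounded by an accumulating term
\| W_T \|_2 \;\le\; \| W_0 \|_2 + \sum_{t=0}^{T-1} \| \Delta W_t \|_2
```

Each step may shift every singular value by up to the size of the update, and nothing in an additive rule forces these shifts to cancel, so the spectral norm is free to drift over a long run.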

POET Recap

The recent POET method reparameterizes each weight matrix as W = R W₀ P, where R and P are trainable orthogonal matrices applied on the left and right of a fixed matrix W₀. This preserves the spectrum because orthogonal transformations do not alter singular values. The cost is an explicit re‑parameterization with additional trainable orthogonal matrices, which increases system complexity.
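The spectrum-preservation property is easy to check numerically. The following is a minimal sketch in plain Python, restricted to 2×2 matrices so that singular values have a closed form. The helper names (`matmul`, `rotation`, `singular_values_2x2`) and the example matrices are illustrative assumptions, not taken from POET.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def rotation(theta):
    """2x2 orthogonal matrix: rotation by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def singular_values_2x2(w):
    """Closed-form singular values of a 2x2 matrix, largest first."""
    fro2 = sum(x * x for row in w for x in row)   # sigma1^2 + sigma2^2
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]   # |det| = sigma1 * sigma2
    disc = math.sqrt(max(fro2 * fro2 - 4.0 * det * det, 0.0))
    return (math.sqrt((fro2 + disc) / 2.0),
            math.sqrt(max((fro2 - disc) / 2.0, 0.0)))

w0 = [[3.0, 0.0], [0.0, 1.0]]                     # singular values 3 and 1

# Orthogonal left/right transformation: W = R @ W0 @ P
rotated = matmul(matmul(rotation(0.7), w0), rotation(-1.2))

# Additive update for comparison: W = W0 + delta
delta = [[0.5, 0.0], [0.0, 0.0]]
added = [[w0[i][j] + delta[i][j] for j in range(2)] for i in range(2)]

print(singular_values_2x2(w0))
print(singular_values_2x2(rotated))   # matches w0 up to rounding
print(singular_values_2x2(added))     # largest singular value has grown
```

The rotated matrix has visibly different entries from `w0` but the same singular values, while the additive update changes the largest singular value directly.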

Pion: Spectrum‑Preserving Optimizer without Re‑parameterization

Pion (POET‑induced Optimizer with No Reparameterization) embeds the spectrum‑preserving update directly into the optimizer. For any weight matrix W, Pion writes W = I · W · I, where the two identity factors are treated as zero‑rotation orthogonal matrices. During each step, gradients of these factors are computed, and updates are performed via skew‑symmetric matrices exponentiated back to the orthogonal group. The update rule (shown in the figure) applies left‑ and right‑hand orthogonal rotations generated from Lie‑algebra elements, ensuring that the singular values of W remain unchanged.
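The paragraph above can be turned into a small rotation-only step. This is a hedged sketch, not the authors' update rule: the article does not reproduce the exact formula, so the code assumes the simplest reading, in which the gradients of the left and right identity factors (G Wᵀ and Wᵀ G, where G is the gradient of the loss with respect to W) are projected onto skew-symmetric matrices and exponentiated. It omits momentum, normalization, and any other detail the real optimizer may use. All function names, the toy loss, and the learning rate are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def skew(m):
    """Skew-symmetric part (M - M^T) / 2, i.e. a Lie-algebra element of so(n)."""
    n = len(m)
    return [[0.5 * (m[i][j] - m[j][i]) for j in range(n)] for i in range(n)]

def expm(a, terms=25):
    """Matrix exponential via truncated Taylor series (adequate for small-norm inputs)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]   # a^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def rotation_only_step(w, grad, lr):
    """Assumed update: W <- exp(-lr * A_L) @ W @ exp(-lr * A_R)."""
    a_left = skew(matmul(grad, transpose(w)))    # gradient of left factor at I is G W^T
    a_right = skew(matmul(transpose(w), grad))   # gradient of right factor at I is W^T G
    r = expm([[-lr * x for x in row] for row in a_left])
    p = expm([[-lr * x for x in row] for row in a_right])
    return matmul(matmul(r, w), p)

def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def loss(w, target):
    """Toy loss 0.5 * ||W - target||_F^2, whose gradient is W - target."""
    return 0.5 * sum((w[i][j] - target[i][j]) ** 2
                     for i in range(len(w)) for j in range(len(w[0])))

# Toy problem: the target is a rotated copy of W, so it lies on the same spectrum.
c, s = math.cos(0.4), math.sin(0.4)
w = [[3.0, 0.0], [0.0, 1.0]]
target = matmul([[c, -s], [s, c]], w)
grad = [[w[i][j] - target[i][j] for j in range(2)] for i in range(2)]

w_new = rotation_only_step(w, grad, lr=0.05)
print(loss(w, target), loss(w_new, target))   # loss should decrease
print(frobenius(w), frobenius(w_new))         # norm should be unchanged up to rounding
```

Because both factors are exponentials of skew-symmetric matrices, they are orthogonal, so the step moves W toward the target without changing its singular values. The truncated Taylor series is used only to keep the sketch dependency-free; a production implementation would use a more robust matrix exponential or an approximation of it.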

Consequently, Pion does not stretch the weight matrix; it only rotates it in feature space. Spectral norm, Frobenius norm, and overall matrix scale stay stable, while the row and column spaces continue to evolve.
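The invariance of both norms follows from a short derivation. It uses only standard linear algebra; the symbols A (a skew-symmetric generator), R and P (the resulting left and right orthogonal factors), and the singular value decomposition W = U Σ Vᵀ are notation introduced here for illustration.

```latex
% The exponential of a skew-symmetric matrix is orthogonal
A^\top = -A
\;\Longrightarrow\;
\exp(A)^\top \exp(A) = \exp(-A)\,\exp(A) = I

% Orthogonal factors are absorbed into the singular vectors, leaving \Sigma untouched
W = U \Sigma V^\top
\;\Longrightarrow\;
R W P = (R U)\,\Sigma\,(P^\top V)^\top

% Hence every singular value, and any norm built from them, is preserved
\sigma_i(R W P) = \sigma_i(W),
\qquad
\| R W P \|_2 = \max_i \sigma_i(W) = \| W \|_2,
\qquad
\| R W P \|_F = \Big( \sum_i \sigma_i(W)^2 \Big)^{1/2} = \| W \|_F
```

Only the singular vectors, RU and PᵀV, change, which is the precise sense in which the row and column spaces keep evolving while the scale stays fixed.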

μP‑Compatible Pion

Because Pion inherently keeps the spectral norm fixed, it naturally satisfies μP’s requirement that both weight matrices and their updates follow a fixed scaling law. The authors further normalize the Lie‑algebra factors to meet μP’s scaling law, enabling direct learning‑rate transfer across model widths. Experiments on LLaMA‑like and Qwen architectures show that the optimal learning rate for Pion transfers almost unchanged across scales.
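The article does not say which norm or scaling the authors use for the Lie-algebra factors. Purely as an illustration of the idea, the sketch below assumes the generator is rescaled to unit Frobenius norm, so that the size of the rotation is set by the learning rate rather than by the raw gradient magnitude. The function names and example values are hypothetical, and the width-dependent factors that μP would prescribe are not modeled.

```python
import math

def skew(m):
    """Skew-symmetric part (M - M^T) / 2."""
    n = len(m)
    return [[0.5 * (m[i][j] - m[j][i]) for j in range(n)] for i in range(n)]

def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def normalized_generator(m, eps=1e-12):
    """Skew-symmetric part of m rescaled to unit Frobenius norm (assumed normalization)."""
    a = skew(m)
    norm = frobenius(a)
    return [[x / (norm + eps) for x in row] for row in a]

# The same raw factor at two very different gradient scales
raw = [[0.2, 1.5, -0.3],
       [-0.5, 0.1, 0.8],
       [0.4, -0.2, 0.0]]
scaled = [[1000.0 * x for x in row] for row in raw]

g_small = normalized_generator(raw)
g_large = normalized_generator(scaled)

print(frobenius(g_small), frobenius(g_large))   # both close to 1
```

After normalization the two generators coincide, so a rotation built from `lr * generator` has the same magnitude regardless of gradient scale. That scale-independence is one ingredient needed for a learning rate to carry over between model sizes.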

Experimental Evaluation

Stability in Pre‑training: On a 1.3B LLaMA‑like model, Pion maintains flat trajectories for attention‑logit magnitude, SwiGLU activation norm, and other stability metrics, whereas AdamW’s logits grow continuously and Muon’s activation norms still drift upward.

Normalization‑Free Training: Removing all normalization layers from a 60M model causes AdamW and Muon to diverge with NaNs, while Pion remains stable throughout a 9.6B‑token run, demonstrating that spectrum preservation can replace architectural scale‑control mechanisms.

Deep‑Network Stress Test: Extending a 60M model from 8 to 200 layers and training on 50B tokens shows Pion achieving the lowest loss‑trajectory standard deviation (0.0892) compared with AdamW (0.0931) and Muon (0.0927), indicating superior stability in extreme depth.

Supervised Fine‑tuning (SFT): On Qwen2.5‑1.5B and Llama3.2‑3B, Pion balances plasticity and stability, attaining the highest in‑domain and out‑of‑domain scores on code generation and maintaining near‑optimal performance on mathematical reasoning while preserving prior capabilities.

Reinforcement Learning with Verifiable Reward (RLVR): In RL settings (Qwen3‑1.7B, DeepSeek‑R1‑Distill‑Qwen‑1.5B) using the GRPO framework, Pion achieves the best average performance, faster convergence, and reduced variance, suggesting that spectrum‑preserving updates align well with the inductive bias of RL training.

Conclusion

Historically, optimizers were judged mainly by convergence speed. As model scales grow, stability becomes equally critical. Pion demonstrates that embedding a geometric constraint, specifically spectrum preservation, directly into the optimizer can replace many ad‑hoc training patches, offering a more controllable, structured, and long‑term stable optimization path for large models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
