Why Randomly Masking Gradients Can Outperform Adam in Large‑Scale Model Training
The article explains how randomly masking a large portion of gradient updates during large‑model training—sometimes up to 99%—can accelerate convergence and even surpass traditional optimizers like Adam, supported by recent Google research and empirical observations.
