May 29, 2026 · Artificial Intelligence

Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST), a two-phase pre-training scheme. In the first phase, the model receives the averaged embeddings of non-overlapping bags of tokens and predicts groups of tokens rather than single tokens. In the second phase, training reverts to standard token-wise prediction. The article reports up to a 2.5× pre-training speedup on 10B‑1B MoE models, with no change to the model architecture or to inference.
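The first-phase idea can be illustrated with a minimal NumPy sketch. It makes several assumptions not stated in the summary: the function names, the bag size, dropping a ragged tail of tokens, and the loss (an average cross-entropy over every token in the next bag) are illustrative stand-ins, not the article's exact formulation.

```python
import numpy as np

def superpose_embeddings(token_ids, embedding, bag_size):
    """Average the embeddings of non-overlapping bags of `bag_size` tokens.

    Hypothetical helper: trailing tokens that do not fill a bag are dropped.
    """
    usable = len(token_ids) - len(token_ids) % bag_size
    bags = np.asarray(token_ids[:usable]).reshape(-1, bag_size)
    # embedding[bags] has shape (n_bags, bag_size, dim); average over the bag axis
    return embedding[bags].mean(axis=1), bags

def bag_cross_entropy(logits, target_bags):
    """Mean cross-entropy of each position's logits against all tokens in its target bag.

    Assumed stand-in for a group-prediction loss.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target_bags, axis=-1)
    return float(-picked.mean())

# Toy usage: a linear map stands in for the (unchanged) transformer.
rng = np.random.default_rng(0)
vocab, dim, bag_size = 50, 8, 4
embedding = rng.normal(size=(vocab, dim))
token_ids = rng.integers(0, vocab, size=18)

inputs, bags = superpose_embeddings(token_ids, embedding, bag_size)
logits = inputs[:-1] @ embedding.T          # one prediction per input bag
loss = bag_cross_entropy(logits, bags[1:])  # target: the tokens of the next bag
```

Under these assumptions the sequence the model processes is `bag_size` times shorter than the token sequence, which is where a first-phase speedup would come from. Setting `bag_size = 1` reduces the sketch to ordinary next-token cross-entropy, mirroring the second phase.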

LLM pretraining · MCE loss · Mixture of Experts

9 min read
