Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes
The article presents Token Superposition Training (TST), which temporarily averages embeddings of non‑overlapping token bags and predicts groups of tokens in a first phase before reverting to standard token‑wise prediction, achieving up to 2.5× pre‑training speedup on 10B‑1B MoE models without altering model architecture or inference.
