Tagged articles
1 article
Data Party THU
May 29, 2026 · Artificial Intelligence

Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST). In a first phase, TST averages the embeddings of non‑overlapping bags of tokens and trains the model to predict groups of tokens; it then reverts to standard token‑by‑token prediction. The article reports up to 2.5× pre‑training speedup on 10B‑1B MoE models, with no changes to model architecture or inference.
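
The first-phase idea in the summary can be illustrated with a minimal pure-Python sketch. It is not the article's implementation: the function names, the bag size, the toy embeddings, and in particular the loss (a mean negative log-likelihood over all tokens in the target bag under a single softmax) are assumptions chosen for illustration.

```python
import math


def superpose(embeddings, bag_size):
    """Average embeddings over consecutive, non-overlapping bags of tokens.

    `embeddings` is a list of equal-length float lists, one per token.
    Hypothetical helper: the summary only says bags are averaged.
    """
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be divisible by bag_size")
    bags = []
    for start in range(0, len(embeddings), bag_size):
        bag = embeddings[start:start + bag_size]
        dim = len(bag[0])
        # Element-wise mean of the vectors in this bag.
        bags.append([sum(vec[d] for vec in bag) / bag_size for d in range(dim)])
    return bags


def bag_targets(token_ids, bag_size):
    """Group token ids into the same non-overlapping bags used for the inputs."""
    return [token_ids[s:s + bag_size] for s in range(0, len(token_ids), bag_size)]


def multi_token_cross_entropy(logits, target_bag):
    """Mean negative log-likelihood of every token in `target_bag`.

    Assumed loss form: one softmax over the vocabulary, scored against
    all tokens in the bag at once.
    """
    m = max(logits)
    # Numerically stable log of the softmax normalizer.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(logits[t] - log_z for t in target_bag) / len(target_bag)


# Four toy token embeddings (dimension 2) collapse into two bag embeddings,
# so the model would process a sequence half as long in this phase.
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
bag_inputs = superpose(toy_embeddings, bag_size=2)
targets = bag_targets([5, 7, 2, 9], bag_size=2)
```

With a bag size of 2, the sequence the model sees in this phase is half as long, which is where a throughput gain could come from. Switching back to a bag size of 1 recovers ordinary next-token inputs and targets.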

LLM pretraining · MCE loss · Mixture of Experts
9 min read