Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST), a two‑phase pre‑training scheme. In the first phase it averages the embeddings of non‑overlapping token bags and predicts groups of tokens; it then reverts to standard token‑wise prediction. The reported result is up to a 2.5× pre‑training speedup on 10B‑1B MoE models, without altering model architecture or inference.

LLM pretraining · MCE loss · Mixture of Experts

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 16, 2026 · Artificial Intelligence

Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering the model architecture or the compute budget. A superposition phase groups tokens into bags, averages each bag's embeddings, and predicts groups of tokens; a standard recovery phase follows. The approach is demonstrated on a 10B‑parameter MoE and on smaller models.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Nous Research introduces Token Superposition Training, which groups tokens into bags, averages their embeddings, and predicts token groups without altering model architecture or adding compute, achieving up to 2.5× faster pre‑training while maintaining standard inference.

LLM pretraining · MCE loss · MoE

0 likes · 10 min read

Baobao Algorithm Notes

Dec 23, 2024 · Artificial Intelligence

From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks through every stage of building a large‑language‑model pretraining pipeline—from data sourcing, cleaning, and deduplication, to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—providing actionable tips and pitfalls to avoid for both newcomers and seasoned practitioners.

LLM pretraining · data collection · scaling laws

0 likes · 33 min read

NewBeeNLP

Sep 25, 2024 · Artificial Intelligence

From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication, to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—offering actionable tips and pitfalls to avoid.

LLM pretraining · Training Framework · data collection

0 likes · 32 min read

Baobao Algorithm Notes

Sep 24, 2024 · Artificial Intelligence

From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks you through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—highlighting common pitfalls and practical solutions for building robust models.

Curriculum Learning · Data cleaning · LLM pretraining

0 likes · 34 min read