Next-ToBE: Enabling Overconfident LLMs to See Further and Reason More Accurately
The ICLR 2026 paper introduces Next‑ToBE, a modification to the training objective that replaces the one‑hot next‑token label with a soft distribution over a window of future tokens. The authors report that this unlocks latent foresight in LLMs: the future‑token hit rate and downstream reasoning performance both improve, while training memory and time drop relative to multi‑token prediction baselines.
Limitations of standard next‑token prediction
Large language models are typically trained with the next‑token prediction (NTP) objective, which supervises only the immediate next token. This objective yields fluent text generation but limits performance on tasks that require long‑range planning such as mathematical reasoning, code generation, and multi‑step decision making.
Future‑tokens Hit Rate (FtHR) metric
To quantify how much information about future tokens is already present in the current prediction distribution, the authors introduce the Future‑tokens Hit Rate (FtHR) metric. Experiments show that even under standard NTP the distribution covers a substantial portion of future tokens, and a higher rank of a future token in the current distribution correlates with a higher probability of being generated correctly later.
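The summary above does not give the exact formula for FtHR, so the following is only a minimal sketch under an assumed definition: the fraction of the tokens that actually appear in the next few positions that also rank within the top-n entries of the current prediction distribution. The function name `future_token_hit_rate`, the `top_n` cutoff, and the toy numbers are all hypothetical and may differ from the paper's definition.

```python
def future_token_hit_rate(probs, future_tokens, top_n):
    """Assumed FtHR: share of future tokens found in the current top-n.

    probs         -- current-step probabilities, indexed by token id
    future_tokens -- token ids that actually occur in the following positions
    top_n         -- how many top-ranked tokens count as "covered"
    """
    if not future_tokens:
        return 0.0
    # Rank token ids from most to least probable at the current step.
    ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    top = set(ranked[:top_n])
    hits = sum(1 for t in future_tokens if t in top)
    return hits / len(future_tokens)


# Toy vocabulary of six token ids (illustrative numbers, not from the paper).
probs = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
future = [1, 4, 0]  # tokens that appear in the next three positions
rate = future_token_hit_rate(probs, future, top_n=3)
print(f"FtHR (assumed definition) = {rate:.3f}")
```

In this toy case two of the three future tokens already sit in the current top three, which is the kind of latent coverage the metric is meant to expose.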
Next‑ToBE training objective
Next‑ToBE keeps the original one‑hot next‑token loss as the primary term and adds a soft target distribution over a window of k future tokens. The auxiliary loss weight for each future token is determined by two factors:
Model’s prior preference: tokens that already receive higher probability from the model are given larger weight.
Temporal‑semantic relationship: tokens that are temporally closer to the current step and semantically more related to the context receive higher weight.
The combined loss is L = L_{next} + λ·L_{future}, where λ balances the primary next‑token term and the auxiliary future‑token term. No additional prediction heads are introduced, so the model architecture remains unchanged.
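The combined loss can be sketched for a single position as follows. This is an illustrative simplification, not the authors' implementation: it assumes each future token's weight is the model's current probability for that token multiplied by a geometric temporal decay, and it leaves out the semantic-relatedness factor entirely. The function name `next_tobe_style_loss`, the `decay` parameter, and the default values are hypothetical.

```python
import math


def next_tobe_style_loss(probs, next_token, future_tokens, lam=0.5, decay=0.7):
    """Return (L_next + lam * L_future, soft_target) for one position.

    probs         -- current-step probabilities, indexed by token id
    next_token    -- id of the true next token
    future_tokens -- ids of the following k tokens, nearest first
    lam           -- the lambda coefficient balancing the two terms
    decay         -- assumed temporal decay factor (hypothetical)
    """
    # Primary term: standard one-hot cross-entropy on the next token.
    loss_next = -math.log(probs[next_token])
    if not future_tokens:
        return loss_next, {}

    # Unnormalised weights: model prior preference x temporal closeness.
    weights = {}
    for offset, tok in enumerate(future_tokens, start=1):
        weights[tok] = weights.get(tok, 0.0) + probs[tok] * decay ** offset

    # Normalise the weights into a soft target over the future-token bag.
    total = sum(weights.values())
    soft_target = {tok: w / total for tok, w in weights.items()}

    # Auxiliary term: cross-entropy between the soft target and the model.
    loss_future = -sum(q * math.log(probs[tok]) for tok, q in soft_target.items())
    return loss_next + lam * loss_future, soft_target


# Toy example (illustrative numbers, not from the paper).
probs = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
loss, soft_target = next_tobe_style_loss(probs, next_token=1, future_tokens=[3, 4])
print(f"loss = {loss:.4f}, soft target = {soft_target}")
```

Because both terms are computed from the same output distribution, the sketch needs no extra prediction head, which matches the point that the architecture stays unchanged. Setting `lam=0` recovers plain next-token cross-entropy.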
Experimental setup
Fine‑tuning experiments were conducted on three base models:
Qwen2.5‑Math‑1.5B
Qwen2.5‑Math‑7B
Llama3.1‑8B‑Instruct
Each model was evaluated on three downstream task families: mathematical reasoning, code generation, and commonsense inference. For each task family, multiple benchmarks were used, yielding a total of 36 paired evaluations.
Results: foresight gains and downstream performance
After Next‑ToBE fine‑tuning, FtHR increased sharply and multi‑step generation accuracy improved in tandem, while next‑token confidence showed a modest decline.
Across the 36 paired evaluations, Next‑ToBE achieved the best result in 35 cases.
Effect of the λ coefficient
Increasing λ reduces next‑token confidence but yields a non‑monotonic effect on task accuracy: accuracy first rises, then falls, forming an inverted‑U shape. This indicates that moderate encouragement of foresight improves reasoning without overly sacrificing certainty.
Pre‑training from scratch
Next‑ToBE was also applied during pre‑training of GPT‑2 (124 M) on WikiText‑103. Compared with standard NTP and a representative Multi‑token Prediction (MTP) baseline, Next‑ToBE improved FtHR and HellaSwag accuracy, while incurring a modest increase in perplexity.
Efficiency advantages
Because Next‑ToBE adds no extra prediction heads, it reduces peak GPU memory usage by up to 68% and training time by up to 15% relative to representative MTP methods, while also outperforming them.
Key insight
Standard NTP suppresses latent anticipatory information that already exists in the model’s distribution. By reshaping the supervision signal to include a weighted soft target over future tokens, Next‑ToBE re‑activates this anticipatory capacity, leading to higher FtHR, better long‑range reasoning, and more efficient training.
Paper title: Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs
Paper link: https://openreview.net/pdf?id=T8IJojfaOh
