Next-ToBE: Enabling Overconfident LLMs to See Further and Reason More Accurately

The ICLR 2026 paper introduces Next‑ToBE, a modification to the training objective that supplements the one‑hot next‑token label with a soft distribution over a window of future tokens. According to the authors, this unlocks latent foresight in LLMs: it raises the future‑token hit rate and downstream reasoning performance, and it needs less training memory and time than multi‑token prediction baselines.

Machine Learning Algorithms & Natural Language Processing

May 25, 2026

Limitations of standard next‑token prediction

Large language models are typically trained with the next‑token prediction (NTP) objective, which supervises only the immediate next token. This objective yields fluent text generation but limits performance on tasks that require long‑range planning such as mathematical reasoning, code generation, and multi‑step decision making.

Future‑tokens Hit Rate (FtHR) metric

To quantify how much information about future tokens is already present in the current prediction distribution, the authors introduce the Future‑tokens Hit Rate (FtHR) metric. Experiments show that even under standard NTP the distribution covers a substantial portion of future tokens, and a higher rank of a future token in the current distribution correlates with a higher probability of being generated correctly later.

Figure 1: (left) current step distribution already covers a large proportion of future tokens; (right) higher rank of a future token in the current distribution increases its chance of correct later generation.
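The idea behind FtHR can be illustrated with a few lines of Python. This is a minimal sketch under an assumed definition, since the summary does not give the exact formula: for each step, count how many of the k tokens that follow the immediate next token already appear among the top‑n tokens of the current distribution. The function names and the parameters `k` and `n` are hypothetical.

```python
from typing import Sequence, Set


def top_n_ids(probs: Sequence[float], n: int) -> Set[int]:
    """Return the ids of the n highest-probability tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:n])


def future_token_hit_rate(
    step_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int,
    n: int,
) -> float:
    """Assumed FtHR: share of future tokens found in the current top-n.

    step_probs[t] is the predicted distribution at step t, and targets[t]
    is its next-token label. The future tokens for step t are taken to be
    targets[t+1 : t+1+k] (an assumption, not the paper's stated definition).
    """
    hits = 0
    total = 0
    for t, probs in enumerate(step_probs):
        candidates = top_n_ids(probs, n)
        for future_id in targets[t + 1 : t + 1 + k]:
            if future_id in candidates:
                hits += 1
            total += 1
    return hits / total if total else 0.0


# Toy example: vocabulary of 4 tokens, 3 decoding steps.
step_probs = [
    [0.60, 0.30, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.10],
    [0.25, 0.05, 0.10, 0.60],
]
targets = [0, 1, 3]
print(future_token_hit_rate(step_probs, targets, k=2, n=2))
```

In this toy input, three future tokens are checked in total and only one of them sits in the current top‑2, so the hit rate is 1/3.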

Next‑ToBE training objective

Next‑ToBE keeps the original one‑hot next‑token loss as the primary term and adds a soft target distribution over a window of k future tokens. The auxiliary loss weight for each future token is determined by two factors:

Model’s prior preference: tokens that already receive higher probability from the model are given larger weight.

Temporal‑semantic relationship: tokens that are temporally closer to the current step and semantically more related to the context receive higher weight.

The combined loss is L = L_{next} + λ·L_{future}, where λ balances the primary next‑token term and the auxiliary future‑token term. No additional prediction heads are introduced, so the model architecture remains unchanged.
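The combined loss can be sketched in plain Python for a single position. This is an illustrative sketch, not the authors' implementation: the summary does not give the exact weighting formula, so the product of model probability and a geometric decay is an assumption, the semantic‑relatedness factor is left out, and the names `future_soft_target`, `next_tobe_loss`, `lam`, and `decay` are hypothetical.

```python
import math
from typing import Dict, Sequence

EPS = 1e-12  # guards against log(0)


def future_soft_target(
    probs: Sequence[float], future_ids: Sequence[int], decay: float = 0.5
) -> Dict[int, float]:
    """Soft target over the future-token window (assumed weighting).

    Weight = model probability (prior preference) * decay**offset
    (temporal closeness). In real training the weights would normally be
    treated as constants, i.e. no gradient flows through them.
    """
    weights: Dict[int, float] = {}
    for offset, tok in enumerate(future_ids):
        w = probs[tok] * decay ** offset
        weights[tok] = weights.get(tok, 0.0) + w  # merge repeated tokens
    z = sum(weights.values())
    if z <= 0.0:
        return {}
    return {tok: w / z for tok, w in weights.items()}


def next_tobe_loss(
    probs: Sequence[float],
    next_id: int,
    future_ids: Sequence[int],
    lam: float = 0.3,
    decay: float = 0.5,
) -> float:
    """L = L_next + lam * L_future, both as cross-entropy terms."""
    l_next = -math.log(max(probs[next_id], EPS))
    target = future_soft_target(probs, future_ids, decay)
    l_future = -sum(
        w * math.log(max(probs[tok], EPS)) for tok, w in target.items()
    )
    return l_next + lam * l_future


probs = [0.5, 0.25, 0.125, 0.125]
print(future_soft_target(probs, future_ids=[1, 2]))
print(next_tobe_loss(probs, next_id=0, future_ids=[1, 2], lam=1.0))
```

Both terms read the same output distribution `probs`, which mirrors the point that no extra prediction head is needed. Setting `lam=0` recovers the standard next‑token loss.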

Figure 2: Overall architecture of Next‑ToBE.

Experimental setup

Fine‑tuning experiments were conducted on three base models:

Qwen2.5‑Math‑1.5B

Qwen2.5‑Math‑7B

Llama3.1‑8B‑Instruct

Each model was evaluated on three downstream task families: mathematical reasoning, code generation, and commonsense inference. For each task family, multiple benchmarks were used, yielding a total of 36 paired evaluations.

Results: foresight gains and downstream performance

After Next‑ToBE fine‑tuning, FtHR increased sharply and multi‑step generation accuracy improved in tandem, while next‑token confidence showed a modest decline.

Figure 3: (a) FtHR rise; (b) k‑step generation accuracy rise; (c) slight drop in next‑token confidence.

Across the 36 paired evaluations, Next‑ToBE achieved the best result in 35 cases.

Table 1: Mathematical reasoning results – Next‑ToBE obtains the highest average score for all three base models.

Table 2: Code generation and commonsense inference results.

Effect of the λ coefficient

Increasing λ reduces next‑token confidence but yields a non‑monotonic effect on task accuracy: accuracy first rises, then falls, forming an inverted‑U shape. This indicates that moderate encouragement of foresight improves reasoning without overly sacrificing certainty.

Figure 4: Left – next‑token confidence drops with larger λ; middle/right – task accuracy first increases then decreases.
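The drop in next‑token confidence has a simple mechanical explanation if both loss terms are cross‑entropies over the same output distribution, which is an assumption here. In that case the distribution that minimizes L = L_{next} + λ·L_{future} is the normalized mixture of the one‑hot label and the soft future target, so the mass on the next token shrinks toward 1/(1+λ). The sketch below computes that minimizer; the function name and the toy target are hypothetical, and it says nothing about the inverted‑U in task accuracy, which is an empirical finding.

```python
from typing import Dict, List


def optimal_prediction(
    vocab_size: int, next_id: int, soft_target: Dict[int, float], lam: float
) -> List[float]:
    """Minimizer of CE(one_hot, p) + lam * CE(soft_target, p) over p.

    The summed target (one_hot + lam * soft_target) is renormalized,
    which is the closed-form optimum for a cross-entropy objective.
    """
    mixed = [0.0] * vocab_size
    mixed[next_id] += 1.0
    for tok, weight in soft_target.items():
        mixed[tok] += lam * weight
    total = sum(mixed)
    return [m / total for m in mixed]


soft_target = {1: 0.8, 2: 0.2}  # toy future-token target, sums to 1
for lam in (0.0, 0.25, 1.0):
    p_star = optimal_prediction(4, next_id=0, soft_target=soft_target, lam=lam)
    print(lam, round(p_star[0], 3))
```

With this toy target, which puts no mass on the next token itself, the best achievable next‑token confidence is 1.0 at λ = 0, 0.8 at λ = 0.25, and 0.5 at λ = 1.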

Pre‑training from scratch

Next‑ToBE was also applied during pre‑training of GPT‑2 (124 M) on WikiText‑103. Compared with standard NTP and a representative Multi‑token Prediction (MTP) baseline, Next‑ToBE improved FtHR and HellaSwag accuracy, while incurring a modest increase in perplexity.

Table 3: GPT‑2 (124 M) pre‑trained on WikiText‑103 – Next‑ToBE improves FtHR and HellaSwag accuracy; perplexity slightly higher.

Efficiency advantages

Because Next‑ToBE does not add extra prediction heads, peak GPU memory usage is reduced by up to 68 % and training time by up to 15 % compared with representative MTP methods, while delivering superior performance.

Table 4: Training time and peak memory comparison on Qwen2.5‑Math‑1.5B.
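To see why extra prediction heads can be expensive, the following back‑of‑the‑envelope sketch estimates the additional memory they might require. It assumes each extra head is an unshared hidden‑to‑vocabulary projection that produces its own logits tensor, which is only one possible MTP design. All sizes are hypothetical round numbers, not the dimensions of any model in the experiments, and the result is not meant to reproduce the reported 68 % figure.

```python
from typing import Tuple


def extra_head_cost(
    num_heads: int,
    hidden: int,
    vocab: int,
    batch_tokens: int,
    bytes_per_value: int = 2,  # e.g. 16-bit floats
) -> Tuple[int, int]:
    """Rough extra bytes for unshared output heads (assumed MTP design).

    Returns (weight_bytes, logit_bytes): the projection matrices and the
    per-head logits for one batch. Gradients and optimizer state, which
    add further overhead, are ignored.
    """
    weight_bytes = num_heads * hidden * vocab * bytes_per_value
    logit_bytes = num_heads * batch_tokens * vocab * bytes_per_value
    return weight_bytes, logit_bytes


# Hypothetical sizes chosen only for illustration.
weights, logits = extra_head_cost(
    num_heads=3, hidden=2048, vocab=100_000, batch_tokens=4096
)
print(f"extra weights: {weights / 1e9:.2f} GB, extra logits: {logits / 1e9:.2f} GB")
```

An objective that reuses the existing output head, as described for Next‑ToBE, corresponds to `num_heads=0` in this estimate, so both terms vanish.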

Key insight

Standard NTP suppresses latent anticipatory information that already exists in the model’s distribution. By reshaping the supervision signal to include a weighted soft target over future tokens, Next‑ToBE re‑activates this anticipatory capacity, leading to higher FtHR, better long‑range reasoning, and more efficient training.

Paper title: Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs
Paper link: https://openreview.net/pdf?id=T8IJojfaOh
