Can a New Training Objective Make LLMs See Further and Reason Better?

The paper introduces Next‑ToBE, a modification to the training objective that supplements the one‑hot next‑token label with a soft distribution over a window of future tokens. The authors report that this activates latent anticipatory capacity in large language models and yields gains in future‑token hit rate, reasoning accuracy, and training efficiency.

Machine Learning Algorithms & Natural Language Processing

Why the standard next‑token objective limits long‑range modeling

Large language models are trained with the standard next‑token prediction (NTP) objective, which supervises the model to predict only the immediate next token. This objective supports fluent text generation but hampers performance on tasks that require multi‑step planning such as mathematical reasoning, code generation, and long‑horizon decision making.

Future‑tokens Hit Rate (FtHR) metric

Even under NTP, the probability distribution at a given step assigns non‑negligible mass to tokens that appear later in the reference text. The authors introduce the Future‑tokens Hit Rate (FtHR) metric to measure how many future tokens are covered by the current distribution. Experiments show a clear correlation: the higher a future token ranks in the current distribution, the more likely it will be generated correctly later.
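
As a rough illustration of the idea behind FtHR, the sketch below counts how many tokens from the upcoming reference window already rank among the top candidates of the current distribution. This summary does not give the paper's exact definition, so the top-n membership test, the function name future_token_hit_rate, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def future_token_hit_rate(probs, future_tokens, top_n):
    """Fraction of distinct future tokens that rank in the top_n of probs.

    Assumption: a "hit" means top-n membership in the current distribution;
    the paper may define coverage differently.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = set(ranked[:top_n])
    distinct = set(future_tokens)
    if not distinct:
        return 0.0
    return sum(1 for tok in distinct if tok in top) / len(distinct)

# Toy vocabulary of 6 token ids; logits produced at the current step.
logits = [2.0, 0.5, 1.5, -1.0, 1.0, 0.0]
probs = softmax(logits)

# Reference continuation: the 3 tokens that follow the current position.
future = [0, 4, 3]

# Tokens 0 and 4 are among the 3 most probable ids, token 3 is not.
fthr = future_token_hit_rate(probs, future, top_n=3)
```

Averaging this quantity over many positions would give a corpus-level hit rate; the rank of each future token in `ranked` is the quantity the correlation claim above refers to.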

Figure 1: Current prediction already covers a substantial proportion of future tokens; higher rank predicts higher correctness

Next‑ToBE: exploiting anticipatory capacity via a soft target

Next‑ToBE keeps the original one‑hot next‑token label as the primary supervision term but augments it with a soft target distribution over a window of k‑1 future tokens. The loss is a weighted sum of the standard NTP loss and the future‑token soft loss, where the weight λ balances the two components.

The weighting of each future token is determined by two factors:

Model’s own prior preference: tokens that the model already assigns higher probability to receive more weight.

Temporal‑semantic relation: tokens closer in time and more semantically related to the current context receive higher weight.

This design aligns the auxiliary supervision with the model’s latent anticipatory structure rather than treating all future tokens equally.
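
The two weighting factors and the combined loss can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the product of model prior and an exponential distance decay gamma, the omission of the semantic-relatedness term, the plain sum L_NTP + λ · L_future, and the names soft_future_target and next_tobe_style_loss are all hypothetical choices made for illustration.

```python
import math

def soft_future_target(probs, future_tokens, gamma=0.5):
    """Build a soft target over the k-1 future tokens that follow the next token.

    Assumed weighting: model prior probability * gamma ** distance, where
    distance is 1 for the first token after the next token. The semantic
    part of the "temporal-semantic relation" is not modelled here.
    probs must be strictly positive (for example, a softmax output).
    """
    raw = {}
    for distance, tok in enumerate(future_tokens, start=1):
        weight = probs[tok] * (gamma ** distance)
        raw[tok] = raw.get(tok, 0.0) + weight  # repeated tokens accumulate
    total = sum(raw.values())
    return {tok: w / total for tok, w in raw.items()}

def next_tobe_style_loss(probs, next_token, future_tokens, lam=0.3, gamma=0.5):
    """Weighted sum of the NTP loss and a soft future-token cross-entropy."""
    ntp_loss = -math.log(probs[next_token])
    target = soft_future_target(probs, future_tokens, gamma)
    future_loss = -sum(q * math.log(probs[tok]) for tok, q in target.items())
    return ntp_loss + lam * future_loss, target

# Toy 4-token vocabulary: the next token is id 0, followed by ids 1 and 3.
probs = [0.5, 0.2, 0.2, 0.1]
loss, target = next_tobe_style_loss(probs, next_token=0, future_tokens=[1, 3])
# Raw weights: token 1 -> 0.2 * 0.5 = 0.1, token 3 -> 0.1 * 0.25 = 0.025,
# so the normalized soft target is {1: 0.8, 3: 0.2}.
```

In an autograd framework the prior probabilities used to build the target would presumably be treated as constants, so that gradients flow only through the cross-entropy terms; that detail is also an assumption here.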

Figure 2: Overall architecture of Next‑ToBE

How Next‑ToBE improves anticipatory ability

By retaining the next‑token head and adding the future‑window soft targets, the model is encouraged to distribute probability mass over several upcoming steps, leading to smoother and more consistent predictions. This contrasts with Multi‑token Prediction (MTP) methods that add extra heads for each future position.

Experimental validation

The authors address three research questions: (1) Does Next‑ToBE enhance the model’s perception of future tokens? (2) Does this translate into more accurate downstream generation? (3) Does it improve performance on complex reasoning tasks?

After fine‑tuning with Next‑ToBE, FtHR increases markedly, and the accuracy of generating k future steps improves, while the confidence of the immediate next token slightly decreases.

Figure 3: (a) Future token hit rate boost; (b) Autoregressive k‑step accuracy rise; (c) Slight drop in next‑token confidence

Downstream evaluation on three base models (Qwen2.5‑Math‑1.5B, Qwen2.5‑Math‑7B, Llama3.1‑8B‑Instruct) across 36 task settings (mathematical reasoning, code generation, commonsense reasoning) shows Next‑ToBE achieving the best result in 35 cases.

Table 1: Math reasoning comparison – Next‑ToBE attains highest average scores
Table 2: Code generation and commonsense reasoning results

The study also examines the effect of the λ hyper‑parameter. As λ grows, next‑token confidence declines, yet task accuracy follows an inverted‑U shape: it first rises then falls, indicating that a moderate reduction in local certainty benefits long‑range reasoning.
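
The confidence side of this trade-off follows from the shape of the loss alone. Assuming the combined loss at one position is -log p(next) - λ · Σ q(v) log p(v), with q a soft target that places no mass on the next token, the distribution that minimizes it is proportional to the target weights, so the ideal next-token probability is 1 / (1 + λ). The sketch below works through that arithmetic; the loss form and the helper name optimal_distribution are assumptions, and the rise-then-fall in task accuracy is an empirical finding that this calculation does not explain.

```python
def optimal_distribution(next_token, soft_target, lam):
    """Minimizer of -log p[next] - lam * sum_v q[v] * log p[v] over distributions p.

    A weighted sum of negative log-probabilities is minimized by setting each
    probability proportional to its weight, so p is the normalized mixture of
    the one-hot label (weight 1) and the soft target (weight lam).
    """
    weights = {next_token: 1.0}
    for tok, q in soft_target.items():
        weights[tok] = weights.get(tok, 0.0) + lam * q
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Soft target over two future tokens; the next token is id 0.
soft_target = {1: 0.8, 3: 0.2}

# Ideal next-token confidence for increasing lambda: 1 / (1 + lambda).
confidences = [
    optimal_distribution(0, soft_target, lam)[0]
    for lam in (0.0, 0.25, 1.0)
]
# confidences is [1.0, 0.8, 0.5] up to floating-point rounding
```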

Figure 4: Larger λ lowers next‑token confidence (left) but yields a rise‑then‑fall pattern in reasoning accuracy (center/right)

From‑scratch training

Next‑ToBE also proves effective when applied during pre‑training. Experiments with GPT‑2 (124M) on WikiText‑103 demonstrate improvements in FtHR and HellaSwag accuracy, albeit with a slight increase in perplexity, confirming that anticipatory ability can be deliberately cultivated from the ground up.

Table 3: GPT‑2 pre‑training results – Next‑ToBE improves FtHR and HellaSwag accuracy

Efficiency gains

Compared with representative MTP methods, Next‑ToBE reduces peak memory consumption by up to 68% and shortens training time by up to 15%, while delivering superior performance.

Table 4: Training time and peak memory comparison on Qwen2.5‑Math‑1.5B

Paper reference

Paper title: Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs
Paper link: https://openreview.net/pdf?id=T8IJojfaOh