Can a New Training Objective Make LLMs See Further and Reason Better?
The paper introduces Next‑ToBE, a modified training objective that replaces the one‑hot next‑token label with a soft distribution covering a window of future tokens. The authors report that this activates latent anticipatory capacity in large language models and improves future‑token hit rates, reasoning accuracy, and training efficiency.
Why the standard next‑token objective limits long‑range modeling
Large language models are trained with the standard next‑token prediction (NTP) objective, which supervises the model to predict only the immediate next token. This objective supports fluent text generation but hampers performance on tasks that require multi‑step planning such as mathematical reasoning, code generation, and long‑horizon decision making.
Future‑tokens Hit Rate (FtHR) metric
Even under NTP, the probability distribution at a given step assigns non‑negligible mass to tokens that appear later in the reference text. The authors introduce the Future‑tokens Hit Rate (FtHR) metric to measure how many future tokens are covered by the current distribution. Experiments show a clear correlation: the higher a future token ranks in the current distribution, the more likely it will be generated correctly later.
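To make the idea concrete, here is a minimal sketch of a hit-rate style measurement. The summary does not give the paper's exact definition of FtHR, so the top-m cutoff, the function name `future_token_hit_rate`, and the toy numbers are assumptions for illustration only.

```python
def future_token_hit_rate(probs, future_tokens, top_m):
    """Fraction of upcoming reference tokens found in the top-m of `probs`.

    probs: the model's distribution over the vocabulary at the current step.
    future_tokens: token ids that appear at later positions in the reference.
    top_m: size of the candidate set (an assumed cutoff, not from the paper).
    """
    if not future_tokens:
        return 0.0
    # Rank vocabulary ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_set = set(ranked[:top_m])
    hits = sum(1 for tok in future_tokens if tok in top_set)
    return hits / len(future_tokens)


# Toy example: a 5-token vocabulary and three upcoming reference tokens.
probs = [0.05, 0.40, 0.10, 0.25, 0.20]
future = [1, 3, 0]
rate = future_token_hit_rate(probs, future, top_m=3)
print(round(rate, 3))
```

In this toy case tokens 1 and 3 fall inside the top three candidates and token 0 does not, so two of the three future tokens count as hits.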
Next‑ToBE: exploiting anticipatory capacity via a soft target
Next‑ToBE keeps the original one‑hot next‑token label as the primary supervision term but augments it with a soft target distribution over a window of k‑1 future tokens. The loss is a weighted sum of the standard NTP loss and the future‑token soft loss, where the weight λ balances the two components.
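The combination of the two terms can be sketched as follows. This assumes the form L = L_NTP + λ · L_future, with both terms computed as cross-entropy against the same output distribution; the summary only says the loss is a weighted sum, so the exact form and the helper names here are hypothetical.

```python
import math


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probs."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


def combined_loss(probs, next_token, soft_target, lam):
    """Assumed form: NTP loss plus lam times the future-token soft loss."""
    one_hot = [1.0 if i == next_token else 0.0 for i in range(len(probs))]
    ntp_loss = cross_entropy(one_hot, probs)          # standard next-token term
    future_loss = cross_entropy(soft_target, probs)   # soft future-window term
    return ntp_loss + lam * future_loss


# Toy example: 3-token vocabulary, next token is id 0,
# and the future-window soft target is split between ids 1 and 2.
probs = [0.5, 0.25, 0.25]
soft_target = [0.0, 0.5, 0.5]
loss = combined_loss(probs, next_token=0, soft_target=soft_target, lam=0.5)
print(round(loss, 4))
```

Setting `lam` to 0 recovers the plain next-token loss under this assumed form, which makes the role of λ as a balance knob easy to see.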
The weighting of each future token is determined by two factors:
Model’s own prior preference: tokens that the model already assigns higher probability to receive more weight.
Temporal‑semantic relation: tokens closer in time and more semantically related to the current context receive higher weight.
This design aligns the auxiliary supervision with the model’s latent anticipatory structure rather than treating all future tokens equally.
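One way the two weighting factors could be combined is sketched below. It multiplies the model's current probability for each future token by a geometric decay in its distance from the current position. The decay form, the `decay` parameter, and the function name are assumptions, and the semantic-relatedness component is omitted because the summary does not say how it is computed.

```python
def build_soft_target(probs, future_tokens, decay=0.5):
    """Build a normalized soft target over the future-token window.

    probs: the model's current distribution over the vocabulary (its prior).
    future_tokens: token ids in the future window, nearest position first.
    decay: assumed geometric discount applied per step of distance.
    """
    weights = [0.0] * len(probs)
    for offset, tok in enumerate(future_tokens):
        # Higher model prior and closer position both increase the weight.
        weights[tok] += probs[tok] * (decay ** offset)
    total = sum(weights)
    if total == 0.0:
        return weights  # no usable mass: return all zeros
    return [w / total for w in weights]


# Toy example: 4-token vocabulary, future window holds token 1 then token 3.
probs = [0.1, 0.4, 0.2, 0.3]
target = build_soft_target(probs, future_tokens=[1, 3], decay=0.5)
print([round(t, 3) for t in target])
```

Tokens outside the window get zero weight, and a token that appears sooner or that the model already favors ends up with a larger share of the soft target.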
How Next‑ToBE improves anticipatory ability
By retaining the next‑token head and adding the future‑window soft targets, the model is encouraged to distribute probability mass over several upcoming steps, leading to smoother and more consistent predictions. This contrasts with Multi‑token Prediction (MTP) methods that add extra heads for each future position.
Experimental validation
The authors investigate three research questions: (1) Does Next‑ToBE enhance the model’s perception of future tokens? (2) Does this translate into more accurate downstream generation? (3) Does it improve performance on complex reasoning tasks?
After fine‑tuning with Next‑ToBE, FtHR increases markedly, and the accuracy of generating k future steps improves, while the confidence of the immediate next token slightly decreases.
Downstream evaluation on three base models (Qwen2.5‑Math‑1.5B, Qwen2.5‑Math‑7B, Llama3.1‑8B‑Instruct) across 36 task settings (mathematical reasoning, code generation, commonsense reasoning) shows Next‑ToBE achieving the best result in 35 cases.
The study also examines the effect of the λ hyper‑parameter. As λ grows, next‑token confidence declines, yet task accuracy follows an inverted‑U shape: it first rises then falls, indicating that a moderate reduction in local certainty benefits long‑range reasoning.
From‑scratch training
Next‑ToBE also proves effective when applied during pre‑training. Experiments with GPT‑2 (124M) on WikiText‑103 demonstrate improvements in FtHR and HellaSwag accuracy, albeit with a slight increase in perplexity, confirming that anticipatory ability can be deliberately cultivated from the ground up.
Efficiency gains
Compared with representative MTP methods, Next‑ToBE reduces peak memory consumption by up to 68% and shortens training time by up to 15%, while delivering superior performance.
Paper reference
Paper title: Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs
Paper link: https://openreview.net/pdf?id=T8IJojfaOh
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.