How USTC’s Tiny LCPO Training Cuts Large Model Overthinking in Half

The paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 training examples and 50 steps to teach large language models to produce concise, accurate answers, halving inference length while often improving accuracy and reducing training cost by up to two orders of magnitude.

Data Party THU

Problem: Verbose reasoning in large models

Large reasoning models such as DeepSeek‑R1 and Qwen‑32B generate long chains of thought (CoT) even for simple questions, which increases compute cost and sometimes causes errors, a phenomenon called “overthinking”. Existing fixes either truncate the output at inference time, which is unstable, or rely on large‑scale online reinforcement learning, which requires hundreds of thousands of examples and thousands of GPU hours.

Key observation: Models already contain a concise mode

Sampling 16 answers per question from DeepSeek‑R1‑Distill‑Qwen‑7B and ranking them by length shows that the shortest answers are almost as accurate as the full set, while the longest answers suffer a sharp drop in accuracy. This indicates that the model can already produce correct short reasoning if guided appropriately.
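The length-grouping analysis above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_length_rank`, the input layout, and the toy numbers are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_length_rank(samples):
    """Accuracy per length rank (0 = shortest answer of each question).

    samples: dict mapping a question id to a list of
             (num_tokens, is_correct) pairs, one pair per sampled answer.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for answers in samples.values():
        # Rank this question's sampled answers from shortest to longest.
        ranked = sorted(answers, key=lambda a: a[0])
        for rank, (_, is_correct) in enumerate(ranked):
            hits[rank] += int(is_correct)
            totals[rank] += 1
    return {rank: hits[rank] / totals[rank] for rank in sorted(totals)}


# Hypothetical toy data: 4 samples per question (the article samples 16).
toy = {
    "q1": [(120, True), (300, True), (900, False), (450, True)],
    "q2": [(200, True), (150, True), (1100, False), (700, True)],
}
by_rank = accuracy_by_length_rank(toy)
print(by_rank)
```

With real samples, a flat curve at the low ranks and a drop at the highest ranks would correspond to the pattern the article describes.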

Length grouping experiment showing accuracy versus answer length

Method: Length Controlled Preference Optimization (LCPO)

LCPO follows a three‑step pipeline:

Data selection: The model’s own answer correctness serves as a difficulty label, dividing math problems into Easy (fully correct), Medium (partially correct), and Difficult (incorrect). Only the Easy subset is kept. For each Easy problem, the shortest correct answer becomes the positive example and the longest answer becomes the negative example. Of 22k raw samples, only 800 are used for optimization.

Algorithmic innovation: An analysis of existing preference‑optimization objectives (DPO, SimPO, ORPO) shows that their implicit negative log‑likelihood (NLL) term interferes with learning a length preference. LCPO rebalances the influence of this NLL term directly, so the model can focus on the length preference without extra hyper‑parameters.
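The data-selection step described above can be sketched as follows. This is a rough illustration under stated assumptions: "Easy" is read here as "every sampled answer is correct", the record fields (`text`, `num_tokens`, `correct`) and the function name `build_length_pairs` are hypothetical, and the further reduction from the Easy subset down to 800 pairs is not shown because the article does not describe it.

```python
def build_length_pairs(samples):
    """Build (prompt, chosen, rejected) preference pairs from Easy questions.

    samples: dict mapping a question string to a list of dicts with keys
             'text', 'num_tokens', and 'correct' (one dict per sampled answer).
    """
    pairs = []
    for question, answers in samples.items():
        # Assumption: a question is Easy only if all sampled answers are correct.
        if not answers or not all(a["correct"] for a in answers):
            continue  # Medium or Difficult questions are dropped
        shortest = min(answers, key=lambda a: a["num_tokens"])
        longest = max(answers, key=lambda a: a["num_tokens"])
        if shortest["num_tokens"] == longest["num_tokens"]:
            continue  # no length contrast to learn from (assumption)
        pairs.append({
            "prompt": question,
            "chosen": shortest["text"],    # shortest correct answer
            "rejected": longest["text"],   # longest answer
        })
    return pairs


# Hypothetical toy samples.
toy = {
    "What is 2 + 3?": [
        {"text": "2 + 3 = 5.", "num_tokens": 40, "correct": True},
        {"text": "Let me double-check several times... 5.", "num_tokens": 400, "correct": True},
    ],
    "A harder problem": [
        {"text": "wrong answer", "num_tokens": 900, "correct": False},
        {"text": "right answer", "num_tokens": 700, "correct": True},
    ],
}
pairs = build_length_pairs(toy)
print(pairs)
```

The `prompt` / `chosen` / `rejected` layout is a common convention for preference datasets; whether the authors store their pairs this way is not stated in the article.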

Diagram of NLL loss interfering with length preference
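For context, the standard DPO objective is written out below. This is the published DPO loss, not the LCPO loss, which this summary does not spell out. Here $y_w$ would be the shortest correct answer and $y_l$ the longest answer for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is the usual DPO temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right],
\qquad
\log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)
```

Each sequence log-likelihood is a sum over $|y|$ tokens, so its magnitude tends to grow with answer length. One plausible reading of the interference claim is that, when the two answers differ mainly in length, these likelihood terms are entangled with the length signal the pairs are meant to teach; the paper gives the actual analysis and the resulting objective.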

Training efficiency

Compared with online RL baselines, LCPO reduces data requirements by one to two orders of magnitude and cuts total training time to about 10.4 A100‑GPU hours (versus thousands of hours for RL). No hyper‑parameter tuning is needed; the method works out‑of‑the‑box.

Resource comparison showing LCPO’s lower cost

Empirical results

On DeepSeek‑R1‑Distill‑Qwen‑1.5B/7B the method halves the average generated token count while keeping accuracy essentially unchanged. On out‑of‑distribution benchmarks (MMLU, GPQA‑Diamond, WinoGrande) the model still reduces length by over 55 % and shows a modest accuracy gain, suggesting that LCPO learns a general “efficient reasoning” habit rather than task‑specific memorization.
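The two quantities reported here, relative length reduction and accuracy change, can be computed as in the sketch below. The helper names and the toy numbers are illustrative assumptions, not figures from the paper.

```python
def length_reduction_pct(tokens_before, tokens_after):
    """Relative reduction in token count, in percent."""
    return 100.0 * (tokens_before - tokens_after) / tokens_before


def summarize(before, after):
    """Compare two evaluation runs over the same questions.

    before / after: lists of (num_tokens, is_correct) pairs, one per question.
    """
    mean_before = sum(n for n, _ in before) / len(before)
    mean_after = sum(n for n, _ in after) / len(after)
    acc_before = sum(int(c) for _, c in before) / len(before)
    acc_after = sum(int(c) for _, c in after) / len(after)
    return {
        "length_reduction_pct": length_reduction_pct(mean_before, mean_after),
        "accuracy_delta": acc_after - acc_before,
    }


# Hypothetical runs: base model vs. the model after preference training.
base_run = [(1000, True), (2000, False)]
tuned_run = [(400, True), (800, True)]
report = summarize(base_run, tuned_run)
print(report)
```

A `length_reduction_pct` of about 50 with an `accuracy_delta` near zero would match the "halved length, unchanged accuracy" result described above.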

Main experiment results showing length reduction and accuracy

A concrete case study on a simple algebra problem shows that before LCPO the model performed eight verification steps and spent many tokens; after LCPO it performed a single verification, cutting token usage by 79.37 % while still arriving at the correct answer.

Case study illustrating token reduction

Insights and outlook

The work demonstrates that large models inherently possess concise reasoning paths; the challenge is to surface them with lightweight preference signals. LCPO opens a new direction for low‑cost alignment of large models toward efficient behavior, promising faster API calls, lower inference costs, and fewer errors caused by overthinking.

Paper: https://arxiv.org/abs/2508.10164 (ICLR 2026). Code: https://github.com/SleepyWithoutCoffee/Small_Scale
