How USTC’s Tiny LCPO Training Cuts Large Model Overthinking in Half
The paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 training examples and 50 training steps to teach large language models to produce concise, accurate answers. It halves inference length while often improving accuracy, and it reduces training cost by up to two orders of magnitude relative to online reinforcement‑learning baselines.
Problem: Verbose reasoning in large models
Large reasoning models such as DeepSeek‑R1 and Qwen‑32B generate long chains of thought (CoT) even for simple questions. This increases compute cost and sometimes causes errors, a phenomenon called “overthinking”. Existing fixes either truncate the output at inference time, which is unstable, or rely on massive online reinforcement learning, which requires hundreds of thousands of examples and thousands of GPU hours.
Key observation: Models already contain a concise mode
Sampling 16 answers per question from DeepSeek‑R1‑Distill‑Qwen‑7B and sorting them by length shows that the shortest answers retain almost the same accuracy as the full set, while the longest answers suffer a sharp drop. This indicates that the model can already produce correct short reasoning if guided appropriately.
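The length‑ranked comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `accuracy_by_length_rank`, the `(num_tokens, is_correct)` tuple layout, and the toy data are all assumptions made for the example.

```python
from statistics import mean


def accuracy_by_length_rank(samples_per_question):
    """Accuracy at each length rank, averaged over questions.

    samples_per_question: list with one inner list per question; each inner
    list holds k (num_tokens, is_correct) tuples (k = 16 in the article).
    Returns k accuracies; index 0 is the shortest answer of each question.
    """
    k = len(samples_per_question[0])
    per_rank = [[] for _ in range(k)]
    for samples in samples_per_question:
        if len(samples) != k:
            raise ValueError("every question needs the same number of samples")
        # Rank this question's answers from shortest to longest.
        ranked = sorted(samples, key=lambda s: s[0])
        for rank, (_, is_correct) in enumerate(ranked):
            per_rank[rank].append(1.0 if is_correct else 0.0)
    return [mean(values) for values in per_rank]


# Made-up toy data (4 samples per question instead of 16), for illustration only.
toy = [
    [(120, True), (80, True), (400, False), (200, True)],
    [(50, True), (300, False), (90, True), (150, True)],
]
print(accuracy_by_length_rank(toy))
```

If accuracy stays flat across the low ranks and falls only at the highest ranks, the short answers are as reliable as the average one, which is the pattern the article reports.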
Method: Length Controlled Preference Optimization (LCPO)
LCPO follows a three‑step pipeline:
Data selection: Use the model’s own answer correctness as a difficulty label, dividing math problems into Easy (all sampled answers correct), Medium (some correct), and Difficult (none correct). Only the Easy subset is kept. Within each Easy problem, the shortest correct answer becomes the positive example and the longest answer becomes the negative example. Of 22k raw samples, only 800 are used for optimization.
Algorithmic innovation: Analysis of existing preference‑optimization objectives (DPO, SimPO, ORPO) reveals that their implicit negative‑log‑likelihood (NLL) term interferes with learning a length preference. LCPO directly balances the influence of the NLL term, allowing the model to focus on the length preference without extra hyper‑parameters.
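The data‑selection step above can be sketched as follows. This is an illustrative reading of the description, not the released implementation: the function names, the dictionary fields (`prompt`, `samples`, `text`, `num_tokens`, `is_correct`), and the `chosen`/`rejected` output keys are hypothetical, and skipping problems whose answers all have the same length is an added assumption. How the final 800 pairs are picked from the Easy subset is not specified here, so the sketch does not attempt it.

```python
def difficulty(samples):
    """Label a problem by how many of the model's own samples are correct."""
    if not samples:
        raise ValueError("a problem needs at least one sampled answer")
    n_correct = sum(1 for s in samples if s["is_correct"])
    if n_correct == len(samples):
        return "easy"
    if n_correct == 0:
        return "difficult"
    return "medium"


def build_preference_pairs(problems):
    """Keep only Easy problems; pair shortest (chosen) with longest (rejected)."""
    pairs = []
    for problem in problems:
        samples = problem["samples"]
        if difficulty(samples) != "easy":
            continue
        shortest = min(samples, key=lambda s: s["num_tokens"])
        longest = max(samples, key=lambda s: s["num_tokens"])
        # Assumption: a pair with no length contrast carries no length signal.
        if shortest["num_tokens"] == longest["num_tokens"]:
            continue
        pairs.append({
            "prompt": problem["prompt"],
            "chosen": shortest["text"],
            "rejected": longest["text"],
        })
    return pairs


# Made-up toy problems, for illustration only.
toy_problems = [
    {"prompt": "2 + 2 = ?", "samples": [
        {"text": "4", "num_tokens": 1, "is_correct": True},
        {"text": "Let me check... 4", "num_tokens": 6, "is_correct": True},
    ]},
    {"prompt": "Hard problem", "samples": [
        {"text": "7", "num_tokens": 1, "is_correct": False},
        {"text": "Maybe 9", "num_tokens": 2, "is_correct": True},
    ]},
]
print(build_preference_pairs(toy_problems))
```

Because every answer to an Easy problem is correct, the chosen and rejected answers differ mainly in length, so the preference signal targets verbosity rather than correctness.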
Training efficiency
Compared with online RL baselines, LCPO reduces data requirements by one to two orders of magnitude and cuts total training time to about 10.4 A100‑GPU hours (versus thousands of hours for RL). No hyper‑parameter tuning is needed; the method works out‑of‑the‑box.
Empirical results
On DeepSeek‑R1‑Distill‑Qwen‑1.5B and 7B, the method halves the average generated token count while keeping accuracy essentially unchanged. On out‑of‑distribution benchmarks (MMLU, GPQA‑Diamond, WinoGrande), the model still reduces length by over 55% and shows a modest accuracy gain, suggesting that LCPO learns a general “efficient reasoning” habit rather than task‑specific memorization.
A concrete case study on a simple algebra problem shows that before LCPO the model performed eight verification steps and spent many tokens; after LCPO it performed a single verification, cutting token usage by 79.37% while still arriving at the correct answer.
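The two quantities reported above, relative length reduction and accuracy change, can be computed as in the sketch below. The function name, the `(num_tokens, is_correct)` input layout, and the toy numbers are assumptions for illustration; they are not results from the paper.

```python
def length_and_accuracy_change(before, after):
    """Compare two runs on the same evaluation questions.

    before, after: lists of (num_tokens, is_correct) tuples.
    Returns (relative reduction in mean token count, accuracy difference).
    """
    def avg_tokens(results):
        return sum(tokens for tokens, _ in results) / len(results)

    def accuracy(results):
        return sum(1 for _, correct in results if correct) / len(results)

    reduction = 1 - avg_tokens(after) / avg_tokens(before)
    return reduction, accuracy(after) - accuracy(before)


# Made-up toy numbers, for illustration only.
before_run = [(1000, True), (3000, False)]
after_run = [(400, True), (600, True)]
print(length_and_accuracy_change(before_run, after_run))
```

A reduction of 0.5 corresponds to the "halved" token count, and a positive second value corresponds to an accuracy gain.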
Insights and outlook
The work demonstrates that large models inherently possess concise reasoning paths; the challenge is to surface them with lightweight preference signals. LCPO opens a new direction for low‑cost alignment of large models toward efficient behavior, promising faster API calls, lower inference costs, and fewer errors caused by overthinking.
Paper: https://arxiv.org/abs/2508.10164 (ICLR 2026). Code: https://github.com/SleepyWithoutCoffee/Small_Scale
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.