
KwaiCoder-AutoThink-preview: An Automatic‑Thinking Large Model Enhanced with Step‑SRPO Reinforcement Learning

The KwaiPilot team has released KwaiCoder‑AutoThink‑preview, a model that introduces a novel automatic‑thinking training paradigm together with a process‑supervised reinforcement‑learning method called Step‑SRPO. The model dynamically switches between thinking and non‑thinking modes, reducing inference cost while achieving gains of up to 20 points on code and math benchmarks and handling large‑scale codebases.

Kuaishou Tech

Recently, the KwaiPilot team at Kuaishou open‑sourced the KwaiCoder‑AutoThink‑preview large model, which addresses the "over‑thinking" problem of recent deep‑thinking models. The team proposes a new automatic‑thinking training paradigm and, building on the GRPO reinforcement‑learning algorithm, introduces a process‑supervised method called Step‑SRPO to further improve performance on complex tasks.

The model combines thinking and non‑thinking abilities and automatically switches its reasoning mode according to problem difficulty. Training in this way yields improvements across multiple evaluation leaderboards, with code and mathematics scores rising by roughly 20 points when automatic thinking is enabled; even with thinking disabled, the model benefits from an improved reasoning style.

To mitigate the high inference cost of deep‑thinking models, the team designed a pre‑think stage that judges problem difficulty before deciding whether to engage in extensive reasoning. This enables deep exploration for hard problems while providing direct answers for easy ones, balancing cost and performance for high‑traffic consumer‑facing services.
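As a rough illustration, the pre‑think stage can be thought of as a routing step in the generation template. The tag names and routing function below are assumptions for the sake of the sketch, not the model's actual output format:

```python
# Hypothetical sketch of a pre-think gate. The model first emits a
# difficulty judgment, then either answers directly or opens a thinking
# block. Tag names and the routing function are illustrative assumptions.

def pre_think_template(judged_difficulty: str) -> str:
    """Pick a generation template based on the pre-think judgment."""
    if judged_difficulty == "hard":
        # Hard problems: extended reasoning inside a think block.
        return "<judge>hard</judge><think>{reasoning}</think>{answer}"
    # Easy problems: skip reasoning and answer directly, saving tokens.
    return "<judge>easy</judge>{answer}"

print(pre_think_template("hard"))   # template with a <think> block
print(pre_think_template("easy"))   # direct-answer template
```

The point of the gate is that the expensive reasoning tokens are only generated when the judged difficulty warrants them.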

The cold‑start data is generated via an agentic pipeline that creates diverse long‑ and short‑thinking examples, allowing the model to learn when to think. The two‑step training first uses this cold‑start data to instill a pre‑think capability, then applies the Step‑SRPO algorithm, an enhanced version of GRPO with intermediate process supervision, to refine the model's decision‑making and reduce unnecessary token generation.
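To make the idea concrete, cold‑start examples pairing easy prompts with direct answers and hard prompts with explicit reasoning might look like the following. The field names and tags are assumed for illustration and do not reflect the released data schema:

```python
# Illustrative shape of cold-start training examples that teach the model
# when to think; field names and tags are assumptions, not the actual schema.
cold_start_samples = [
    {
        "prompt": "What is 2 + 2?",
        "mode": "non-thinking",            # easy: answer directly
        "response": "<judge>easy</judge>4",
    },
    {
        "prompt": "Prove there are infinitely many primes.",
        "mode": "thinking",                # hard: reason before answering
        "response": (
            "<judge>hard</judge><think>Assume finitely many primes "
            "p1..pn; consider N = p1*...*pn + 1, which none divide."
            "</think>By contradiction, the set of primes is infinite."
        ),
    },
]

# Sanity check: each sample's mode must match its response format.
for s in cold_start_samples:
    has_think = "<think>" in s["response"]
    assert has_think == (s["mode"] == "thinking")
```

Mixing both response styles in one dataset is what lets the model associate difficulty with the decision to open a thinking block.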

Training details include dynamic context‑length adjustment (starting from 16K and expanding to 32K), large batch sizes, off‑policy updates, and entropy‑based KL‑loss scaling to balance exploration and exploitation. These techniques yield over 90% of R1's performance on difficult benchmarks and 10-30% gains on simpler ones.
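One plausible reading of entropy‑based KL‑loss scaling is to adjust the KL coefficient relative to a target policy entropy. The functional form, constants, and direction of scaling below are assumptions for illustration, not the team's actual formula:

```python
import math

def kl_coefficient(entropy: float,
                   target_entropy: float = 1.0,
                   base_kl: float = 0.05) -> float:
    """Scale the KL penalty by how far policy entropy sits from a target.

    Assumed behavior: when entropy is high (policy already exploring),
    tighten the KL term to stay near the reference model; when entropy
    drops (exploitation dominating), relax it so the policy can move.
    All constants and the exp() form are illustrative assumptions.
    """
    # exp() keeps the coefficient positive and smooth around the target.
    return base_kl * math.exp(entropy - target_entropy)

print(round(kl_coefficient(0.5), 4))  # low entropy -> smaller penalty
print(round(kl_coefficient(1.5), 4))  # high entropy -> larger penalty
```

The coefficient equals `base_kl` exactly at the target entropy, so the schedule acts as a soft controller around that operating point.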

Empirical results show the model achieving up to 20‑point improvements on several code and math leaderboards, successfully generating realistic Python programs (e.g., a ball bouncing inside a rotating hexagon) and handling multi‑turn dialogues for more complex scenarios.

In large‑scale real‑world tests, the model was applied to a 600k‑line backend codebase, automatically generating end‑to‑end solutions for a demanding double‑column switching feature, demonstrating the ability to navigate and modify thousands of files with high syntactic and semantic correctness.

Future work includes model distillation combined with Multi‑Token Prediction (MTP) to achieve state‑of‑the‑art performance at 1/30 of the training cost, with plans to release the full technical report and detailed training methodology.

For more details and to access the preview weights, visit the HuggingFace repository: https://huggingface.co/Kwaipilot/KwaiCoder-AutoThink-preview .

code generation · model optimization · large language model · reinforcement learning · AI research · automatic thinking
Kuaishou Tech
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
