Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning
The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.
Recent successes of large‑scale reinforcement learning for large language models (e.g., OpenAI o1, DeepSeek‑R1) demonstrate its effectiveness, but training still suffers from performance bottlenecks, low sample‑utilization efficiency, and difficulty handling mixed math‑and‑code data.
To address these challenges, the Kwaipilot team proposes SRPO (Two‑Stage History‑Resampling Policy Optimization). Stage 1 concentrates on challenging mathematical data to elicit strong reasoning abilities; Stage 2 introduces code data to integrate programming skills while preserving the reasoning foundation built in Stage 1.
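The two-stage curriculum can be pictured as a simple data-mix schedule. The sketch below is illustrative only: the stage names, mixing ratios, and helper function are assumptions, not values from the paper.

```python
import random

# Illustrative two-stage data schedule (stage names and mix ratios are assumptions):
# Stage 1 trains purely on challenging math to elicit reasoning; Stage 2 blends in
# code data while keeping math in the mix to preserve the reasoning foundation.
STAGES = [
    {"name": "stage1_math", "data_mix": {"math": 1.0, "code": 0.0}},
    {"name": "stage2_mixed", "data_mix": {"math": 0.5, "code": 0.5}},
]

def sample_domain(stage, rng):
    """Pick the data domain for the next prompt batch according to the stage's mix."""
    domains, weights = zip(*stage["data_mix"].items())
    return rng.choices(domains, weights=weights, k=1)[0]
```

In Stage 1 the sampler always returns math prompts; once training switches to Stage 2, batches alternate between domains according to the configured ratio.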
The core innovation is History Resampling. After each epoch the system records all rollout rewards, discards overly easy samples that provide no gradient signal, and retains informative or hard samples (including those that were initially all‑wrong). This increases reward variance and ensures effective gradient updates.
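The filtering step can be sketched in a few lines. This is a minimal illustration assuming binary accuracy rewards and a simple record format; the names and data layout are assumptions, not the paper's API.

```python
def history_resample(records, max_reward=1.0):
    """Keep only samples that can still produce a gradient signal next epoch.

    records: list of dicts, each with a 'prompt' and the 'rewards' its
    rollouts earned in the last epoch (assumed format, for illustration).
    Under a group-relative baseline, a sample whose rollouts are all correct
    has zero reward variance and thus zero advantage, so it is discarded;
    samples with mixed outcomes, and hard all-wrong samples, are retained.
    """
    kept = []
    for rec in records:
        if all(r >= max_reward for r in rec["rewards"]):
            continue  # uniformly solved: no variance, no useful gradient
        kept.append(rec)  # informative (mixed) or hard (all-wrong) sample
    return kept
```

Keeping the all-wrong samples is the notable choice: they contribute nothing now, but as the policy improves they become the mixed-outcome samples that drive later epochs.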
Reward design combines a format reward (ensuring the answer follows a strict JSON‑like format), an accuracy reward (math verification or code correctness), and a penalty for mixed‑language output. Training uses the Qwen‑2.5‑Base‑32B checkpoint, the AdamW optimizer (β₁ = 0.9, β₂ = 0.95), a constant learning rate of 1e‑6, vLLM rollouts (256 prompts × 32 rollouts each), token‑level loss, and removes the KL term to encourage exploration.
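A reward of this shape might be combined as follows. The weights, the malformed-output penalty, and the `check_answer` callback are all assumptions for illustration; the paper does not publish these exact values.

```python
import json

def total_reward(response, check_answer, lang_ok=True):
    """Combine format, accuracy, and language rewards (weights are assumptions).

    response: the model's raw output, expected to be a JSON object with an
    'answer' field. check_answer: a verifier callback (math checker or code
    test harness). lang_ok: whether the output stayed in a single language.
    """
    try:
        parsed = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return -1.0  # malformed output: penalized, no further credit (assumed)
    fmt = 1.0 if "answer" in parsed else 0.0
    acc = 1.0 if fmt and check_answer(parsed["answer"]) else 0.0
    lang_penalty = 0.0 if lang_ok else -0.5  # mixed-language penalty (assumed weight)
    return fmt + acc + lang_penalty
```

For example, a well-formed, correct, single-language answer scores the full format plus accuracy reward, while a correct answer that mixes languages loses part of it to the penalty.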
Experimental results show SRPO‑Qwen‑32B surpasses DeepSeek‑R1‑Zero‑32B on AIME24 (score 50) and LiveCodeBench (score 41.6) while using only one‑tenth of the training steps. Analyses also reveal emergent self‑reflection patterns—recheck, hesitation, and exploration—demonstrating the model’s ability to self‑verify and iteratively improve its solutions.
The paper concludes that SRPO offers a scalable, cross‑domain reinforcement‑learning framework for LLMs, and the authors release the SRPO‑Qwen‑32B model and code to facilitate further research.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.