Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning
The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.
Recent successes of large‑scale reinforcement learning for large language models (e.g., OpenAI o1, DeepSeek‑R1) demonstrate its effectiveness, but training still suffers from performance bottlenecks, low sample‑utilization efficiency, and difficulty handling mixed math‑and‑code data.
To address these challenges, the Kwaipilot team proposes SRPO (Two‑Stage History‑Resampling Policy Optimization). Stage 1 concentrates on challenging mathematical data to elicit strong reasoning abilities; Stage 2 introduces code data to integrate programming skills while preserving the reasoning foundation built in Stage 1.
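The two-stage curriculum can be pictured as a simple data-mix schedule. The sketch below is illustrative only: the stage names, mixing ratios, and helper function are assumptions, not values from the paper.

```python
import random

# Illustrative two-stage data schedule (stage names and mix ratios are assumptions):
# Stage 1 trains purely on challenging math to elicit reasoning; Stage 2 blends in
# code data while keeping math in the mix to preserve the reasoning foundation.
STAGES = [
    {"name": "stage1_math", "data_mix": {"math": 1.0, "code": 0.0}},
    {"name": "stage2_mixed", "data_mix": {"math": 0.5, "code": 0.5}},
]

def sample_domain(stage, rng):
    """Pick the data domain for the next prompt batch according to the stage's mix."""
    domains, weights = zip(*stage["data_mix"].items())
    return rng.choices(domains, weights=weights, k=1)[0]
```

In Stage 1 the sampler always returns math prompts; once training switches to Stage 2, batches alternate between domains according to the configured ratio.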
The core innovation is History Resampling. After each epoch the system records all rollout rewards, discards overly easy samples that provide no gradient signal, and retains informative or hard samples (including those that were initially all‑wrong). This increases reward variance and ensures effective gradient updates.
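The filtering step can be sketched in a few lines. This is a minimal illustration assuming binary accuracy rewards and a simple record format; the names and data layout are assumptions, not the paper's API.

```python
def history_resample(records, max_reward=1.0):
    """Keep only samples that can still produce a gradient signal next epoch.

    records: list of dicts, each with a 'prompt' and the 'rewards' its
    rollouts earned in the last epoch (assumed format, for illustration).
    Under a group-relative baseline, a sample whose rollouts are all correct
    has zero reward variance and thus zero advantage, so it is discarded;
    samples with mixed outcomes, and hard all-wrong samples, are retained.
    """
    kept = []
    for rec in records:
        if all(r >= max_reward for r in rec["rewards"]):
            continue  # uniformly solved: no variance, no useful gradient
        kept.append(rec)  # informative (mixed) or hard (all-wrong) sample
    return kept
```

Keeping the all-wrong samples is the notable choice: they contribute nothing now, but as the policy improves they become the mixed-outcome samples that drive later epochs.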
Reward design combines a format reward (ensuring the answer follows a strict JSON‑like format), an accuracy reward (math verification or code correctness), and a penalty for mixed‑language output. Training uses the Qwen‑2.5‑Base‑32B checkpoint, the AdamW optimizer (β₁ = 0.9, β₂ = 0.95), a constant learning rate of 1e‑6, vLLM rollouts (256 prompts × 32 rollouts each), token‑level loss, and removes the KL term to encourage exploration.
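A reward of this shape might be combined as follows. The weights, the malformed-output penalty, and the `check_answer` callback are all assumptions for illustration; the paper does not publish these exact values.

```python
import json

def total_reward(response, check_answer, lang_ok=True):
    """Combine format, accuracy, and language rewards (weights are assumptions).

    response: the model's raw output, expected to be a JSON object with an
    'answer' field. check_answer: a verifier callback (math checker or code
    test harness). lang_ok: whether the output stayed in a single language.
    """
    try:
        parsed = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return -1.0  # malformed output: penalized, no further credit (assumed)
    fmt = 1.0 if "answer" in parsed else 0.0
    acc = 1.0 if fmt and check_answer(parsed["answer"]) else 0.0
    lang_penalty = 0.0 if lang_ok else -0.5  # mixed-language penalty (assumed weight)
    return fmt + acc + lang_penalty
```

For example, a well-formed, correct, single-language answer scores the full format plus accuracy reward, while a correct answer that mixes languages loses part of it to the penalty.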
Experimental results show SRPO‑Qwen‑32B surpasses DeepSeek‑R1‑Zero‑32B on AIME24 (score 50) and LiveCodeBench (score 41.6) while using only one‑tenth of the training steps. Analyses also reveal emergent self‑reflection patterns—recheck, hesitation, and exploration—demonstrating the model’s ability to self‑verify and iteratively improve its solutions.
The paper concludes that SRPO offers a scalable, cross‑domain reinforcement‑learning framework for LLMs, and the authors release the SRPO‑Qwen‑32B model and code to facilitate further research.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.