A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL
The article reviews five years of LLM‑centric reinforcement learning, tracing the evolution from early Q‑learning to PPO, then to Direct Preference Optimization, Group Relative Policy Optimization, and finally multi‑agent RL, detailing each method’s mechanics, strengths, failure modes, practical considerations, and emerging open‑source toolchains.
Brief Timeline
1989 – Q‑learning introduced, foundational for value‑based RL.
1992 – REINFORCE introduced, foundational for policy‑gradient RL.
2013‑2015 – DQN surpasses human performance on Atari, marrying deep learning with RL.
2016 – AlphaGo defeats Lee Sedol.
2017 – PPO (Proximal Policy Optimization) published by OpenAI and becomes the default RL algorithm for several years.
2017 – AlphaZero demonstrates self‑play without human data.
2022 – InstructGPT adapts PPO to fine‑tune language models with human preferences; ChatGPT launches shortly after.
PPO + RLHF Pipeline
SFT – Fine‑tune the base model on a small set of human‑written demonstrations.
Reward Model (RM) – Collect pairwise preference data ("which of these two outputs is better") and train a model r(x, y) to predict the preference.
PPO – Treat the RM as the environment, sample responses from the policy, score them with the RM, and update the policy with PPO while adding a KL penalty to keep the policy close to the SFT baseline.
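The reward-model step above is usually trained with a pairwise (Bradley–Terry) loss. The following is a minimal sketch of that loss on two scalar scores; the function name and the use of plain floats instead of model outputs are assumptions made for illustration.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    r_chosen and r_rejected stand in for the reward model's scalar scores
    r(x, y_chosen) and r(x, y_rejected) on the same prompt x.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the reward model ranks the preferred answer higher.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))  # True
```

Minimizing this loss over many labeled pairs pushes the score of the preferred response above the score of the rejected one.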
The objective maximized by the policy is expected reward − β·KL, where the KL coefficient β is the most frequently tuned hyper‑parameter. The KL term prevents collapse to a high‑reward but nonsensical distribution.
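Written out, the objective can be sketched as follows. The symbols are notation assumed here rather than taken from the article: \pi_\theta is the trainable policy, \pi_{\mathrm{SFT}} the frozen SFT reference, and \mathcal{D} the prompt distribution.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
\Big]
```

A larger β keeps the policy closer to the SFT model at the cost of less reward; a smaller β does the opposite.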
InstructGPT showed that outputs from a 1.3 B PPO‑fine‑tuned model can be preferred by human raters over those of the 175 B GPT‑3 baseline.
Practical Issues with PPO + RLHF
Four models in GPU memory – policy, frozen reference policy, reward model, and value (critic) network. If all four are the size of a 70 B policy, that is roughly 280 B parameters to hold, before counting optimizer state for the two trainable models (policy and critic).
Reward hacking – The policy exploits any systematic bias in the RM (e.g., long answers, bullet points, markdown headings) if the RM associates those patterns with high reward.
Distribution drift – The RM is trained on the original SFT outputs; as the policy drifts away from that distribution, the RM's scores become less reliable, and this degradation does not show up in the loss curve.
Hyper‑parameter fragility – Clipping ratio, KL coefficient, value‑loss weight, learning rate, group size, rollout batch size; mis‑tuning any of them silently degrades training.
PPO + RLHF is powerful but essentially a pipeline; its cost is mainly engineering, not mathematical.
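The memory pressure from holding four models can be made concrete with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions: all four models are the same size as a 70 B policy, weights are stored in bf16 (2 bytes per parameter), and optimizer state, gradients, activations, and KV cache are ignored. The helper names are hypothetical.

```python
def rlhf_param_count(policy_b: int, reward_b: int, critic_b: int) -> int:
    """Total parameters (in billions) held during PPO + RLHF."""
    # policy + frozen reference (same size as the policy) + reward model + critic
    return 2 * policy_b + reward_b + critic_b

def weight_memory_gb(params_b: int, bytes_per_param: int = 2) -> int:
    """Weights-only memory in GB: billions of params times bytes per param."""
    return params_b * bytes_per_param

total_b = rlhf_param_count(policy_b=70, reward_b=70, critic_b=70)
print(total_b, weight_memory_gb(total_b))  # 280 560
```

Real deployments often use a smaller reward model or critic, so the totals above are an upper-end illustration rather than a measurement.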
Scenarios where PPO still makes sense (given sufficient GPU budget and a high‑quality reward model or validator):
Exploration‑heavy tasks such as mathematics, code generation, or long‑range reasoning.
Availability of a robust RM or a trustworthy validator.
GPU memory sufficient to hold the four models simultaneously.
The ICML 2024 paper "Is DPO Superior to PPO for LLM Alignment?" reported that, with equal data quality, PPO outperforms DPO by ~2.5 % on math benchmarks and ~1.2 % on general benchmarks.
Direct Preference Optimization (DPO)
Rafailov et al. (2023) derived a closed‑form relationship between the optimal policy and an implicit reward under the standard RLHF assumption (Bradley–Terry preference model with KL regularization). This allows merging the reward‑model learning and PPO steps into a single supervised loss on preference triples (prompt, chosen, rejected).
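The DPO loss is a binary cross‑entropy (logistic) loss on the β‑scaled difference between the log‑probability ratios of the chosen and rejected responses, each ratio measured against the frozen reference model; no separate reward model, rollout, critic, or PPO loop is required.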
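A minimal sketch of the per-example DPO loss, assuming the four inputs are summed token log-probabilities of each full response under the policy and the frozen reference. The function and argument names are illustrative, not from any particular library.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities."""
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    logit = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logit)), written so that exp() never overflows
    if logit > 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# At initialization the policy equals the reference, so the loss is log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931
```

Because only log-probabilities of fixed responses are needed, a training step is two forward passes (policy and reference) over the chosen and rejected texts, with no sampling.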
Running DPO requires:
A frozen copy of the SFT model as a reference.
A trainable policy initialized from the same SFT checkpoint.
A dataset of triples (prompt, chosen, rejected).
In practice:
2–4× lower compute cost than PPO because rollouts are unnecessary.
Training behaves like standard fine‑tuning; loss curves are directly observable.
Style shaping (e.g., longer answers, bullet points) emerges as a side effect when the reward is binary.
The KL coefficient β remains critical; typical values lie in the 0.1–0.5 range.
Iterative DPO (re‑sampling preferences with the updated policy) yields substantially better results than a single pass.
DPO does not explore. If the correct answer never appears in the dataset, DPO cannot invent it, so it quickly hits a ceiling on discovery‑heavy tasks such as math, code, or agent trajectories.
Group Relative Policy Optimization (GRPO)
By 2024 the community shifted from "making models polite" to "making models think". Long‑chain reasoning and verifiable answer scoring became the new frontier, but the value network (critic) in PPO was an unwanted computational tax.
GRPO, introduced by DeepSeek and highlighted in DeepSeek‑R1, removes the learned value function entirely. For a given prompt x, a group of G rollouts y_1,…,y_G (typically G=8–64) is sampled. Each rollout is scored by a validator (not a learned RM). The advantage is computed by normalizing rewards within the group, effectively using the other samples as a baseline.
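The group-normalized advantage described above can be written as follows, where r_i denotes the validator's score for rollout y_i (notation assumed here for illustration):

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
```

Every token of rollout y_i then shares the same advantage A_i, which is what replaces the per-token estimates a critic would otherwise provide.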
No critic – memory usage drops roughly by half; a 7 B model that required 16 H100 GPUs now fits on 8 H100 GPUs.
Natural fit for binary verifiable rewards – when the reward is 0/1 (incorrect/correct), group whitening provides a clean contrast signal.
Advantage stability – group normalization mitigates reward‑scale issues.
Works well for reasoning‑heavy tasks (long‑chain reasoning, large G, strong validator) as seen in DeepSeek‑R1, Qwen, OLMo 3, and many fine‑tuned variants.
Common pitfalls for first‑time GRPO users:
Group size G – Larger G reduces advantage variance but linearly increases rollout cost; public settings often use G=16–32.
All‑zero or all‑one groups – If every sample succeeds or fails, the standard deviation is zero, causing exploding or vanishing advantages; add a small epsilon to the denominator and filter degenerate prompts.
KL coefficient – Setting β too low lets the policy drift into incoherent language; DeepSeek typically uses β=0.001–0.04 depending on the training stage.
Reward shape – Binary vs. dense rewards lead to very different behaviours; choose deliberately.
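The epsilon and degenerate-group advice above can be sketched in a few lines. This is an illustration under assumptions: rewards are plain floats, the population standard deviation is used (implementations differ on this), and the helper names and the eps value are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Whiten rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

def is_degenerate(rewards):
    """True when every rollout got the same reward, so there is no contrast."""
    return max(rewards) == min(rewards)

groups = {
    "prompt_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: useful signal
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # all failed: no signal
    "prompt_c": [1.0, 1.0, 1.0, 1.0],  # all passed: no signal
}
kept = {p: group_advantages(r) for p, r in groups.items() if not is_degenerate(r)}
print(list(kept))  # ['prompt_a']
```

With eps in the denominator a degenerate group yields all-zero advantages rather than NaN, so filtering it mainly saves rollout compute that would contribute no gradient.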
Variants such as DAPO, GSPO, and Dr. GRPO are comparatively minor tweaks; the core idea remains using a rollout group as the baseline.
From Preference to Verifiable Rewards
PPO + RLHF (2022‑2023) – Reward from a human‑trained RM; captures human preference; failure modes include flattery and reward hacking; bottleneck is human annotators.
DPO (2023‑2024) – Reward directly applied to preference pairs; captures the same signal without an RM; failure mode is lack of exploration; bottleneck is preference‑data quality.
GRPO + RLVR (2024‑2026) – Reward from a validator (test runner, math checker, regex, judge); captures provable correctness; failure modes include validator hacking and capability tunnel‑vision; bottleneck is validator design.
The dominant paradigm today is Reinforcement Learning with Verifiable Rewards (RLVR). Signals are binary test passes or exact‑match checks rather than subjective scores.
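As a minimal sketch of an exact-match validator, the function below returns a binary reward. The "Answer: ..." output convention and all names are assumptions for illustration; real validators (test runners, math checkers) are usually more forgiving about formatting.

```python
import re

# Hypothetical convention: the model ends its output with a line "Answer: <value>".
ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.MULTILINE)

def exact_match_reward(completion: str, reference: str) -> int:
    """Return 1 if the last stated answer equals the reference, else 0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0  # no parseable answer counts as incorrect
    return int(matches[-1].strip() == reference.strip())

print(exact_match_reward("6 * 7 = 42\nAnswer: 42", "42"))  # 1
print(exact_match_reward("I think it is 41\nAnswer: 41", "42"))  # 0
```

Treating unparseable output as a failure is a deliberate choice in this sketch: it gives the policy no reward for skipping the required format.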
2025 work on Binary Flexible Feedback (RLBFF) extracts verifiable binary principles from natural‑language feedback, aiming to combine RLHF’s coverage with RLVR’s precision.
Process vs. Outcome Rewards
Outcome Reward Model (ORM) – Each rollout receives a scalar (often binary) attached to the final answer (e.g., unit‑test pass, exact match).
Process Reward Model (PRM) – Each intermediate step receives a score, typically via a classifier trained on step‑level human annotations.
From a credit‑assignment perspective, PRM is superior for long chains because a single mistake can be localized and penalized at the step where it occurs. However, ORM is simpler and works well when the final answer provides a clear binary signal.
OpenAI’s Let’s Verify Step by Step (2023) showed that a PRM trained on roughly 800K human step‑level labels (the PRM800K dataset) outperforms ORM on best‑of‑N sampling for MATH problems. DeepSeek‑R1 later found PRM hard to scale and reverted to result‑only rewards, yet still exhibited strong stepwise reasoning.
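The difference between the two scoring styles shows up clearly in best-of-N selection. The sketch below uses made-up scores; aggregating step scores by their product is one common choice and an assumption of this example, as are all names.

```python
import math

def prm_solution_score(step_probs):
    """Aggregate per-step correctness probabilities into one solution score."""
    return math.prod(step_probs)

def best_of_n(candidates, score_fn):
    """Pick the candidate with the highest score under the given scorer."""
    return max(candidates, key=score_fn)

# Hypothetical scores for two sampled solutions to the same problem.
candidates = [
    {"name": "A", "outcome": 0.90, "steps": [0.95, 0.20, 0.95]},  # one shaky step
    {"name": "B", "outcome": 0.80, "steps": [0.90, 0.90, 0.90]},
]

by_orm = best_of_n(candidates, lambda c: c["outcome"])
by_prm = best_of_n(candidates, lambda c: prm_solution_score(c["steps"]))
print(by_orm["name"], by_prm["name"])  # A B
```

The outcome scorer sees only the final answer, while the step-level scorer penalizes the single doubtful step in candidate A.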
LLM‑Focused Multi‑Agent RL (MARL)
Self‑play on reasoning tasks is the cleanest MARL formulation: a model repeatedly plays against an improving copy, learning without human supervision.
SPIRAL – Trains a single LLM via multi‑round zero‑sum games (tic‑tac‑toe, Kuhn Poker, simple negotiation); reports up to 10 % gains on eight reasoning benchmarks.
SAGE – Runs four collaborative roles (Challenger, Planner, Solver, Critic) with minimal seed data; yields +8.9 % on LiveCodeBench and +10.7 % on OlympiadBench.
Agent Q‑Mix – Applies QMIX‑style value decomposition to a team of agents; achieves 20.8 % improvement on Humanity’s Last Exam with Gemini‑3.1‑Flash‑Lite.
Credit assignment is the central challenge: a team‑level reward (1 or 0) must be attributed to individual agents or steps. Three practical levers are commonly used:
Process rewards (PRM‑style) – train a validator that scores each agent’s contribution.
Value decomposition (VDN/QMIX/COMA family) – learn a joint value function and decompose it into per‑agent contributions.
Trajectory decomposition (LightningRL) – treat the multi‑agent system as a POMDP and propagate advantage through the trajectory graph.
Pure result‑only MARL is safe only when the team is tiny (2–3 agents), trajectories are short, and enough team‑level rollouts are collected to statistically separate contributions.
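To make value decomposition concrete, here is a tabular sketch of the simplest member of that family, VDN, where the joint value is the sum of per-agent values. It covers only a single terminal step with a shared team reward; the agent names, actions, and learning rate are hypothetical.

```python
def vdn_update(q_tables, actions, team_reward, lr=0.1):
    """One terminal-step VDN update on tabular per-agent Q-values.

    q_tables: one dict per agent mapping action -> Q_i(action)
    actions:  the action each agent actually took
    """
    # VDN assumption: the joint value is the sum of per-agent values.
    q_tot = sum(q[a] for q, a in zip(q_tables, actions))
    td_error = team_reward - q_tot  # terminal step, so the target is the reward
    for q, a in zip(q_tables, actions):
        # dQ_tot/dQ_i = 1, so every agent receives the same TD error
        q[a] += lr * td_error
    return td_error

planner = {"decompose": 0.0, "guess": 0.0}
solver = {"verify": 0.0, "skip": 0.0}
vdn_update([planner, solver], ["decompose", "verify"], team_reward=1.0)
print(planner["decompose"], solver["verify"])  # 0.1 0.1
```

QMIX replaces the plain sum with a learned monotonic mixing network, which this sketch does not attempt to show.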
Training Real Agents: Framework Landscape
Most production agents are assembled from frameworks such as LangChain, AutoGen, CrewAI, or Microsoft’s Agent Framework. Re‑writing these agents to fit a GRPO loop is undesirable, prompting the emergence of two open‑source stacks.
Idea 1 – Framework‑agnostic, observability‑driven (Agent‑Lightning)
Algorithm – Decides tasks and learns from results; supports RL, automatic prompt optimization (APO), and SFT.
Runner – Executes the agent using the existing framework unchanged.
LightningStore – Shared storage and message queue that coordinates algorithm and runner.
LightningRL provides hierarchical credit assignment for multi‑step trajectories, allowing selective optimization of individual agents within a multi‑agent system.
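The division of labor between algorithm, runner, and store can be illustrated with a toy. Everything below is hypothetical and is not Agent‑Lightning's API: the class, the function names, and the stand-in agent exist only to show how a shared store lets the agent code stay untouched.

```python
from collections import deque

class ToyStore:
    """Hypothetical stand-in for a shared task queue plus trace storage."""
    def __init__(self):
        self.tasks = deque()
        self.traces = []

    def enqueue_task(self, task):
        self.tasks.append(task)

    def next_task(self):
        return self.tasks.popleft() if self.tasks else None

    def add_trace(self, trace):
        self.traces.append(trace)

def existing_agent(prompt: str) -> str:
    # Stand-in for an unchanged agent built with any framework.
    return prompt.upper()

def runner_step(store, agent) -> bool:
    """Runner side: pull a task, run the agent as-is, record the result."""
    task = store.next_task()
    if task is None:
        return False
    output = agent(task["prompt"])
    reward = float(output == task["expected"])
    store.add_trace({"task_id": task["id"], "output": output, "reward": reward})
    return True

def algorithm_collect(store):
    """Algorithm side: read (task_id, reward) pairs to learn from."""
    return [(t["task_id"], t["reward"]) for t in store.traces]

store = ToyStore()
store.enqueue_task({"id": 1, "prompt": "hi", "expected": "HI"})
store.enqueue_task({"id": 2, "prompt": "ok", "expected": "no"})
while runner_step(store, existing_agent):
    pass
print(algorithm_collect(store))  # [(1, 1.0), (2, 0.0)]
```

The point of the pattern is that the training algorithm only ever sees tasks and traces, so swapping the agent framework does not touch the learning code.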
Idea 2 – Step‑level MDP, end‑to‑end focus (Agent‑R1)
Agent‑R1 (University of Science and Technology of China, March 2026, v0.1.0, ~1.4k GitHub stars) treats each interaction step as a first‑class RL transition with its own state, action, and observation.
Native support for process rewards, combined with PRIME‑style reward normalization, enabling clean mixing of process and result rewards.
A custom optimizer pipeline that has already spawned algorithms such as PaperScout’s PSPO (Proximal Sequence Policy Optimization), aligning token‑level optimization with sequence‑level agent interaction.
Agent‑R1 builds on the verl distributed training engine (ByteDance) and powers agents like PaperScout (academic search), TableMind (tool‑augmented table reasoning), and Cast‑R1 (agent‑style time‑series forecasting).
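A step-level transition and one simple way to mix process and result rewards might look like the sketch below. This is not Agent‑R1's API and does not implement PRIME‑style normalization; the dataclass, the weighting scheme, and the parameter names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class StepTransition:
    """One agent interaction step treated as an RL transition."""
    state: str          # context before the step
    action: str         # e.g. a tool call emitted by the model
    observation: str    # what the environment returned
    process_reward: float = 0.0

def step_returns(steps, outcome_reward, gamma=1.0, process_weight=0.5):
    """Return-to-go per step: weighted process rewards plus the outcome
    reward, which is credited at the final step."""
    returns, running = [], 0.0
    for i in reversed(range(len(steps))):
        reward = process_weight * steps[i].process_reward
        if i == len(steps) - 1:
            reward += outcome_reward
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

trajectory = [
    StepTransition("question", "search('GRPO')", "3 results", process_reward=1.0),
    StepTransition("question + results", "answer('...')", "done", process_reward=0.0),
]
print(step_returns(trajectory, outcome_reward=1.0))  # [1.5, 1.0]
```

Setting process_weight to zero recovers a pure result-reward setup, which makes the trade-off between the two signals easy to ablate.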
Choosing Between Stacks and Methods
Polite instruction following → SFT → DPO (cheap, stable, good for style/voice).
Refusal or safety behavior → DPO (preference pairs fit naturally).
Boosting math, code, or logical reasoning → GRPO + RLVR with result rewards (avoids costly step‑level annotation).
Training a tool‑oriented agent from scratch → GRPO with either Agent‑Lightning (if you already use LangChain/AutoGen) or Agent‑R1 (if you want step‑level MDP and native process rewards).
Want PPO‑style exploration but lack a critic → GRPO (half the memory).
Have a strong RM and large GPU budget → PPO (still best for some hard tasks).
Multi‑role workflows (planner/solver/critic) → start with single‑agent, then move to MARL; allocate credit via PRM, value decomposition, or trajectory decomposition.
Future Directions
RLHF will persist as a thin, specialized layer for style, tone, and refusal behavior, while most alignment work migrates to verifiable rewards.
Validator engineering will become its own discipline (sandbox engineers, judge designers, scoring calibrators).
Language‑model AlphaZero will materialize: strong foundation + self‑play + validator + tree search.
Long‑horizon agent RL (multi‑day browsing, coding, experimentation loops) will be the next leap, requiring RLVR on full trajectories.
Open‑source stacks (TRL, OpenRLHF, verl, Open‑Instruct, Agent‑Lightning, Agent‑R1, RAGEN, MARTI, FlexMARL) will continue narrowing the gap with well‑funded labs.
Reward hacking will become the central alignment challenge as models outpace imperfect validators.
Conclusion
Over the past five years the community has systematically removed components:
TRPO removed fragility.
PPO removed second‑order math.
DPO removed the reward model.
GRPO removed the critic.
Result rewards removed the need for step‑level annotation in single‑agent settings.
Agent‑level RL frameworks removed the need to rewrite agents for training.
MARL is removing static environments.
The remaining ingredients are a learner, a set of peer learners, and a verifiable signal. RL should no longer be treated as a side‑track to pre‑training and fine‑tuning: the next breakthrough in LLM capability will come not from larger Transformers but from smarter training loops built around them.
DeepHub IMBA
