How Bengio’s TBA Decouples Sampling and Learning to Speed Up LLM RL by 50×

The article explains how large‑language‑model post‑training suffers from rollout bottlenecks, introduces the Trajectory Balance with Asynchrony (TBA) framework that separates a Searcher from a Trainer, reuses off‑policy trajectories via a Trajectory Balance objective, and demonstrates up to 50× speed‑ups while preserving or improving performance on math reasoning, preference fine‑tuning, and automated red‑team tasks.

Asynchronous TrainingLLMLarge Models

0 likes · 9 min read

How Bengio’s TBA Decouples Sampling and Learning to Speed Up LLM RL by 50×

Machine Learning Algorithms & Natural Language Processing

May 12, 2026 · Artificial Intelligence

Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL

Trajectory Balance with Asynchrony (TBA) separates sample generation (Searcher) from model updates (Trainer), uses a trajectory‑balance objective to incorporate off‑policy data, and achieves up to 50× speedup in large‑model RL post‑training while preserving or improving performance on math reasoning, preference fine‑tuning, and red‑team tasks.

Asynchronous TrainingLLMOff-Policy

0 likes · 10 min read

Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL

AI Explorer

Mar 6, 2026 · Artificial Intelligence

AReaL: Lightning‑Fast Asynchronous RL Engine for Building High‑Performance LLM Agents

AReaL, an open‑source, fully asynchronous reinforcement‑learning platform co‑developed by Tsinghua University and Ant Group, dramatically speeds up training of complex LLM agents, offering a simple, stable, and hardware‑flexible solution for developers seeking industrial‑grade AI agents.

AI infrastructureAReaLAsynchronous Training

0 likes · 7 min read

AReaL: Lightning‑Fast Asynchronous RL Engine for Building High‑Performance LLM Agents

Baobao Algorithm Notes

Feb 24, 2026 · Artificial Intelligence

The Bitter Lesson of Building Agentic RL in Terminal Environments

This article recounts the challenges of moving from single‑step RL with verifiable rewards to multi‑step agentic reinforcement learning in terminal environments, detailing infrastructure design, asynchronous pipelines, data quality checks, masking strategies, curriculum training, chunk‑based optimization, and practical lessons learned from large‑scale experiments.

Asynchronous TrainingCredit AssignmentEnvironment Augmentation

0 likes · 33 min read

The Bitter Lesson of Building Agentic RL in Terminal Environments

Alimama Tech

Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

Asynchronous Trainingattention mechanismslarge language models

0 likes · 23 min read

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards