How Far Can Unsupervised RL for Large Models Go? A Systematic Answer from a Tsinghua Team

The article analyzes the scaling limits of unsupervised reinforcement learning for large language models. It finds that intrinsic‑reward methods boost performance at first but inevitably collapse, proposes a unified theory and a model‑collapse metric for predicting trainability, and argues that external‑reward approaches are the scalable path forward.

AI research · RL scaling · external rewards

0 likes · 11 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Drawing on several months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward‑hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · reinforcement learning

0 likes · 7 min read

Architect

Feb 19, 2025 · Artificial Intelligence

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

The article critically examines whether the pre‑training Scaling Law still applies to Grok 3, compares its compute usage and model size with DeepSeek and OpenAI models, evaluates the cost‑effectiveness of pre‑training, RL, and test‑time scaling, and explores how these insights shape future large‑language‑model development strategies.

Grok-3 · Pre‑training · RL scaling

0 likes · 11 min read

Architect

Sep 28, 2024 · Artificial Intelligence

How Does OpenAI’s o1 Model Leverage Self‑Play RL and New Scaling Laws?

The article provides an in‑depth technical analysis of OpenAI’s multimodal o1 model, explaining its self‑play reinforcement‑learning pipeline, the novel train‑time and test‑time compute scaling laws, its long‑think reasoning abilities demonstrated through a cipher example, and speculative architectures for generator‑verifier systems.

OpenAI · RL scaling · inference

0 likes · 35 min read
