Why a Robust Training Pipeline Beats Fancy LLM Tricks – Lessons from DAPO

The article analyzes the DAPO technical report, showing how dynamic‑sampling pipelines and token‑level loss handling in SFT and RL training outperform ad‑hoc algorithm tricks, and compares the training dynamics of reinforce_baseline and GRPO with concrete code examples.

Dynamic Sampling · GRPO · LLM

0 likes · 8 min read

Architect

Feb 20, 2025 · Artificial Intelligence

Why Long CoT and In‑Context RL Are the Next Frontier for LLMs

The article analyzes recent breakthroughs such as OpenAI's o1, Long CoT, and test‑time search. It argues that enabling LLMs to perform self‑critique and reinforcement learning over long output sequences is essential for future AI performance, and warns against overly structured workflows.

AI research · In‑Context RL · LLM

0 likes · 12 min read

Code DAO

Dec 3, 2021 · Artificial Intelligence

Understanding Actor‑Critic and A2C: From Policy Gradients to REINFORCE in RL

This article derives the policy‑gradient objective for discrete actions, implements the Monte‑Carlo REINFORCE algorithm in PyTorch, explains the actor‑critic framework, introduces Advantage Actor‑Critic (A2C) versus A3C, and demonstrates their performance on the OpenAI Gym CartPole‑v0 environment.

A2C · OpenAI Gym · Policy Gradient

0 likes · 13 min read
