Tagged articles

17 articles

May 23, 2026 · Artificial Intelligence

Why Can’t LLMs Directly Copy AlphaGo’s MCTS Success?

The article analyzes why large language models cannot simply adopt AlphaGo’s Monte‑Carlo Tree Search, highlighting credit‑assignment difficulties, gradient‑variance explosion in multi‑step RL, and how AlphaGo’s tight integration of value and policy networks amortizes search in a way LLMs cannot replicate.

AlphaGo · Credit Assignment · LLM

0 likes · 6 min read

Machine Heart

May 21, 2026 · Artificial Intelligence

Learning Adaptive Gaussian Sampling for 3D Generation: Density‑Sampled Gaussians (DeG) at SIGGRAPH 2026

The SIGGRAPH 2026 paper “Generative 3D Gaussians with Learned Density Control” introduces Density‑Sampled Gaussians (DeG), a differentiable framework that lets a model learn where to place Gaussian splats by sampling from a learned spatial density, enabling arbitrary‑budget, non‑uniform 3D representations with higher quality per cost.

3D Gaussian Splatting · Adaptive Sampling · Differentiable Rendering

0 likes · 14 min read

Machine Heart

May 10, 2026 · Artificial Intelligence

Sutton’s New Intentional Updates: Solving Streaming RL’s Major Flaw with a 1967 Formula

The article reviews the recent Intentional Updates framework, co‑authored by Turing laureate Richard Sutton, which redefines the step size in streaming reinforcement learning using a 1967 NLMS‑style formula. It covers the algorithmic design, the experimental validation, and the challenges that remain.

Policy Gradient · Sutton · intentional updates

0 likes · 11 min read

Data Party THU

May 4, 2026 · Artificial Intelligence

Understanding the Mathematical Foundations of Reinforcement Learning

This article provides a concise overview of a ten‑chapter reinforcement‑learning textbook, outlining the progression from basic concepts such as states and rewards to advanced algorithms like policy gradients and actor‑critic methods, and explains how each chapter builds on the previous ones.

Bellman equation · Monte Carlo · Policy Gradient

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Feb 24, 2026 · Artificial Intelligence

From Traditional RL to LLM‑RL: Theory Derivation and Engineering Improvements

The article walks through the fundamentals of traditional policy‑gradient reinforcement learning, derives the REINFORCE objective, maps its concepts to large‑language‑model RL, and then discusses practical engineering solutions such as GRPO, async rollout, importance‑sampling corrections, and token‑flow management for industrial‑scale training.

Async Rollout · GRPO · Importance Sampling

0 likes · 10 min read

Data Party THU

Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.

OpenAI Gym · Policy Gradient · Python

0 likes · 22 min read

Data Party THU

Oct 31, 2025 · Artificial Intelligence

How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks

The SPG algorithm introduces a sandwiched policy gradient that uses computable lower and upper evidence bounds to align reinforcement learning for discrete diffusion language models, achieving faster convergence, higher peaks, and lower variance on four major reasoning benchmarks.

Diffusion Language Model · EUBO · Policy Gradient

0 likes · 9 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · Policy Gradient

0 likes · 12 min read

Amap Tech

May 19, 2025 · Artificial Intelligence

Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning

The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.

AI research · LLM fine-tuning · Policy Gradient

0 likes · 7 min read

Baobao Algorithm Notes

Mar 20, 2025 · Artificial Intelligence

Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.

Algorithm Optimization · Deep RL · GAE

0 likes · 19 min read

Baobao Algorithm Notes

Mar 19, 2025 · Artificial Intelligence

Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?

The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.

GRPO · Loss Initialization · OpenR1

0 likes · 5 min read

Baobao Algorithm Notes

Nov 18, 2024 · Artificial Intelligence

Demystifying Actor‑Critic and PPO: From Policy Gradients to Practical RL

This article provides a thorough, step‑by‑step explanation of reinforcement‑learning theory, aimed at readers with little prior RL knowledge. It covers policy‑based objectives, value‑function definitions, the derivation of policy gradients, actor‑critic architecture, advantage estimation, importance sampling, GAE, and the PPO algorithm.

PPO · Policy Gradient · actor-critic

0 likes · 31 min read

Baidu Geek Talk

Aug 16, 2023 · Artificial Intelligence

Understanding Reinforcement Learning: From Basics to PPO and Policy Gradient

This article provides a comprehensive overview of reinforcement learning, covering fundamental concepts, differences from supervised learning, algorithm families, policy gradient methods, practical tricks like baselines and reward‑to‑go, and detailed explanations of TRPO and PPO variants with illustrative diagrams.

Machine Learning · PPO · Policy Gradient

0 likes · 19 min read

HomeTech

Nov 16, 2022 · Artificial Intelligence

Fundamentals and Policy Gradient Algorithms in Reinforcement Learning with Applications to Scene Text Recognition

This article introduces the basic concepts of reinforcement learning, derives model‑based and model‑free policy gradient methods—including vanilla policy gradient and Actor‑Critic—explains their mathematical foundations, and demonstrates their use in scene text recognition and image captioning tasks.

AI · Policy Gradient · actor-critic

0 likes · 22 min read

DaTaobao Tech

Aug 18, 2022 · Artificial Intelligence

Introduction to Deep Reinforcement Learning: Theory, Algorithms, and Applications

This article introduces deep reinforcement learning by explaining its Markov decision process foundations, then categorizes the main algorithm families—value‑based methods like DQN, policy‑based approaches such as PG/DPG/DDPG, and actor‑critic techniques including A3C, PPO, and DDPG—detailing their architectures, training procedures, and key advantages.

DQN · MDP · Policy Gradient

0 likes · 14 min read

GuanYuan Data Tech Team

Jul 28, 2022 · Artificial Intelligence

Unlocking Reinforcement Learning: Core Concepts, Algorithms, and Real‑World Applications

This article introduces reinforcement learning by defining agents, environments, rewards, and policies, explains key concepts such as Markov Decision Processes and Bellman equations, and surveys major algorithms—including dynamic programming, Monte‑Carlo, TD learning, policy gradients, Q‑learning, DQN, and evolution strategies—while highlighting practical challenges and notable case studies like AlphaGo Zero.

Evolution Strategies · MDP · Machine Learning

0 likes · 27 min read

Code DAO

Dec 3, 2021 · Artificial Intelligence

Understanding Actor‑Critic and A2C: From Policy Gradients to REINFORCE in RL

This article derives the policy‑gradient objective for discrete actions, implements the Monte‑Carlo REINFORCE algorithm in PyTorch, explains the actor‑critic framework, introduces Advantage Actor‑Critic (A2C) versus A3C, and demonstrates their performance on the OpenAI Gym CartPole‑v0 environment.

A2C · OpenAI Gym · Policy Gradient

0 likes · 13 min read
