Why Can’t LLMs Directly Copy AlphaGo’s MCTS Success?

The article analyzes why large language models cannot simply adopt AlphaGo’s Monte‑Carlo Tree Search, highlighting credit‑assignment difficulties, gradient‑variance explosion in multi‑step RL, and how AlphaGo’s tight integration of value and policy networks amortizes search in a way LLMs cannot replicate.

Machine Heart

May 23, 2026

Eric Jang, former VP of AI at 1X Technologies and ex‑DeepMind robotics scientist, explains that AlphaGo’s breakthrough stems from tightly coupling a shallow neural network with Monte‑Carlo Tree Search (MCTS). In each search cycle the value network predicts the win probability of a position, so the search can stop at a bounded depth instead of playing the game out, while the policy network proposes high‑potential moves, so only a few branches are expanded. Training the network on the results of that search is what lets a single forward pass approximate a tree that would otherwise have to be searched exhaustively.
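To make the division of labor concrete, here is a minimal sketch of a one‑ply search in which policy priors narrow the breadth and value estimates stand in for rollouts. It assumes a PUCT‑style selection rule; the article does not name the rule, and `Node`, `puct_score`, `search`, the constant `c_puct`, and the toy priors and values are all hypothetical names and numbers chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                      # policy-network probability of the move leading here
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        """Mean backed-up value; 0.0 for an unvisited node."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(parent: Node, child: Node, c_puct: float) -> float:
    # The exploration bonus is large for high-prior, rarely visited moves.
    bonus = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + bonus


def search(priors, leaf_values, simulations=200, c_puct=1.5):
    """One-ply search: priors narrow the breadth, leaf_values replace rollouts."""
    root = Node(prior=1.0)
    root.children = {move: Node(prior=p) for move, p in priors.items()}
    for _ in range(simulations):
        root.visits += 1
        move, child = max(root.children.items(),
                          key=lambda kv: puct_score(root, kv[1], c_puct))
        child.visits += 1
        # Assumed stand-in for the value network: a fixed estimate per move,
        # used instead of playing the game out to the end.
        child.value_sum += leaf_values[move]
    return {move: child.visits for move, child in root.children.items()}


# Toy numbers: move "a" has the highest prior, move "b" the highest value.
visits = search(priors={"a": 0.6, "b": 0.3, "c": 0.1},
                leaf_values={"a": 0.1, "b": 0.8, "c": -0.5})
print(visits)
```

In this toy setup the prior steers the first few simulations toward "a", but the value estimates quickly redirect most visits to "b". A full implementation would recurse below the root and alternate player perspectives, which this sketch leaves out.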

In contrast, current large language models (LLMs) are trained with policy‑gradient reinforcement learning, which suffers from a severe credit‑assignment problem and from gradient variance that grows quadratically, as O(T²), with trajectory length T. Assigning rewards to individual tokens in a multi‑step formulation adds interaction terms between steps that inflate the variance further.
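The effect of horizon length on the estimator can be seen in a small simulation. The sketch below is a toy under stated assumptions, not the setting Jang analyzes: a single shared logit, independent Bernoulli actions, and a terminal 0/1 reward are all invented for illustration, and the function names are hypothetical. It only shows that the variance of the naive score‑function estimate grows as T grows; it does not attempt to reproduce the O(T²) rate, which depends on the reward structure.

```python
import random
import statistics


def reinforce_sample(T, p, rng):
    """One naive REINFORCE gradient sample for a T-step trajectory."""
    actions = [1 if rng.random() < p else 0 for _ in range(T)]
    # With p = sigmoid(theta), d/dtheta log Bernoulli(a; p) = a - p,
    # so the trajectory score is the sum over all steps.
    score = sum(a - p for a in actions)
    # Assumed toy reward: 1 if more than half of the actions were 1, else 0.
    reward = 1.0 if sum(actions) > T / 2 else 0.0
    return reward * score


def estimator_variance(T, p=0.5, samples=5000, seed=0):
    """Empirical variance of the single-sample gradient estimate."""
    rng = random.Random(seed)
    draws = [reinforce_sample(T, p, rng) for _ in range(samples)]
    return statistics.pvariance(draws)


variances = {T: estimator_variance(T) for T in (1, 4, 16, 64)}
for T, v in variances.items():
    print(f"T={T:3d}  variance={v:.3f}")
```

Every step contributes a noisy score term, but only one scalar reward weights the whole sum, so longer trajectories produce noisier gradient estimates even though the reward itself stays bounded.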

To avoid these issues, LLM training typically treats an entire generated sequence as a single action (T=1) and works with the log‑probability of the whole sequence, which is the sum of its token log‑probabilities. Even with this simplification, the naive REINFORCE estimator retains high variance, so practitioners need millions of samples to extract useful supervision from such weak labels.
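The sequence‑as‑one‑action view can be sketched in a few lines. The example below computes the scalar REINFORCE surrogate whose gradient, in an autodiff framework, would be the policy gradient; here it only evaluates the number. The function names and the toy log‑probabilities are hypothetical, and the optional `baseline` argument is a common variance‑reduction device that is added here as an assumption, not something the article describes.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a whole sequence: the sum of its token log-probs."""
    return sum(token_logprobs)


def reinforce_surrogate(batch_token_logprobs, rewards, baseline=0.0):
    """Surrogate loss -mean((R - baseline) * log p(sequence)).

    Each sequence is one action (T=1), so every token in it shares the
    same scalar weight (R - baseline); no per-token credit is assigned.
    """
    terms = [
        (reward - baseline) * sequence_logprob(logprobs)
        for logprobs, reward in zip(batch_token_logprobs, rewards)
    ]
    return -sum(terms) / len(terms)


# Two sampled completions with made-up token log-probs and 0/1 rewards.
batch = [[-0.5, -1.0], [-0.2, -0.3, -0.5]]
rewards = [1.0, 0.0]

naive_loss = reinforce_surrogate(batch, rewards)
centered_loss = reinforce_surrogate(batch, rewards, baseline=0.5)
print(naive_loss, centered_loss)
```

Without a baseline, the second completion (reward 0) contributes nothing to the update, which illustrates how little supervision each sample carries.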

AlphaGo sidesteps credit‑assignment by using MCTS to generate improved action labels for every visited state, effectively acting as a superior teacher. This “search‑as‑teacher” approach, also employed in methods like Neural Fictitious Self‑Play (NFSP) and Q‑learning, replaces weak long‑trajectory credit signals with strong supervised targets derived from better planning.
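A minimal sketch of how search output can become a supervised target: root visit counts are normalized into a distribution and the policy network is trained toward it with a cross‑entropy loss, in the style of AlphaGo Zero. The function names, the `temperature` parameter, and the toy counts are assumptions made for illustration; the article itself only says that MCTS produces improved action labels.

```python
import math


def visit_counts_to_target(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts into a probability distribution over moves."""
    scaled = {move: n ** (1.0 / temperature) for move, n in visit_counts.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("at least one move must have a positive visit count")
    return {move: s / total for move, s in scaled.items()}


def cross_entropy(target, predicted, eps=1e-12):
    """Supervised loss that pulls the policy toward the search distribution."""
    return -sum(t * math.log(predicted[move] + eps) for move, t in target.items())


# Toy visit counts from a hypothetical search over three moves.
target = visit_counts_to_target({"a": 10, "b": 30, "c": 60})
uniform_policy = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

loss_before = cross_entropy(target, uniform_policy)  # untrained, uniform policy
loss_floor = cross_entropy(target, target)           # policy that matches the search
print(target, loss_before, loss_floor)
```

Every visited state yields a full distribution over moves as its label, rather than a single end‑of‑game reward shared by the whole trajectory.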

Because the search space of language generation is open‑ended and astronomically larger than Go’s, directly transplanting MCTS is infeasible. Jang suggests that LLMs must instead internalize search through compute substitution, that is, by embedding complex reasoning into the forward pass rather than relying on explicit tree structures.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
