DeepSeek’s MCTS Failure: The ‘Roast Chicken and Baijiu’ Dilemma in LLM Training

The article examines why DeepSeek’s large‑model training cannot yet leverage Monte‑Carlo Tree Search, detailing its reliance on SFT, GRPO‑driven CoT activation and rejection sampling, contrasting this with Google’s PRM‑based approaches, and proposing an MCTS‑powered data‑generation pipeline to overcome the “roast chicken and baijiu” training dilemma.

GRPO · Large Language Models · Monte Carlo Tree Search

14 min read

AI2ML AI to Machine Learning

Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Draft-Target Model · Inference Acceleration · KV Cache

8 min read

Baobao Algorithm Notes

Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI alignment · DPO · PPO

22 min read

NewBeeNLP

Apr 1, 2024 · Artificial Intelligence

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning from Human Feedback (RLHF) pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.

Ghost Attention · Llama-2 · PPO

18 min read

Hulu Beijing

Mar 8, 2018 · Artificial Intelligence

Master Common Sampling Techniques: Inverse Transform, Rejection, Importance & MCMC

This article explains the core ideas and step-by-step procedures of widely used sampling methods—including inverse transform, rejection, importance, and Markov Chain Monte Carlo techniques such as Metropolis‑Hastings and Gibbs—highlighting their mathematical foundations, practical implementations, and when each method is appropriate.

Importance Sampling · MCMC · Monte Carlo

11 min read
