Tagged articles
7 articles
PaperAgent
May 14, 2026 · Artificial Intelligence

New Paradigm for LLM Alignment: Insights from Two Recent Anthropic Papers

Anthropic's two May papers reveal that simple SFT/RLHF is insufficient for safe LLMs; inserting a model‑spec mid‑training stage and synthetic‑document fine‑tuning dramatically reduces agentic misalignment, improves data efficiency, and enables models to reason about values before acting.

Agentic Misalignment · Anthropic · LLM alignment
0 likes · 13 min read
DataFunTalk
Apr 8, 2026 · Artificial Intelligence

Claude Mythos Preview Crushes Benchmarks and Reveals 27‑Year‑Old Zero‑Day

Anthropic's Claude Mythos Preview outperforms GPT‑5.4, Gemini 3.1 Pro and Opus 4.6 across dozens of AI benchmarks, autonomously discovers thousands of software vulnerabilities, exploits them without human guidance, and raises serious alignment and security concerns for the industry.

AI benchmarks · Anthropic · Claude Mythos
0 likes · 15 min read
Baobao Algorithm Notes
Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes why supervised fine-tuning (SFT) generalizes poorly for large language models, shows that the SFT gradient is equivalent to a policy gradient with a high-variance inverse-probability weight, proposes a one-line Dynamic Fine-Tuning correction, and reports substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment
0 likes · 7 min read
Baobao Algorithm Notes
Sep 10, 2024 · Artificial Intelligence

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article walks through the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF-PPO pipeline by training the policy directly on human preference data, which eliminates the separate reward model and simplifies training.

DPO · LLM alignment · Preference Optimization
0 likes · 17 min read
Baobao Algorithm Notes
Jul 9, 2024 · Artificial Intelligence

Why Step-Level DPO Is Revolutionizing LLM Math Reasoning

This article reviews recent step‑level DPO research, compares it with instance‑level DPO, explains the underlying Monte Carlo Tree Search formulation, and presents the author’s own replication experiments that demonstrate consistent performance gains across multiple LLM sizes on GSM8K and MATH benchmarks.

AI research · LLM alignment · MCTS
0 likes · 10 min read