Tagged articles

12 articles

Page 1 of 1

May 30, 2026 · Artificial Intelligence

Solving AdamW & Muon Instability: Pion Optimizer Updates Large Models on an Iso‑Spectral Manifold

The Pion optimizer leverages iso‑spectral manifold updates to preserve the spectral norm of weight matrices, eliminating additive‑update instability and enabling stable, efficient training of billion‑parameter LLMs across pre‑training, fine‑tuning, and reinforcement‑learning stages, outperforming AdamW and Muon.

AdamWLarge Language ModelsMuon

0 likes · 14 min read

Solving AdamW & Muon Instability: Pion Optimizer Updates Large Models on an Iso‑Spectral Manifold

Machine Learning Algorithms & Natural Language Processing

May 8, 2026 · Artificial Intelligence

T²PO: Uncertainty‑Guided Exploration Control for Stable Multi‑Turn Agent RL

The paper identifies inefficient exploration, termed "hesitation," as the root cause of instability in multi‑turn reinforcement learning for LLM agents and introduces T²PO, an uncertainty‑driven token‑ and turn‑level control framework that markedly improves training stability and performance on benchmarks such as WebShop, ALFWorld, and Search QA.

LLM agentsT2POexploration control

0 likes · 16 min read

T²PO: Uncertainty‑Guided Exploration Control for Stable Multi‑Turn Agent RL

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

The DeepSeek‑V4 technical report reveals that the model’s doubled training time stems from massive token and parameter scaling, severe training‑stability issues in MoE layers, and a suite of engineering solutions—including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster—while also exposing high hallucination rates despite impressive benchmark performance.

BenchmarkDeepSeek V4Generative Reward Model

0 likes · 12 min read

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

Shuge Unlimited

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

DeepSeek V4, released shortly after GPT‑5.5, offers two models—V4‑Pro (1.6 T parameters) and V4‑Flash (284 B parameters)—that introduce a hybrid CSA/HCA attention architecture to enable efficient million‑token context, achieve dramatic FLOPs and KV savings, deliver competitive programming and agent benchmarks, and adopt a disruptive pricing strategy, while also exposing training‑stability tricks and highlighting both strengths and remaining gaps.

BenchmarkDeepSeek V4Hybrid Attention

0 likes · 25 min read

DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

Machine Learning Algorithms & Natural Language Processing

Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable due to token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches; this article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.

Large Language ModelsOPDon-policy distillation

0 likes · 32 min read

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

PaperAgent

Jan 1, 2026 · Artificial Intelligence

How Manifold-Constrained Hyper-Connections Boost Large-Scale Model Training Efficiency

The article introduces mHC, a Manifold‑Constrained Hyper‑Connections technique that replaces standard residual links with multiple learned pathways, using double‑stochastic matrices to lock gradients, achieving stable training of 27‑billion‑parameter models with only 6.7% extra compute and superior performance across eight downstream benchmarks.

AI ArchitectureEfficient ImplementationManifold-Constrained

0 likes · 6 min read

How Manifold-Constrained Hyper-Connections Boost Large-Scale Model Training Efficiency

AI Frontier Lectures

Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance SamplingLarge Language ModelsMixture of Experts

0 likes · 12 min read

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

AI2ML AI to Machine Learning

Nov 3, 2025 · Artificial Intelligence

Smol Training Playbook: Secrets to Building World-Class LLMs

The article details the SmolLM3 3B‑parameter model, its architecture, dual‑mode inference, a three‑stage data‑curation strategy, rigorous ablation methods, preference optimisation (APO/DPO), model merging, and practical training‑stability tricks, offering a comprehensive guide for building high‑performing large language models.

APOLLM trainingcontext scaling

0 likes · 13 min read

Smol Training Playbook: Secrets to Building World-Class LLMs

Baobao Algorithm Notes

Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AILLMReward Modeling

0 likes · 11 min read

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

Data Party THU

Sep 10, 2025 · Industry Insights

MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.

Hardware DeploymentInference PerformanceMixture of Experts

0 likes · 13 min read

MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

AIWalker

Apr 13, 2025 · Artificial Intelligence

Huawei Pangu Ultra: 135B Ascend‑Native Dense LLM Without Nvidia GPUs

Huawei's Pangu Ultra introduces a 135‑billion‑parameter dense language model trained entirely on Ascend NPUs, detailing novel stability architectures, a domain‑aware tokenizer, multi‑stage pre‑training, extensive system optimizations, and benchmark results that surpass leading models such as Llama 405B and DeepSeek‑R1.

Ascend NPUDense ModelLarge Language Model

0 likes · 15 min read

Huawei Pangu Ultra: 135B Ascend‑Native Dense LLM Without Nvidia GPUs

Code DAO

Dec 5, 2021 · Artificial Intelligence

Why Neural Networks Need Batch Normalization: Principles and Mechanics

The article explains the principle behind Batch Normalization, why it is essential for training deep neural networks, how it standardizes activations, the role of learnable scale and shift parameters, the computation steps during training and inference, and discusses placement strategies within a model.

Batch Normalizationdeep learninggradient descent

0 likes · 9 min read

Why Neural Networks Need Batch Normalization: Principles and Mechanics