Tagged articles
20 articles
Baobao Algorithm Notes
May 22, 2026 · Artificial Intelligence

How LiteScale Cuts Wait Times in Large‑Model Post‑Training with Gradient Accumulation

The article examines the bottleneck of synchronous rollout in large‑model post‑training, proposes an asynchronous design using gradient accumulation and a global micro‑batch count to preserve loss equivalence, and introduces LogitsExpress for efficient top‑K knowledge‑distillation communication, all implemented in the lightweight LiteScale framework.

Knowledge Distillation · asynchronous rollout · distributed training
0 likes · 16 min read
Machine Heart
May 8, 2026 · Artificial Intelligence

How an 8B Video‑Language Model Beats GPT‑5 and Gemini‑3.1‑Pro at Cinematic Understanding

The CHAI framework introduced by CMU and Harvard defines a structured video‑language annotation scheme, scalable human‑AI oversight, and a post‑training pipeline that enables an 8B open‑source model to outperform closed‑source GPT‑5 and Gemini‑3.1‑Pro on professional cinematic techniques.

Qwen3-VL · Video Generation · annotation
0 likes · 11 min read
Xiaomi Tech
Apr 27, 2026 · Artificial Intelligence

Xiaomi‑Robotics‑0: 20‑Hour Post‑Training Enables Seamless Earphone‑Box Assembly (Open‑Source)

The article details how Xiaomi‑Robotics‑0 achieves precise earphone‑to‑case insertion after only 20 hours of post‑training, outlines the sub‑millimetre precision challenges, presents a triple‑strategy (asynchronous execution, adaptive loss re‑weighting, Λ‑shape attention mask and random masking) to avoid the "lazy effect", and releases the full pipeline and code as open source for the robotics community.

Embodied AI · Xiaomi Robotics · action prefixing
0 likes · 6 min read
Architect
Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: 1M‑Token Context’s Impact on Model, Inference, Cache & Agents

The DeepSeek V4 technical report shows how a 1 million‑token context forces a redesign of attention, KV‑cache, optimizer, quantization and inference budgeting, turning long‑context capability from a costly showcase into a production‑ready feature for agents, search and Chinese professional tasks.

1M context · Attention optimization · DeepSeek
0 likes · 28 min read
AIWalker
Apr 20, 2026 · Artificial Intelligence

How VA‑π Bridges Tokenizers and Autoregressive Generators for Pixel‑Perfect Images

VA‑π introduces a lightweight post‑training framework that uses variational inference and reinforcement learning to align tokenizers with visual autoregressive generators, achieving dramatic quality gains, extreme training efficiency, and robust pixel‑level reconstruction across diverse image generation tasks.

Autoregressive Models · Pixel Alignment · post-training
0 likes · 14 min read
Machine Heart
Apr 19, 2026 · Artificial Intelligence

World Engine: How Post‑Training Is Launching a New Era of Physical AGI

World Engine introduces a post‑training pipeline that combines high‑fidelity 3DGS simulation, hard‑case mining with diffusion generation, and reinforcement‑learning optimization to give autonomous‑driving models true decision‑making ability, surpassing data‑scaling limits and achieving significant safety gains in both industrial simulations and real‑world tests.

Autonomous Driving · Physical AI · Simulation
0 likes · 11 min read
Data Party THU
Apr 12, 2026 · Artificial Intelligence

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

AI alignment · GRPO · LLM
0 likes · 20 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods

The MIT‑CSAIL paper introduces RandOpt, a single‑step, gradient‑free, fully parallel post‑training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, matching or surpassing PPO/GRPO performance by exploiting dense "neural thickets" that emerge as model scale grows.

LLM · RandOpt · Scaling Law
0 likes · 12 min read
PaperAgent
Jan 8, 2026 · Artificial Intelligence

How SOP Enables Scalable Online Post-Training for Real‑World Robots

The SOP (Scalable Online Post‑training) framework redesigns VLA post‑training from offline, single‑machine, sequential processing to a distributed, parallel online system, allowing robot fleets to continuously learn, share experiences, and scale intelligence while maintaining stability and generalization in complex real‑world environments.

Online Learning · SOP · VLA
0 likes · 11 min read
Baobao Algorithm Notes
Nov 11, 2025 · Artificial Intelligence

Why Redesign the Training Stack? Inside Olmo‑Thinking’s Open‑Source RL Journey

This article provides a detailed technical analysis of the Olmo‑Thinking project, covering why a new open‑source LLM was built, the challenges of reinforcement learning at scale, data‑mix optimization, architectural bottlenecks such as missing GQA and QK‑Norm, and the post‑training techniques used to improve reasoning and long‑context capabilities.

Open-source models · RLVR · data selection
0 likes · 20 min read
Alibaba Cloud Developer
Jul 31, 2025 · Artificial Intelligence

Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs

This article explores the importance of post‑training for large language models, explains scaling laws for pre‑ and post‑training, details common fine‑tuning methods (full, PEFT, LoRA), outlines alignment techniques such as RLHF, DPO, and PPO, and presents practical workflows using Llama 3 and DeepSeek‑R1, while also discussing test‑time reasoning optimizations.

LLM · RLHF · alignment
0 likes · 19 min read
Alibaba Cloud Big Data AI Platform
Jun 25, 2025 · Artificial Intelligence

Boost Post‑Training Efficiency with Cosmos‑RL, Ray, and VeRL on Alibaba PAI

This article introduces Alibaba Cloud's PAI platform and demonstrates how open‑source reinforcement‑learning frameworks such as Cosmos‑RL, Ray, and VeRL accelerate post‑training for large language models, offering higher throughput, fault‑tolerance, and seamless integration for AI developers.

AI Platform · Open Source Frameworks · distributed training
0 likes · 9 min read
Baobao Algorithm Notes
Mar 21, 2025 · Artificial Intelligence

Unlocking LLM Reasoning: A Deep Dive into Post‑Training Techniques

This article provides a comprehensive technical overview of large language model post‑training, covering fine‑tuning methods (full, parameter‑efficient, LoRA families, prompt tuning), domain‑adaptive tuning, reinforcement‑learning reward modeling, process vs. outcome rewards, inference‑enhancement strategies, dynamic compute allocation, verifier‑augmented reasoning, current challenges, and emerging research directions such as meta‑cognition, physical reasoning, and swarm intelligence.

LLM · meta-cognition · post-training
0 likes · 21 min read
Baobao Algorithm Notes
Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM inference · OpenAI o1 · Reward Model
0 likes · 34 min read