Author

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

294 Articles

Likes

147 Views

Comments

Latest from Baobao Algorithm Notes

Showing up to the 100 most recent articles

Baobao Algorithm Notes

Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment

0 likes · 7 min read


Baobao Algorithm Notes

Aug 11, 2025 · Industry Insights

Why AI Infrastructure Must Be Close to Models and Hardware – Insights from Zhu Yibo

In a WAIC 2025 interview, Zhu Yibo, co‑founder of StepFun (Jieyue Xingchen), shares insights on AI infrastructure, covering its evolution, the need for tight model‑hardware co‑design, cost‑efficiency metrics, industry challenges, and future directions for large‑scale AI systems.

AI infrastructure · Industry Insights · Machine Learning

0 likes · 36 min read


Baobao Algorithm Notes

Aug 4, 2025 · Artificial Intelligence

Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size

This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind a 64‑dimensional attention head, the equal hidden and intermediate sizes, and other quirky parameters such as MLP bias and KV‑sink SWA, backed by theoretical formulas and empirical benchmarks.

Attention Head · GPT-OSS · MLP Ratio

0 likes · 13 min read


Baobao Algorithm Notes

Aug 1, 2025 · Artificial Intelligence

Why Training Large Language Models Feels Like Alchemy—and How to Master It

This article breaks down the hardware bottlenecks of large‑scale LLM training, explains the Roofline performance model, arithmetic intensity, and how computation and communication costs interact on GPUs and TPUs, offering concrete formulas and examples for efficient scaling.

Arithmetic intensity · Distributed computing · GPU

0 likes · 12 min read


Baobao Algorithm Notes

Aug 1, 2025 · Artificial Intelligence

Unlocking Qwen3-Coder-30B: Features, Fast Start, and Agentic Coding Guide

The article introduces Qwen3‑Coder‑30B‑A3B‑Instruct (aka Qwen3‑Coder‑Flash), detailing its architecture, 256K‑to‑1M token context, agentic coding capabilities, installation steps with Transformers, sample code for tool use, optimal sampling parameters, and deployment tips across various runtimes.

AI coding assistant · Agentic Coding · Large Language Model

0 likes · 6 min read


Baobao Algorithm Notes

Jul 29, 2025 · Artificial Intelligence

Qwen3‑30B‑A3B‑Instruct‑2507: New Instruction Model with Boosted General and Multilingual Skills

The Qwen3‑30B‑A3B‑Instruct‑2507 model, an updated non‑thinking version of Qwen3‑30B‑A3B, delivers significant gains in instruction following, reasoning, and multilingual knowledge coverage, and supports a 256K context length. The article benchmarks its performance against leading LLMs across a wide range of tasks.

Instruction Tuning · Mixture‑of‑Experts · Qwen3

0 likes · 6 min read


Baobao Algorithm Notes

Jul 28, 2025 · Industry Insights

Why AWS Bedrock AgentCore Signals a New Era for Agentic AI Infrastructure

The article analyzes AWS Bedrock AgentCore and related hardware and software requirements for Agentic AI, covering runtime isolation with microVMs, memory architectures, identity and gateway design, zero‑trust networking, and the challenges of multi‑tenant KVCache and context engineering.

AWS Bedrock · Infrastructure · Memory Management

0 likes · 15 min read


Baobao Algorithm Notes

Jul 18, 2025 · Artificial Intelligence

30+ Expert Q&A on Large Language Model Architecture, Training, and Deployment

This article compiles more than thirty interview‑style questions and detailed answers covering large‑model fundamentals such as encoder‑decoder trade‑offs, self‑attention versus RNN, context length, tokenization, embedding strategies, FlashAttention, RoPE, prompt design, retrieval‑augmented generation, safety measures, fine‑tuning, and model distillation, providing a comprehensive technical reference for practitioners.

attention mechanism · retrieval-augmented generation

0 likes · 53 min read


Baobao Algorithm Notes

Jul 17, 2025 · Artificial Intelligence

How QK-Clip Tames MaxLogit Explosions in Trillion‑Parameter LLMs

The article introduces QK-Clip, a lightweight per‑head weight‑clipping technique that uses the MaxLogit signal to prevent uncontrolled logit growth in massive LLMs, explains its design, compares it with prior methods, and shows that it stabilizes training without harming model performance.

Attention stability · LLM training · MaxLogit

0 likes · 15 min read


Baobao Algorithm Notes

Jul 16, 2025 · Artificial Intelligence

What Small Labs Reveal About RL Training: Multi‑Stage, Entropy, and Resource Strategies

The article analyzes Skywork OR1's technical report, detailing how small‑scale teams use GRPO‑based reinforcement learning with multi‑stage training, advantage‑mask variants, high‑temperature sampling, adaptive entropy loss, and resource‑allocation tricks to improve large language model performance while avoiding premature entropy collapse.

AI research · entropy control · multi-stage training

0 likes · 21 min read
