Author

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

294

Articles

Likes

144

Views

Comments

Latest from Baobao Algorithm Notes

100 recent articles max

Baobao Algorithm Notes

Jul 10, 2025 · Industry Insights

Grok 4 Unveiled: Why xAI Claims Its New Model Beats the Competition

On July 10, xAI launched Grok 4, a multimodal LLM with a 256K‑token context window, tool‑use upgrades, and benchmark scores that xAI says surpass existing models. It is priced at $30/month for the standard tier and $300/month for the heavy tier.

AI benchmarks · Grok 4 · industry analysis

0 likes · 6 min read


Baobao Algorithm Notes

Jul 2, 2025 · Industry Insights

Why Baidu’s Open‑Source Ernie 4.5 Could Redefine the Global AI Race

Baidu has open‑sourced ten Ernie 4.5 models ranging from 0.3B to 424B parameters, featuring multimodal MoE pre‑training, advanced infrastructure, and post‑training tricks that deliver benchmark results surpassing DeepSeek‑V3 and OpenAI‑o1, sparking worldwide industry attention and reshaping AI competition.

AI competition · Baidu · Ernie

0 likes · 8 min read


Baobao Algorithm Notes

Jun 30, 2025 · Artificial Intelligence

How End‑to‑End Reinforcement Learning Powers the Kimi‑Researcher AI Agent

The article examines Kimi‑Researcher, an AI research agent built with end‑to‑end reinforcement learning, detailing its technical motivations, advantages over traditional workflow‑based and SFT methods, performance breakthroughs on benchmark exams, and diverse real‑world use cases ranging from literature reviews to legal analysis.

AI agent · End-to-End RL · Kimi Researcher

0 likes · 10 min read


Baobao Algorithm Notes

Jun 13, 2025 · Artificial Intelligence

How GVPO Improves LLM Fine‑Tuning: Stable, Sample‑Rich Policy Optimization

The article introduces GVPO (Group Variance Policy Optimization), a method whose unique optimum is the solution to KL‑constrained reward maximization. It supports diverse sampling distributions and addresses the instability and inefficiency of GRPO and traditional policy‑gradient approaches in large language model post‑training.

GVPO · KL constraint · Policy Optimization

0 likes · 9 min read


Baobao Algorithm Notes

Jun 9, 2025 · Industry Insights

Why AI Agents Won’t Quickly Deliver AGI: Data Gaps and Realistic Timelines

The article argues that despite rapid advances in large‑model benchmarks, the lack of real‑world data and suitable tasks creates a fundamental gap that will keep AI agents far from replacing 80% of white‑collar work for many years, making hype about imminent AGI unrealistic.

AGI · AI Agents · Autonomous Driving

0 likes · 11 min read


Baobao Algorithm Notes

Jun 6, 2025 · Artificial Intelligence

What AI Programming Agents Reveal About RL, Feedback Loops, and Long‑Context Challenges

This deep dive into the Cursor team's podcast follows core members as they dissect the current hurdles facing AI programming agents: feedback‑mechanism design, reward sparsity in reinforcement learning, tool‑chain integration, long‑context handling, and emerging attention mechanisms that shape the future of code‑centric AI.

AI programming · attention mechanisms · long-context

0 likes · 35 min read


Baobao Algorithm Notes

Jun 4, 2025 · Artificial Intelligence

Do Recent LLM‑RL Papers Overstate Their Gains? A Critical Review

This article critically examines seven high‑profile reinforcement‑learning papers for large language models, exposing flawed baseline evaluations, unrealistic settings, and modest actual improvements despite bold claims of dramatic performance gains.

AI research · LLM · baseline evaluation

0 likes · 8 min read


Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.

Re-distillation · Sample Effect · large language models

0 likes · 23 min read


Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Large Models · Megatron

0 likes · 16 min read


Baobao Algorithm Notes

May 26, 2025 · Artificial Intelligence

Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings

This article compares two recent papers that investigate why large reasoning models such as Llama and Qwen show degraded instruction‑following performance when using chain‑of‑thought prompting, analyzing attention patterns, training effects, and proposed mitigation strategies.

LLM · attention · chain-of-thought

0 likes · 11 min read
