Collection size: 98 articles
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance at up to ten times lower compute cost, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Knowledge Distillation · model efficiency · on-policy distillation
0 likes · 15 min read
Top Architect
Feb 14, 2025 · Artificial Intelligence

DeepSeek Model Distillation: Principles, Innovations, Architecture, and Performance

This article provides an in‑depth overview of DeepSeek’s model distillation technology, covering its definition, core principles, innovative data‑model distillation integration, architecture design, training strategies, performance gains, and the challenges of scaling to multimodal data.

DeepSeek · Knowledge Transfer · ai-optimization
0 likes · 16 min read
Baobao Algorithm Notes
Jun 28, 2024 · Artificial Intelligence

What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights

The article provides a detailed technical overview of Gemma 2, covering its decoder‑only transformer design, novel attention mechanisms, logit soft‑capping, RMSNorm, knowledge‑distillation training on trillions of tokens, extensive pre‑training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.

AI · Gemma 2 · benchmark evaluation
0 likes · 14 min read
AntTech
Mar 11, 2024 · Artificial Intelligence

Can Small Language Models be Good Reasoners in Recommender Systems?

This article presents SLIM, a knowledge‑distillation framework that transfers the reasoning abilities of large language models to compact models for sequential recommendation, enhancing item representation, user profiling, and bias mitigation while achieving comparable performance with far lower computational resources.

AI · Efficiency · Knowledge Distillation
0 likes · 12 min read
Architect
Feb 9, 2025 · Artificial Intelligence

How DeepSeek’s Model Distillation Boosts AI Efficiency and Performance

This article provides an in‑depth analysis of DeepSeek’s model distillation technology, covering its definition, core principles, innovative strategies, architecture design, training optimizations, benchmark results, efficiency gains, and the remaining challenges of applying distillation to large language models and multimodal data.

AI efficiency · DeepSeek · Knowledge Transfer
0 likes · 16 min read
Machine Learning Algorithms & Natural Language Processing
Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · Reward Shaping
0 likes · 16 min read
Machine Heart
May 18, 2026 · Artificial Intelligence

Can Large Models Reason Deeply with Only a Few Thinking Tokens?

The paper introduces Heima, a framework that compresses chain‑of‑thought reasoning into a small set of abstract “thinking tokens” for multimodal large models, dramatically reducing generated tokens while preserving inference capability, and provides an adaptive interpreter to reconstruct human‑readable reasoning for analysis.

Efficient Inference · chain-of-thought · latent reasoning
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

Compute Efficiency · JTok · Token-indexed scaling
0 likes · 16 min read
Bilibili Tech
Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Efficient Reasoning · LLM
0 likes · 13 min read
Machine Heart
Apr 28, 2026 · Artificial Intelligence

Can LLMs Answer More Accurately While Writing Less? Introducing SHAPE’s Reasoning Tax

The SHAPE framework (Stage‑aware Hierarchical Advantage via Potential Estimation) adds a milestone‑based “reasoning tax” to large language model inference, providing step‑wise correctness signals and penalizing verbosity, which yields an average 3% accuracy gain and a 30% reduction in token consumption across multiple math‑reasoning benchmarks.

ACL 2026 · LLM · Mathematical Reasoning
0 likes · 10 min read
AI Frontier Lectures
Jun 9, 2025 · Artificial Intelligence

AI Research Highlights: Robo-DM, DeepKD, LLM Security, and Reasoning Innovations

This roundup presents recent AI breakthroughs, including Robo‑DM’s efficient robot dataset management, DeepKD’s decoupled knowledge‑distillation trainer, a novel informed white‑box attack exposing weaknesses in LLM alignment defenses, the RePPL hallucination detector, Self‑GIVE’s associative reasoning framework, and LLM‑driven RL ensemble methods.

AI · Knowledge Distillation · Reasoning
0 likes · 15 min read
Machine Heart
May 30, 2026 · Artificial Intelligence

How Abstract Symbols Cut AI Inference Cost by 11×

The article examines IBM Research's Abstract‑CoT approach, which replaces verbose natural‑language chain‑of‑thought reasoning with a compact abstract token vocabulary, achieving up to an 11‑fold reduction in inference tokens while maintaining comparable accuracy across math, instruction‑following, and multi‑hop QA benchmarks.

AI inference · Abstract-CoT · chain-of-thought
0 likes · 11 min read
Code Mala Tang
May 31, 2026 · Artificial Intelligence

Top 10 AI Papers This Week: SkillOpt, Agent Distillation, and Sleeping LLMs

This roundup reviews ten recent AI papers covering SkillOpt’s treatment of SKILL.md as trainable parameters, the compilation of whole agent pipelines into model weights, decentralized AI scientist teams, a "sleep" consolidation phase for LLMs, interface‑only fixes for frozen agents, reuse‑aware context‑cost strategies, evaluations of AI’s ability to forecast scientific breakthroughs, agent aging benchmarks, the trade‑offs of complex harnesses, and multilingual food‑embedding models.

AI agents · Agent Aging · Agent Distillation
0 likes · 18 min read
Machine Learning Algorithms & Natural Language Processing
Apr 8, 2026 · Artificial Intelligence

Dissecting Gemma‑4’s Architecture and Training Choices: A Technical Comparison with Qwen‑3 and GLM‑5

This article breaks down every architectural and training decision behind Gemma‑4—KV sharing, p‑RoPE, per‑layer embeddings, and a dual‑path MoE + dense MLP—while contrasting its efficiency and performance with Qwen‑3 and GLM‑5 across benchmarks, quantization strategies, and RL pipelines.

GLM-5 · Gemma 4 · LLM architecture
0 likes · 23 min read
DataFunTalk
Dec 24, 2021 · Artificial Intelligence

Large-Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This article reviews three consecutive works from Alibaba DAMO Academy on compressing and distilling large pretrained language models—AdaBERT, L2A, and Meta‑KD—detailing their motivations, neural‑architecture‑search‑based designs, loss formulations, experimental results, and insights from a Q&A session.

AI · Knowledge Distillation · Model Compression
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Feb 26, 2026 · Artificial Intelligence

Why Longer Token Chains Don't Mean Better Reasoning: Google's Deep Thinking Ratio

Google’s recent study shows that the length of a model’s token chain is negatively correlated with inference accuracy, and introduces the Deep Thinking Ratio (DTR) metric to identify the tokens that reflect genuine reasoning, enabling the Think@n strategy to halve compute cost without sacrificing performance.

Deep Thinking Ratio · LLM · Think@n
0 likes · 6 min read
Machine Heart
May 13, 2026 · Artificial Intelligence

Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

Recent research by Tsinghua and collaborators dissects On‑Policy Distillation for large language models, revealing that higher‑scoring teachers often fail to improve students unless their thinking patterns align, detailing token‑level overlap dynamics, failure cases, and two practical remedies to rescue ineffective distillation.

Model Scaling · RL Post-Training · Teacher-Student Alignment
0 likes · 9 min read