Collection size

98 articles

Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On‑Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while cutting compute cost by up to a factor of ten, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Knowledge Distillation · model efficiency · on-policy distillation

0 likes · 15 min read

Top Architect

Feb 14, 2025 · Artificial Intelligence

DeepSeek Model Distillation: Principles, Innovations, Architecture, and Performance

This article provides an in‑depth overview of DeepSeek’s model distillation technology, covering its definition, core principles, innovative data‑model distillation integration, architecture design, training strategies, performance gains, and the challenges of scaling to multimodal data.

DeepSeek · Knowledge Transfer · ai-optimization

0 likes · 16 min read

Baobao Algorithm Notes

Jun 28, 2024 · Artificial Intelligence

What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights

The article provides a detailed technical overview of Gemma 2, covering its decoder‑only transformer design, novel attention mechanisms, logit soft‑capping, RMSNorm, knowledge‑distillation training on trillions of tokens, extensive pre‑training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.

AI · Gemma 2 · benchmark evaluation

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Nov 5, 2024 · Artificial Intelligence

How DistilQwen2 Boosts LLM Performance with Knowledge Distillation

This article introduces DistilQwen2, a lightweight language model derived from Qwen2 via knowledge distillation, detailing its data collection, instruction‑data optimization, training strategies, extensive benchmark evaluations, and practical deployment guides for developers and enterprises.

AI · Instruction Tuning · Knowledge Distillation

0 likes · 21 min read

AntTech

Mar 11, 2024 · Artificial Intelligence

Can Small Language Models be Good Reasoners in Recommender Systems?

This article presents SLIM, a knowledge‑distillation framework that transfers the reasoning abilities of large language models to compact models for sequential recommendation, enhancing item representation, user profiling, and bias mitigation while achieving comparable performance with far lower computational resources.

AI · Efficiency · Knowledge Distillation

0 likes · 12 min read

Old Zhang's AI Learning

Mar 25, 2026 · Artificial Intelligence

Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy

The new Claude‑Opus‑4.6 distilled Qwen3.5‑v2 keeps code‑generation accuracy while cutting reasoning length by 24% and boosting per‑token correctness by 31.6%, offering a noticeable speed and cost advantage for local LLM deployment despite a 7.2% drop on MMLU‑Pro.

Claude Opus · distillation · local LLM deployment

0 likes · 7 min read

Architect

Feb 9, 2025 · Artificial Intelligence

How DeepSeek’s Model Distillation Boosts AI Efficiency and Performance

This article provides an in‑depth analysis of DeepSeek’s model distillation technology, covering its definition, core principles, innovative strategies, architecture design, training optimizations, benchmark results, efficiency gains, and the remaining challenges of applying distillation to large language models and multimodal data.

AI efficiency · DeepSeek · Knowledge Transfer

0 likes · 16 min read

Machine Learning Algorithms & Natural Language Processing

Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · Reward Shaping

0 likes · 16 min read

Machine Heart

May 18, 2026 · Artificial Intelligence

Can Large Models Reason Deeply with Only a Few Thinking Tokens?

The paper introduces Heima, a framework that compresses chain‑of‑thought reasoning into a small set of abstract “thinking tokens” for multimodal large models, dramatically reducing generated tokens while preserving inference capability, and provides an adaptive interpreter to reconstruct human‑readable reasoning for analysis.

Efficient Inference · chain-of-thought · latent reasoning

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 3, 2026 · Artificial Intelligence

Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

Compute Efficiency · JTok · Token-indexed scaling

0 likes · 16 min read

Bilibili Tech

Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Efficient Reasoning · LLM

0 likes · 13 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

Can LLMs Answer More Accurately While Writing Less? Introducing SHAPE’s Reasoning Tax

The SHAPE framework (Stage‑aware Hierarchical Advantage via Potential Estimation) adds a milestone‑based “reasoning tax” to large language model inference, providing step‑wise correctness signals and penalizing verbosity, which yields an average 3% accuracy gain and a 30% reduction in token consumption across multiple math‑reasoning benchmarks.

ACL 2026 · LLM · Mathematical Reasoning

0 likes · 10 min read

AI Frontier Lectures

Jun 9, 2025 · Artificial Intelligence

AI Research Highlights: Robo-DM, DeepKD, LLM Security, and Reasoning Innovations

This roundup presents recent AI breakthroughs, including Robo‑DM’s efficient robot dataset management, DeepKD’s decoupled knowledge‑distillation trainer, a novel informed white‑box attack exposing weaknesses in LLM alignment defenses, the RePPL hallucination detector, Self‑GIVE’s associative reasoning framework, and LLM‑driven RL ensemble methods.

AI · Knowledge Distillation · Reasoning

0 likes · 15 min read

Machine Heart

May 30, 2026 · Artificial Intelligence

How Abstract Symbols Cut AI Inference Cost by 11×

The article examines IBM Research's Abstract‑CoT approach, which replaces verbose natural‑language chain‑of‑thought reasoning with a compact abstract token vocabulary, achieving up to an 11‑fold reduction in inference tokens while maintaining comparable accuracy across math, instruction‑following, and multi‑hop QA benchmarks.

AI inference · Abstract-CoT · chain-of-thought

0 likes · 11 min read

Code Mala Tang

May 31, 2026 · Artificial Intelligence

Top 10 AI Papers This Week: SkillOpt, Agent Distillation, and Sleeping LLMs

This roundup reviews ten recent AI papers covering SkillOpt’s treatment of SKILL.md as trainable parameters, compiling whole agent pipelines into model weights, decentralized AI scientist teams, adding a "sleep" consolidation phase to LLMs, interface‑only fixes for frozen agents, reuse‑aware context‑cost strategies, evaluating AI’s ability to forecast scientific breakthroughs, agent aging benchmarks, the trade‑offs of complex harnesses, and multilingual food‑embedding models.

AI agents · Agent Aging · Agent Distillation

0 likes · 18 min read

Machine Learning Algorithms & Natural Language Processing

Apr 8, 2026 · Artificial Intelligence

Dissecting Gemma‑4’s Architecture and Training Choices: A Technical Comparison with Qwen‑3 and GLM‑5

This article breaks down every architectural and training decision behind Gemma‑4—KV sharing, p‑RoPE, per‑layer embeddings, and a dual‑path MoE + dense MLP—while contrasting its efficiency and performance with Qwen‑3 and GLM‑5 across benchmarks, quantization strategies, and RL pipelines.

GLM-5 · Gemma 4 · LLM architecture

0 likes · 23 min read

DataFunTalk

Dec 24, 2021 · Artificial Intelligence

Large-Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This article reviews three consecutive works from Alibaba DAMO Academy on compressing and distilling large pretrained language models—AdaBERT, L2A, and Meta‑KD—detailing their motivations, neural‑architecture‑search‑based designs, loss formulations, experimental results, and insights from a Q&A session.

AI · Knowledge Distillation · Model Compression

0 likes · 10 min read

Kuaishou Tech

May 14, 2026 · Artificial Intelligence

Open‑Source Kwai Summary Attention (KSA): A Sequence‑Compression Mechanism for Long‑Context Inference

KSA inserts learnable summary tokens to compress the KV cache by a factor of eight, enabling accurate long‑context retrieval with far lower memory and compute costs, and it consistently outperforms full‑attention and other hybrid methods on large‑scale benchmarks.

Efficient Inference · KSA · KV cache reduction

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Feb 26, 2026 · Artificial Intelligence

Why Longer Token Chains Don't Mean Better Reasoning: Google's Deep Thinking Ratio

Google’s recent study shows that the length of a model’s token chain is negatively correlated with inference accuracy, and introduces the Deep Thinking Ratio (DTR) metric to identify truly reasoning tokens, enabling the Think@n strategy to halve compute cost without sacrificing performance.

Deep Thinking Ratio · LLM · Think@n

0 likes · 6 min read

Machine Heart

May 13, 2026 · Artificial Intelligence

Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

Recent research by Tsinghua and collaborators dissects On‑Policy Distillation for large language models, revealing that higher‑scoring teachers often fail to improve students unless their thinking patterns align, detailing token‑level overlap dynamics, failure cases, and two practical remedies to rescue ineffective distillation.

Model Scaling · RL Post-Training · Teacher-Student Alignment

0 likes · 9 min read
