Collection size
98 articles
Page 3 of 5
Architect's Alchemy Furnace
Architect's Alchemy Furnace
Feb 6, 2025 · Artificial Intelligence

How Knowledge Distillation Powers Efficient Large‑Model Deployment

This article explains how knowledge distillation enables massive AI models to be compressed and deployed efficiently, covering its principles, classification dimensions, implementation steps, innovative practices at DeepSeek, real‑world applications, and future research directions.

Artificial IntelligenceDeepSeekKnowledge Distillation
0 likes · 11 min read
How Knowledge Distillation Powers Efficient Large‑Model Deployment
Architect's Guide
Architect's Guide
May 13, 2025 · Artificial Intelligence

DeepSeek Model Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges

This article provides a comprehensive overview of DeepSeek's model distillation technology, detailing its definition, key innovations, architecture, training methods, performance gains, and the remaining challenges such as the implicit performance ceiling and multimodal data distillation.

DeepSeekKnowledge Transferai-optimization
0 likes · 14 min read
DeepSeek Model Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 16, 2026 · Artificial Intelligence

Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering model architecture or compute budget, using a superposition phase that averages token embeddings into bags and predicts groups of tokens, followed by a standard recovery phase, as demonstrated on 10B‑parameter MoE and smaller models.

LLM pretrainingMCE lossMoE
0 likes · 10 min read
Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture
Data Party THU
Data Party THU
Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLMSpeculative DecodingToken Pruning
0 likes · 14 min read
TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models
Baobao Algorithm Notes
Baobao Algorithm Notes
May 26, 2025 · Artificial Intelligence

When Should Large Language Models Think? 10 Cutting‑Edge Strategies to Boost Reasoning Efficiency

This article reviews ten recent papers that tackle the over‑thinking problem in large language models by shortening chain‑of‑thought reasoning, introducing dynamic early‑exit, adaptive thinking triggers, and reinforcement‑learning‑based training, showing how models can maintain or improve accuracy while dramatically reducing token usage and latency.

AI researchModel Pruningadaptive inference
0 likes · 38 min read
When Should Large Language Models Think? 10 Cutting‑Edge Strategies to Boost Reasoning Efficiency
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Jul 23, 2025 · Artificial Intelligence

Unlock Efficient LLMs: How Alibaba’s PAI EasyDistill Powers Model Post‑Training

This article explains how Alibaba Cloud's AI platform PAI leverages the EasyDistill framework for post‑training model optimization, covering knowledge distillation concepts, data synthesis techniques, basic and advanced distillation training, the DistilQwen model family, real‑world customer cases, and step‑by‑step practical demos.

AI platformEasyDistillKnowledge Distillation
0 likes · 12 min read
Unlock Efficient LLMs: How Alibaba’s PAI EasyDistill Powers Model Post‑Training
Baobao Algorithm Notes
Baobao Algorithm Notes
Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI AgentsMultimodal LearningPerformance Optimization
0 likes · 23 min read
Must‑Read AI Agent and LLM Research Papers for Deep Understanding
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable due to token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches; this article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.

OPDlarge language modelson-policy distillation
0 likes · 32 min read
Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix
Architects' Tech Alliance
Architects' Tech Alliance
Feb 16, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks Bottlenecks and Boosts Multimodal AI Performance

This article provides an in‑depth technical analysis of DeepSeek’s model distillation technology, covering its core principles, innovative data‑model fusion strategies, architecture design, training optimizations, performance benchmarks, and the remaining challenges of scaling distillation to multimodal tasks.

DeepSeekai-optimizationlarge language models
0 likes · 16 min read
How DeepSeek’s Distillation Breaks Bottlenecks and Boosts Multimodal AI Performance
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPOKunlun P800LLaMA-Factory
0 likes · 32 min read
Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide
Baobao Algorithm Notes
Baobao Algorithm Notes
Mar 28, 2025 · Artificial Intelligence

Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO

This article critically examines R1‑Zero‑style training by analyzing foundation models and reinforcement learning, uncovering pre‑training and optimization biases, proposing an unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.

GRPOLLM evaluationR1-Zero
0 likes · 20 min read
Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO
DataFunTalk
DataFunTalk
Mar 16, 2022 · Artificial Intelligence

Parameter-Efficient Sparsity Training for the PLUG Large-Scale Language Model

This article presents the PLUG 270‑billion‑parameter Chinese language model and introduces a parameter‑efficient sparsity training (PST) framework that combines unstructured and structured pruning with low‑rank decomposition to dramatically reduce model size while preserving downstream performance.

Deep LearningPLUGParameter-Efficient Training
0 likes · 13 min read
Parameter-Efficient Sparsity Training for the PLUG Large-Scale Language Model
SuanNi
SuanNi
May 16, 2026 · Artificial Intelligence

Can a 4B Small Model Replace Top‑Tier Closed‑Source LLMs? Microsoft’s Terminus‑4B Cuts Token Use by 30%

Microsoft’s research shows that a 4‑billion‑parameter small model, Terminus‑4B, can act as an execution sub‑agent for terminal tasks, trimming token consumption by about 30% while preserving performance on demanding SWE‑Bench benchmarks, demonstrating a practical alternative to costly large models.

AI programmingRL trainingSWE-bench
0 likes · 7 min read
Can a 4B Small Model Replace Top‑Tier Closed‑Source LLMs? Microsoft’s Terminus‑4B Cuts Token Use by 30%
Machine Heart
Machine Heart
May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

DECSbenchmark evaluationinference efficiency
0 likes · 9 min read
DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy
Alimama Tech
Alimama Tech
Oct 29, 2025 · Artificial Intelligence

LLM Breakthroughs at EMNLP 2025: Embedding Compression, Complex Instructions, Knowledge Scaling

EMNLP 2025 in Suzhou showcases Taobao's booth featuring four cutting‑edge AI papers that introduce a novel embedding compression framework, an automatic iterative refinement method for complex instruction generation, a knowledge infusion scaling law for large language models, and a video caption optimization approach for text‑to‑video generation.

embedding compressioninstruction generationknowledge infusion
0 likes · 7 min read
LLM Breakthroughs at EMNLP 2025: Embedding Compression, Complex Instructions, Knowledge Scaling
Ops Community
Ops Community
Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

GPU optimizationOpenAI API CompatibilityPagedAttention
0 likes · 61 min read
How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching
SuanNi
SuanNi
Apr 21, 2026 · Artificial Intelligence

How Qwen3.6‑35B‑A3B Matches Dense Models with Only 30 B Active Parameters

The article analyzes Qwen3.6‑35B‑A3B’s MoE architecture, showing how its 30 B active parameters outperform larger dense models across programming, agent, and multimodal benchmarks, and examines the flagship Qwen3.6‑Max‑Preview’s substantial gains in world knowledge, instruction following, and third‑party rankings.

AI evaluationBenchmarkLarge Language Model
0 likes · 5 min read
How Qwen3.6‑35B‑A3B Matches Dense Models with Only 30 B Active Parameters
PaperAgent
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLMLinear Attentionlong-context
0 likes · 17 min read
How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits
DataFunSummit
DataFunSummit
Dec 21, 2021 · Artificial Intelligence

Large‑Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This talk presents Alibaba DAMO Academy’s recent work on compressing large pretrained language models, covering task‑adaptive AdaBERT, data‑augmented L2A, and meta‑knowledge distillation Meta‑KD, describing their motivations, architectures, NAS‑based search, loss designs, and experimental results across multiple NLP tasks.

Knowledge DistillationModel CompressionNLP
0 likes · 13 min read
Large‑Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD