Collection size
98 articles
Alibaba Cloud Big Data AI Platform
Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Two papers from Alibaba Cloud’s AI platform PAI were recently accepted at EMNLP 2024: VideoCLIP‑XL, which improves video‑text representation for long descriptions using a large dataset of videos paired with long descriptions and new pre‑training tasks, and TAPIR, a curriculum‑planning framework that distills the instruction‑following abilities of large language models. The team also released the associated models, datasets, and integration tools.

EMNLP2024 · Multimodal · distillation
8 min read
Tencent Tech
Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AI · Inference Acceleration · Speculative Decoding
6 min read
Old Zhang's AI Learning
May 11, 2026 · Artificial Intelligence

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Ant Group's Ling‑2.6‑1T is a 1‑trillion‑parameter LLM built for token efficiency and fast thinking. The article reports strong results on demanding reasoning and agentic benchmarks, covers local deployment via vLLM or SGLang and a quantized 3.6‑bit version, and gives practical usage tips for developers and knowledge workers.

Agentic Model · Claude Code Integration · Ling-2.6-1T
12 min read
Baobao Algorithm Notes
May 26, 2026 · Artificial Intelligence

How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Knowledge Distillation · Model Compression · OPD
20 min read
Baobao Algorithm Notes
Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment
7 min read
Tencent Cloud Developer
Mar 3, 2022 · Artificial Intelligence

Model Distillation for Query-Document Matching: Techniques and Optimizations

We applied knowledge distillation to a video query‑document BERT matcher, compressing the 12‑layer teacher into production‑ready 1‑layer ALBERT and tiny TextCNN students using combined soft, hard, and relevance losses plus AutoML‑tuned hyper‑parameters, achieving sub‑5 ms latency and up to 2.4% AUC improvement over the original model.

ALBERT · AutoML · BERT
12 min read
DaTaobao Tech
Sep 27, 2023 · Artificial Intelligence

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

FlashAttention‑2 is an IO‑aware exact attention algorithm that cuts GPU HBM traffic through tiling and recomputation. It reduces non‑matmul FLOPs, parallelizes over the sequence dimension, and improves warp‑level work distribution, delivering up to 2× speedup over FlashAttention and near‑GEMM efficiency. This enables longer‑context Transformer training and inference, including AIGC use with fastunet, at negligible accuracy loss.

AIGC · Attention optimization · FlashAttention-2
20 min read
PaperAgent
Apr 26, 2026 · Artificial Intelligence

ICLR 2026 Outstanding Papers Reveal the Real Test for LLMs

The ICLR 2026 Outstanding Paper awards spotlight two studies—one proving Transformers are mathematically succinct and another showing that all major LLMs lose about 39% performance in multi‑turn conversations, exposing a reliability gap missed by single‑turn benchmarks.

AI benchmarks · ICLR 2026 · LLM evaluation
7 min read
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Artificial Intelligence · Linear Attention · Mamba
7 min read
DataFunTalk
Feb 28, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, and training strategies (including MoE, MLA, FP8, multi‑step learning‑rate scheduling, and reinforcement learning), along with extensive evaluation results and knowledge‑distillation techniques.

Mixture of Experts · scaling laws
36 min read
Machine Learning Algorithms & Natural Language Processing
Mar 11, 2026 · Artificial Intelligence

Why LLMs Overthink: ICLR 2026 Study Reveals the Key Bottleneck in Inference Efficiency

The ICLR 2026 paper identifies reasoning miscalibration (overthinking easy steps and underthinking critical ones) as the root cause of runaway LLM inference costs. It proposes the Budget Allocation Model (BAM) and a training‑free Plan‑and‑Budget framework that distributes compute across reasoning steps, achieving up to 70% higher accuracy while cutting token usage by 39% and boosting the new E³ efficiency metric by 193.8%.

Budget Allocation Model · E3 Metric · Epistemic Uncertainty
12 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
12 min read
Baobao Algorithm Notes
May 6, 2024 · Artificial Intelligence

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that, compared with DeepSeek 67B, reduces training cost by 42.5%, cuts KV‑cache usage by 93.3%, and boosts maximum generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

AI · DeepSeek-V2 · Large Language Model
11 min read
Alibaba Cloud Big Data AI Platform
Nov 8, 2024 · Artificial Intelligence

How TAPIR Boosts Small LLMs with Task‑Aware Curriculum Planning

The paper introduces TAPIR, a task‑aware curriculum planning framework that distills instruction‑following abilities from black‑box LLM teachers into smaller student models by filtering difficult prompts, resampling tasks, enhancing response styles, and iteratively optimizing across multiple training rounds, achieving superior performance on benchmark evaluations.

Curriculum Learning · Instruction Tuning · Knowledge Distillation
10 min read