Collection size

98 articles

Page 4 of 5

Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Alibaba Cloud’s AI platform PAI recently saw two papers accepted at EMNLP2024—VideoCLIP‑XL, which enhances video‑text representation for long descriptions using a large video‑long‑description dataset and novel pre‑training tasks, and TAPIR, a curriculum‑planning framework that distills instruction‑following abilities of large language models—while also releasing associated models, datasets, and integration tools for users.

EMNLP2024Multimodaldistillation

0 likes · 8 min read

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Tencent Tech

Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AIInference AccelerationSpeculative Decoding

0 likes · 6 min read

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

Baobao Algorithm Notes

Apr 16, 2024 · Artificial Intelligence

Merging Large Language Models Without GPUs: Task Vector, SLERP, TIES & DARE Explained

This article introduces four advanced model‑merging algorithms—Task Vector, SLERP, TIES, and DARE—explains their underlying principles, compares their strengths, and demonstrates a practical merge of Mistral‑7B, WizardMath‑7B and CodeLlama‑7B using the open‑source MergeKit toolkit.

AIDAREMergeKit

0 likes · 10 min read

Merging Large Language Models Without GPUs: Task Vector, SLERP, TIES & DARE Explained

Old Zhang's AI Learning

May 11, 2026 · Artificial Intelligence

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Ant Group's Ling‑2.6‑1T, a 1‑trillion‑parameter LLM built for token efficiency and fast‑thinking, outperforms on elite reasoning and agentic benchmarks, offers easy local deployment via vLLM or SGLang, provides a quantized 3.6‑bit version, and includes practical usage tips for developers and knowledge workers.

Agentic ModelClaude Code IntegrationLing-2.6-1T

0 likes · 12 min read

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Machine Heart

May 21, 2026 · Artificial Intelligence

Can Small Models Overthink? TaH Skips 93% Redundant Iterations and Boosts Accuracy

TaH, a selective latent‑iteration method for small language models, identifies and avoids unnecessary token‑level loops, cutting about 93% of extra iterations while delivering a stable 3.0%‑6.8% accuracy gain across nine math, QA, and code benchmarks.

LLM reasoningLooped TransformerTaH

0 likes · 14 min read

Can Small Models Overthink? TaH Skips 93% Redundant Iterations and Boosts Accuracy

Baobao Algorithm Notes

May 26, 2026 · Artificial Intelligence

How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Knowledge DistillationModel CompressionOPD

0 likes · 20 min read

How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

Baobao Algorithm Notes

Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-TuningGeneralizationLLM alignment

0 likes · 7 min read

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Nous Research introduces Token Superposition Training, which groups tokens into bags, averages their embeddings, and predicts token groups without altering model architecture or adding compute, achieving up to 2.5× faster pre‑training while maintaining standard inference.

LLM pretrainingMCE lossMoE

0 likes · 10 min read

Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Tencent Cloud Developer

Mar 3, 2022 · Artificial Intelligence

Model Distillation for Query-Document Matching: Techniques and Optimizations

We applied knowledge distillation to a video query‑document BERT matcher, compressing the 12‑layer teacher into production‑ready 1‑layer ALBERT and tiny TextCNN students using combined soft, hard, and relevance losses plus AutoML‑tuned hyper‑parameters, achieving sub‑5 ms latency and up to 2.4% AUC improvement over the original model.

ALBERTAutoMLBERT

0 likes · 12 min read

Model Distillation for Query-Document Matching: Techniques and Optimizations

Baobao Algorithm Notes

Nov 19, 2024 · Artificial Intelligence

Demystifying OpenRLHF Loss Functions: From GPTLM to KTO and Beyond

This article walks through the various loss functions used in OpenRLHF—including GPTLMLoss, KDLoss, DPOLoss, KTOLoss, and reward model losses—explaining their mathematical foundations, implementation details, and practical considerations for RLHF training.

DPOKTOKnowledge Distillation

0 likes · 23 min read

Demystifying OpenRLHF Loss Functions: From GPTLM to KTO and Beyond

DaTaobao Tech

Sep 27, 2023 · Artificial Intelligence

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

FlashAttention‑2 is an IO‑aware exact attention algorithm that cuts GPU HBM traffic through tiling and recomputation, optimizes non‑matmul FLOPs, expands sequence‑parallelism and warp‑level work distribution, delivering up to 2× speedup over FlashAttention, near‑GEMM efficiency, and enabling longer‑context Transformer training and inference for AIGC with fastunet and negligible accuracy loss.

AIGCAttention optimizationFlashAttention-2

0 likes · 20 min read

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

PaperAgent

Apr 26, 2026 · Artificial Intelligence

ICLR 2026 Outstanding Papers Reveal the Real Test for LLMs

The ICLR 2026 Outstanding Paper awards spotlight two studies—one proving Transformers are mathematically succinct and another showing that all major LLMs lose about 39% performance in multi‑turn conversations, exposing a reliability gap missed by single‑turn benchmarks.

AI benchmarksICLR 2026LLM evaluation

0 likes · 7 min read

ICLR 2026 Outstanding Papers Reveal the Real Test for LLMs

Machine Heart

Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Artificial IntelligenceLinear AttentionMamba

0 likes · 7 min read

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

DataFunTalk

Feb 28, 2025 · Artificial Intelligence

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, training strategies—including MoE, MLA, FP8, multi‑step learning‑rate scheduling, reinforcement learning, and extensive evaluation results, as well as knowledge‑distillation techniques.

Mixture of Expertsscaling laws

0 likes · 36 min read

DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

Machine Learning Algorithms & Natural Language Processing

Feb 10, 2026 · Artificial Intelligence

Why Self‑Distillation Is the 2026 Keyword for Continual Learning in Large Models

At the start of 2026, self‑distillation dominates the most cited LLM papers, offering a teacher‑free way for large models to continually acquire new knowledge while preserving existing capabilities.

Reasoningcontinual learninglarge language models

0 likes · 9 min read

Why Self‑Distillation Is the 2026 Keyword for Continual Learning in Large Models

Tencent Technical Engineering

Oct 31, 2025 · Artificial Intelligence

How SpecExit Cuts LLM Reasoning Chains by 66% and Boosts Inference Speed 2.5×

SpecExit combines speculative sampling with a lightweight draft model to predict early‑exit signals, shortening large‑reasoning model chains by up to two‑thirds and achieving up to 2.5× end‑to‑end inference acceleration on vLLM without sacrificing accuracy.

AI efficiencyEarly StoppingInference Optimization

0 likes · 12 min read

How SpecExit Cuts LLM Reasoning Chains by 66% and Boosts Inference Speed 2.5×

Machine Learning Algorithms & Natural Language Processing

Mar 11, 2026 · Artificial Intelligence

Why LLMs Overthink: ICLR2026 Study Reveals the Key Bottleneck in Inference Efficiency

The ICLR2026 paper identifies reasoning miscalibration—overthinking easy steps and underthinking critical ones—as the root cause of runaway LLM inference costs, and proposes the Budget Allocation Model (BAM) and a training‑free Plan‑and‑Budget framework that smartly distributes compute, achieving up to 70% higher accuracy while cutting token usage by 39% and boosting the new E³ efficiency metric by 193.8%.

Budget Allocation ModelE3 MetricEpistemic Uncertainty

0 likes · 12 min read

Why LLMs Overthink: ICLR2026 Study Reveals the Key Bottleneck in Inference Efficiency

AI Frontier Lectures

Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance SamplingMixture of Expertslarge language models

0 likes · 12 min read

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

Baobao Algorithm Notes

May 6, 2024 · Artificial Intelligence

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5 %, cuts KV‑cache usage by 93.3 %, and boosts generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

AIDeepSeek-V2Large Language Model

0 likes · 11 min read

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

Alibaba Cloud Big Data AI Platform

Nov 8, 2024 · Artificial Intelligence

How TAPIR Boosts Small LLMs with Task‑Aware Curriculum Planning

The paper introduces TAPIR, a task‑aware curriculum planning framework that distills instruction‑following abilities from black‑box LLM teachers into smaller student models by filtering difficult prompts, resampling tasks, enhancing response styles, and iteratively optimizing across multiple training rounds, achieving superior performance on benchmark evaluations.

Curriculum LearningInstruction TuningKnowledge Distillation

0 likes · 10 min read

How TAPIR Boosts Small LLMs with Task‑Aware Curriculum Planning