Tagged articles

19 articles


May 30, 2026 · Artificial Intelligence

How USTC’s Tiny LCPO Training Cuts Large Model Overthinking in Half

The paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 training examples and 50 steps to teach large language models to produce concise, accurate answers, halving inference length while often improving accuracy and reducing training cost by up to two orders of magnitude.

Efficient Inference · LCPO · Low-Resource Training

0 likes · 8 min read


Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy

The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 curated examples and 50 training steps to cut large‑model chain‑of‑thought generation length by about 50% while maintaining or even improving answer accuracy, dramatically reducing training and inference costs.

Efficient Inference · LCPO · Low-Resource Training

0 likes · 8 min read


Machine Heart

May 19, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

Recent open‑weight LLMs such as Gemma 4, Laguna XS.2, ZAYA1‑8B, and DeepSeek V4 introduce KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, and compressed attention mechanisms that dramatically reduce memory and compute overhead for very long contexts while preserving model quality.

Efficient Inference · KV sharing · LLM

0 likes · 25 min read


Xiaomi Tech

May 18, 2026 · Artificial Intelligence

Xiaomi’s Imaging Algorithms Win CVPR 2026 NTIRE: Super‑Resolution, Portrait Restoration, Reflection Removal Breakthroughs

Xiaomi secured three top spots at CVPR 2026 NTIRE—first in Efficient Super‑Resolution with SPANV2, first in Portrait Restoration using a dual‑stage cascade, and second in Reflection Removal via RDNet‑XL and diffusion‑model distillation—showcasing hardware‑software co‑design, ultra‑fast inference, and novel algorithmic innovations.

Efficient Inference · diffusion model distillation · hardware-software co-design

0 likes · 14 min read


Machine Heart

May 18, 2026 · Artificial Intelligence

Can Large Models Reason Deeply with Only a Few Thinking Tokens?

The paper introduces Heima, a framework that compresses chain‑of‑thought reasoning into a small set of abstract “thinking tokens” for multimodal large models, dramatically reducing generated tokens while preserving reasoning capability, and provides an adaptive interpreter to reconstruct human‑readable reasoning for analysis.

Efficient Inference · chain-of-thought · latent reasoning

0 likes · 12 min read


Kuaishou Tech

May 14, 2026 · Artificial Intelligence

Open‑Source Kwai Summary Attention (KSA): A Sequence‑Compression Mechanism for Long‑Context Inference

KSA inserts learnable summary tokens to compress KV cache by a factor of eight, enabling accurate long‑context retrieval with far lower memory and compute costs, and it consistently outperforms full‑attention and other hybrid methods on large‑scale benchmarks.

Efficient Inference · KSA · KV cache reduction

0 likes · 13 min read


Machine Heart

Apr 26, 2026 · Artificial Intelligence

Balanced Thinking: Boost LLM Accuracy by 10% While Cutting Inference Length 35%

The paper introduces ReBalance, a training‑free two‑stage inference control framework that uses model confidence signals to dynamically balance reasoning depth, achieving up to a 10‑point accuracy gain and a 35.4% reduction in token length across multiple LLM sizes and benchmarks.

Balanced Thinking · Confidence Steering · Efficient Inference

0 likes · 9 min read


Machine Learning Algorithms & Natural Language Processing

Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · Reward Shaping

0 likes · 16 min read


Machine Heart

Apr 12, 2026 · Artificial Intelligence

LRT: Implicit Reasoning Chains Boost Speed and Accuracy by Removing Redundant Steps

Researchers introduce Latent Reasoning Tuning (LRT), a lightweight inference network that encodes explicit reasoning chains into fixed‑length latent vectors, eliminating thousands of decoding steps; experiments reveal substantial redundancy in traditional chains and demonstrate that LRT achieves faster, more accurate inference and outperforms existing efficient reasoning methods.

DeepSeek · Efficient Inference · Hybrid Reasoning

0 likes · 10 min read


PaperAgent

Apr 8, 2026 · Artificial Intelligence

How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs

This article examines the visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework—featuring Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that significantly lowers inference cost while preserving performance.

Efficient Inference · decoder-only architecture · dynamic computation

0 likes · 12 min read


Machine Learning Algorithms & Natural Language Processing

Feb 24, 2026 · Artificial Intelligence

How COMI Achieves 25‑Point Performance Gains at 32× Compression Using Marginal Information Gain (ICLR 2026)

The COMI framework introduces a marginal information gain metric and a coarse‑to‑fine adaptive compression strategy that preserves relevance and diversity, enabling 32× text compression while boosting downstream QA performance by up to 25 points and doubling inference speed.

Context Compression · Efficient Inference · Long-Context Retrieval

0 likes · 7 min read


AI Frontier Lectures

Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.

Efficient Inference · LLM hallucination · explainable AI

0 likes · 12 min read


PaperAgent

Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.

Efficient Inference · Engram · LLM Sparsity

0 likes · 8 min read


Xiaomi Tech

Dec 17, 2025 · Artificial Intelligence

Xiaomi MiMo-V2-Flash Open‑Source: Ultra‑Efficient Inference and Agent‑Ready Model

Xiaomi's MiMo-V2-Flash, a fully open‑source 309B MoE model with hybrid attention and Multi‑Token Prediction acceleration, posts top‑2 global agent benchmark scores, runs inference up to 2× faster, and costs only 2.5% as much as comparable closed‑source models.

Efficient Inference · Hybrid Attention · MOPD

0 likes · 7 min read


21CTO

Nov 4, 2025 · Artificial Intelligence

LongCat-Flash-Omni: How an Open-Source 560B Model Achieves Real-Time Multimodal Mastery

LongCat-Flash-Omni, an open‑source 560 billion‑parameter multimodal model, combines efficient Shortcut‑Connected MoE architecture with advanced perception and speech modules to deliver low‑latency real‑time audio‑video interaction and state‑of‑the‑art performance across text, image, video, and audio tasks.

Efficient Inference · Large Language Model · Real-Time Interaction

0 likes · 10 min read


AntTech

Oct 29, 2025 · Artificial Intelligence

Inside Ant’s Baoling: Balancing Efficiency and Reasoning in a 1‑Trillion‑Parameter Model

At the Ant Star Innovation Journey event, the Baoling team unveiled their roadmap for trillion‑parameter models, detailing the development of Ling‑1T, Ring‑1T and multimodal Ming series, the scaling‑law‑guided architecture, training innovations, evaluation methods, and open‑source releases that aim to advance efficient, high‑performance AI.

Efficient Inference · Large Language Model · Scaling Law

0 likes · 24 min read


Meituan Technology Team

Oct 9, 2025 · Artificial Intelligence

How VSRM Cuts Redundant Reasoning Steps in Large Language Models

The paper introduces VSRM, a verifiable step‑reward mechanism that penalizes ineffective reasoning steps and rewards useful ones in large language model inference, dramatically shortening output length while preserving or even improving performance across multiple benchmarks and reinforcement‑learning algorithms.

AI · Efficient Inference · large-language-models

0 likes · 10 min read


AntTech

Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16B MoE language model that activates only 1.4B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.

Efficient Inference · FP8 training · MoE

0 likes · 6 min read


Alibaba Cloud Big Data AI Platform

Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑layer distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.

Efficient Inference · Knowledge Distillation · LLM

0 likes · 26 min read
