Tagged articles
19 articles
Data Party THU
May 30, 2026 · Artificial Intelligence

How USTC’s Tiny LCPO Training Cuts Large Model Overthinking in Half

The paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 training examples and 50 steps to teach large language models to produce concise, accurate answers, halving inference length while often improving accuracy and reducing training cost by up to two orders of magnitude.

Efficient Inference · LCPO · Low-Resource Training
8 min read
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy

The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 curated examples and 50 training steps to cut large‑model chain‑of‑thought generation length by about 50% while maintaining or even improving answer accuracy, dramatically reducing training and inference costs.

Efficient Inference · LCPO · Low-Resource Training
8 min read
Machine Heart
May 19, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

Recent open‑weight LLMs such as Gemma 4, Laguna XS.2, ZAYA1‑8B, and DeepSeek V4 introduce KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, and compressed attention mechanisms that dramatically reduce memory and compute overhead for very long contexts while preserving model quality.

Efficient Inference · KV sharing · LLM
25 min read
Xiaomi Tech
May 18, 2026 · Artificial Intelligence

Xiaomi’s Imaging Algorithms Win CVPR 2026 NTIRE: Super‑Resolution, Portrait Restoration, Reflection Removal Breakthroughs

Xiaomi secured three top spots at CVPR 2026 NTIRE—first in Efficient Super‑Resolution with SPANV2, first in Portrait Restoration using a dual‑stage cascade, and second in Reflection Removal via RDNet‑XL and diffusion‑model distillation—showcasing hardware‑software co‑design, ultra‑fast inference, and novel algorithmic innovations.

Efficient Inference · diffusion model distillation · hardware-software co-design
14 min read
Machine Heart
May 18, 2026 · Artificial Intelligence

Can Large Models Reason Deeply with Only a Few Thinking Tokens?

The paper introduces Heima, a framework that compresses chain‑of‑thought reasoning into a small set of abstract “thinking tokens” for multimodal large models, dramatically reducing generated tokens while preserving inference capability, and provides an adaptive interpreter to reconstruct human‑readable reasoning for analysis.

Efficient Inference · chain-of-thought · latent reasoning
12 min read
Machine Heart
Apr 26, 2026 · Artificial Intelligence

Balanced Thinking: Boost LLM Accuracy by 10% While Cutting Inference Length 35%

The paper introduces ReBalance, a training‑free two‑stage inference control framework that uses model confidence signals to dynamically balance reasoning depth, achieving up to a 10‑point accuracy gain and a 35.4% reduction in token length across multiple LLM sizes and benchmarks.

Balanced Thinking · Confidence Steering · Efficient Inference
9 min read
Machine Learning Algorithms & Natural Language Processing
Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · Reward Shaping
16 min read
Machine Heart
Apr 12, 2026 · Artificial Intelligence

LRT: Implicit Reasoning Chains Boost Speed and Accuracy by Removing Redundant Steps

Researchers introduce Latent Reasoning Tuning (LRT), a lightweight inference network that encodes explicit reasoning chains into fixed‑length latent vectors, eliminating thousands of decoding steps; experiments reveal substantial redundancy in traditional chains and demonstrate that LRT achieves faster, more accurate inference and outperforms existing efficient reasoning methods.

DeepSeek · Efficient Inference · Hybrid Reasoning
10 min read
PaperAgent
Apr 8, 2026 · Artificial Intelligence

How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs

This article examines the visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework—featuring Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that significantly lowers inference cost while preserving performance.

Efficient Inference · decoder-only architecture · dynamic computation
12 min read
Machine Learning Algorithms & Natural Language Processing
Feb 24, 2026 · Artificial Intelligence

How COMI Achieves 25‑Point Performance Gains at 32× Compression Using Marginal Information Gain (ICLR 2026)

The COMI framework introduces a marginal information gain metric and a coarse‑to‑fine adaptive compression strategy that preserves relevance and diversity, enabling 32× text compression while boosting downstream QA performance by up to 25 points and doubling inference speed.

Context Compression · Efficient Inference · Long-Context Retrieval
7 min read
AI Frontier Lectures
Feb 10, 2026 · Artificial Intelligence

Can an 8B Model Outperform GPT‑4 in Faithfulness Detection? Inside FaithLens

FaithLens is an 8‑billion‑parameter model that surpasses GPT‑4.1 and other large models on twelve hallucination‑detection benchmarks while providing high‑quality natural‑language explanations, thanks to a novel data‑synthesis pipeline, three‑dimensional filtering, and rule‑based reinforcement learning.

Efficient Inference · LLM hallucination · explainable AI
12 min read
PaperAgent
Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.

Efficient Inference · Engram · LLM Sparsity
8 min read
21CTO
Nov 4, 2025 · Artificial Intelligence

LongCat-Flash-Omni: How an Open-Source 560B Model Achieves Real-Time Multimodal Mastery

LongCat-Flash-Omni, an open‑source 560 billion‑parameter multimodal model, combines efficient Shortcut‑Connected MoE architecture with advanced perception and speech modules to deliver low‑latency real‑time audio‑video interaction and state‑of‑the‑art performance across text, image, video, and audio tasks.

Efficient Inference · Large Language Model · Real-Time Interaction
10 min read
AntTech
Oct 29, 2025 · Artificial Intelligence

Inside Ant’s Baoling: Balancing Efficiency and Reasoning in a 1‑Trillion‑Parameter Model

At the Ant Star Innovation Journey event, the Baoling team unveiled their roadmap for trillion‑parameter models, detailing the development of Ling‑1T, Ring‑1T and multimodal Ming series, the scaling‑law‑guided architecture, training innovations, evaluation methods, and open‑source releases that aim to advance efficient, high‑performance AI.

Efficient Inference · Large Language Model · Scaling Law
24 min read
Meituan Technology Team
Oct 9, 2025 · Artificial Intelligence

How VSRM Cuts Redundant Reasoning Steps in Large Language Models

The paper introduces VSRM, a verifiable step‑reward mechanism that penalizes ineffective reasoning steps and rewards useful ones in large language model inference, dramatically shortening output length while preserving or even improving performance across multiple benchmarks and reinforcement‑learning algorithms.

AI · Efficient Inference · large-language-models
10 min read
AntTech
Sep 11, 2025 · Artificial Intelligence

Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters

Ling-mini-2.0, an open-source 16B MoE language model that activates only 1.4B parameters, achieves dense-level performance with 7× efficiency, generates over 300 tokens/s, and introduces the first FP8 mixed-precision training suite, offering multiple pre-training checkpoints for the AI community.

Efficient Inference · FP8 training · MoE
6 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑layer distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.

Efficient Inference · Knowledge Distillation · LLM
26 min read