Tagged articles

11 articles

May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a training framework from researchers at Fudan University, Shanghai Jiao Tong University, and the Shanghai AI Lab, gives a theoretical analysis of the flaws in length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, cuts inference token counts by more than 50% while improving accuracy across multiple benchmarks.

DECS · benchmark evaluation · inference efficiency

0 likes · 9 min read

Machine Heart

May 2, 2026 · Artificial Intelligence

RouteMoA: Dynamic Routing Without Pre‑Inference for Efficient Multi‑Agent Mixture

The paper introduces RouteMoA, a dynamic routing framework that predicts model capabilities before inference to avoid unnecessary computation, thereby cutting cost by 89.8% and latency by 63.6% while improving accuracy in large‑scale multi‑model pools.

Dynamic Routing · Mixture of Agents · Model selection

0 likes · 8 min read

AI Explorer

Apr 30, 2026 · Artificial Intelligence

Ant Opens Trillion-Parameter Ling-2.6: Hybrid Architecture for Fast Thinking

Ant Group’s AntBaiLing team has open‑sourced the trillion‑parameter Ling‑2.6‑1T model, introducing a hybrid architecture that routes simple queries through shallow paths and reserves deep layers for complex reasoning, aiming to boost inference speed and efficiency for real‑time business scenarios while confronting the deployment challenges of massive models.

AI · Hybrid Architecture · Large Language Model

0 likes · 6 min read

Tencent Technical Engineering

Apr 23, 2026 · Artificial Intelligence

Tencent Hunyuan Launches Hy3 Preview: Open‑Source Model Boosts Agent Performance

On April 23, Tencent released the open‑source Hy3 preview, a 295B‑parameter mixture‑of‑experts (MoE) model with 21B active parameters and a 256K context length. It delivers substantial gains in complex reasoning, instruction following, coding, and agent tasks, with 40% faster inference, lower costs, and strong benchmark results across Tencent’s AI products.

Benchmark Results · Hy3-preview · Large Language Model

0 likes · 9 min read

AntTech

Apr 23, 2026 · Artificial Intelligence

Ling-2.6-flash: Faster Response, Stronger Execution, and Higher Token Efficiency for Agent Workloads

Ling-2.6-flash is a 104B‑parameter Instruct model that uses a mixed‑linear architecture and token‑efficiency optimizations to achieve up to 340 tokens/s inference speed, 4× higher throughput than comparable models, and ten‑fold lower token consumption on Agent benchmarks, while maintaining SOTA performance.

Agent Optimization · LLM · benchmark

0 likes · 15 min read

AI Explorer

Mar 20, 2026 · Artificial Intelligence

Meta Agent Leak Triggers Zuckerberg’s Emergency Response and Signals New AI Strategy

Meta’s internal “Meta Agent” AI project was unexpectedly exposed, revealing a novel deep‑learning architecture focused on inference efficiency and multimodal understanding; the leak has sparked debate over whether it was an accident or a strategic signal in the escalating AI arms race, prompting Zuckerberg to act swiftly.

AI · AI competition · Meta

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

Mar 11, 2026 · Artificial Intelligence

Why LLMs Overthink: ICLR 2026 Study Reveals the Key Bottleneck in Inference Efficiency

The ICLR 2026 paper identifies reasoning miscalibration (overthinking easy steps and underthinking critical ones) as the root cause of runaway LLM inference costs. It proposes the Budget Allocation Model (BAM) and a training‑free Plan‑and‑Budget framework that allocates compute according to step difficulty, achieving up to 70% higher accuracy while cutting token usage by 39% and boosting the new E³ efficiency metric by 193.8%.

Budget Allocation Model · E3 Metric · Epistemic Uncertainty

0 likes · 12 min read

SuanNi

Feb 27, 2026 · Artificial Intelligence

Can Deep Thought Ratio Reveal the True Reasoning Power of LLMs?

This article introduces the Deep Thought Ratio (DTR) metric, explains how tracking token modifications across neural network layers quantifies genuine inference effort, and shows through extensive experiments that DTR predicts accuracy far better than token length while enabling a sampling strategy that halves computational cost.

AI metrics · LLM evaluation · Token analysis

0 likes · 9 min read

PMTalk Product Manager Community

Jan 31, 2026 · Industry Insights

Why Token Costs Matter: A Product Manager’s Guide to AI Scaling and Efficiency

The article analyzes how scaling laws still drive AI progress while product focus shifts toward low‑cost inference, explains how reasoning abilities create a positive feedback loop, and shows why token and power consumption have become the decisive factors for competitive AI services.

AI scaling · Industry Insight · inference efficiency

0 likes · 9 min read

AI2ML AI to Machine Learning

Dec 29, 2025 · Artificial Intelligence

How Brin’s Return Powers Google’s First ‘Sword’: The TPU Hardware Revolution

The article examines Google’s AI resurgence after Sergey Brin’s comeback, detailing the evolution of TPU hardware from v1 to v7, the strategic focus on algorithmic efficiency, comparisons with Nvidia’s B200, the role of JAX/XLA, and how these advances create a powerful competitive moat for Google’s AI infrastructure.

AI hardware · Google TPU · JAX

0 likes · 8 min read

Network Intelligence Research Center (NIRC)

Mar 26, 2025 · Artificial Intelligence

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.

LLM · Low-Rank Approximation · Model Compression

0 likes · 8 min read
