Tagged articles
11 articles
Machine Heart
May 12, 2026 · Artificial Intelligence

DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

DECS · benchmark evaluation · inference efficiency
0 likes · 9 min read
AI Explorer
Apr 30, 2026 · Artificial Intelligence

Ant Opens Trillion-Parameter Ling-2.6: Hybrid Architecture for Fast Thinking

Ant Group’s AntBaiLing team has open‑sourced the trillion‑parameter Ling‑2.6‑1T model, introducing a hybrid architecture that routes simple queries through shallow paths and reserves deep layers for complex reasoning, aiming to boost inference speed and efficiency for real‑time business scenarios while confronting the deployment challenges of massive models.

AI · Hybrid Architecture · Large Language Model
0 likes · 6 min read
Tencent Technical Engineering
Apr 23, 2026 · Artificial Intelligence

Tencent Hunyuan Launches Hy3 Preview: Open‑Source Model Boosts Agent Performance

On April 23, Tencent released the open-source Hy3 preview, a 295B-parameter mixture-of-experts model with 21B active parameters and a 256K context length. It delivers substantial gains in complex reasoning, instruction following, code, and agent tasks, along with 40% faster inference, lower costs, and strong benchmark results across Tencent's AI products.

Benchmark Results · Hy3-preview · Large Language Model
0 likes · 9 min read
AntTech
Apr 23, 2026 · Artificial Intelligence

Ling-2.6-flash: Faster Response, Stronger Execution, and Higher Token Efficiency for Agent Workloads

Ling-2.6-flash is a 104B‑parameter Instruct model that uses a mixed‑linear architecture and token‑efficiency optimizations to achieve up to 340 tokens/s inference speed, 4× higher throughput than comparable models, and ten‑fold lower token consumption on Agent benchmarks, while maintaining SOTA performance.

Agent Optimization · LLM · benchmark
0 likes · 15 min read
AI Explorer
Mar 20, 2026 · Artificial Intelligence

Meta Agent Leak Triggers Zuckerberg’s Emergency Response and Signals New AI Strategy

Meta’s internal “Meta Agent” AI project was unexpectedly exposed, revealing a novel deep‑learning architecture focused on inference efficiency and multimodal understanding; the leak has sparked debate over whether it was an accident or a strategic signal in the escalating AI arms race, prompting Zuckerberg to act swiftly.

AI · AI competition · Meta
0 likes · 6 min read
Machine Learning Algorithms & Natural Language Processing
Mar 11, 2026 · Artificial Intelligence

Why LLMs Overthink: ICLR2026 Study Reveals the Key Bottleneck in Inference Efficiency

The ICLR2026 paper identifies reasoning miscalibration—overthinking easy steps and underthinking critical ones—as the root cause of runaway LLM inference costs, and proposes the Budget Allocation Model (BAM) and a training‑free Plan‑and‑Budget framework that smartly distributes compute, achieving up to 70% higher accuracy while cutting token usage by 39% and boosting the new E³ efficiency metric by 193.8%.

Budget Allocation Model · E3 Metric · Epistemic Uncertainty
0 likes · 12 min read
SuanNi
Feb 27, 2026 · Artificial Intelligence

Can Deep Thought Ratio Reveal the True Reasoning Power of LLMs?

This article introduces the Deep Thought Ratio (DTR) metric, explains how tracking token modifications across neural network layers quantifies genuine inference effort, and shows through extensive experiments that DTR predicts accuracy far better than token length while enabling a sampling strategy that halves computational cost.

AI metrics · LLM evaluation · Token analysis
0 likes · 9 min read
AI2ML AI to Machine Learning
Dec 29, 2025 · Artificial Intelligence

How Brin’s Return Powers Google’s First ‘Sword’: The TPU Hardware Revolution

The article examines Google’s AI resurgence after Sergey Brin’s comeback, detailing the evolution of TPU hardware from v1 to v7, the strategic focus on algorithmic efficiency, comparisons with Nvidia’s B200, the role of JAX/XLA, and how these advances create a powerful competitive moat for Google’s AI infrastructure.

AI hardware · Google TPU · JAX
0 likes · 8 min read
Network Intelligence Research Center (NIRC)
Mar 26, 2025 · Artificial Intelligence

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.

LLM · Low-Rank Approximation · Model Compression
0 likes · 8 min read