Tagged articles
3 articles
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

Latent‑Condensed Attention (LCA) cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5×, and speeds up decoding by 1.8× on 128K‑token contexts. It adds no parameters and preserves model performance across diverse LLMs.

Efficient Attention · Inference Acceleration · KV cache reduction
0 likes · 10 min read
Baobao Algorithm Notes
Apr 27, 2026 · Artificial Intelligence

Deep Dive into DeepSeek‑V4: Efficient Million‑Token Context, Hybrid Attention, and Muon Optimizer

The article provides an in‑depth technical analysis of DeepSeek‑V4, detailing its novel hybrid attention architecture (CSA and HCA), the manifold‑constrained hyper‑connection (mHC), massive KV‑cache reductions, FLOPs savings across token lengths, and the Muon optimizer with Newton‑Schulz orthogonalization, all backed by concrete benchmark tables and code snippets.

DeepSeek · Efficient Attention · KV cache reduction
0 likes · 61 min read