MoBA: Mixture of Block Attention for Long‑Context Large Language Models
This article introduces MoBA (Mixture of Block Attention), a mechanism that applies Mixture-of-Experts principles to transformer attention. Through sparse, trainable block selection, it enables efficient long-context processing for large language models while maintaining performance comparable to full attention, and it can switch seamlessly between sparse and full attention modes.
On the same day, DeepSeek and Moonshot AI (Moonshot) each released a paper proposing a new attention mechanism: DeepSeek's NSA and Moonshot's MoBA (Mixture of Block Attention). While DeepSeek published only a paper, Moonshot also open-sourced code that has already been deployed in production for about a year, which speaks to its robustness.
MoBA treats attention as a mixture‑of‑experts problem: the context is divided into blocks, and a parameter‑free top‑k gating network routes each query token to the most relevant blocks. This trainable block‑sparse attention reduces the quadratic cost of traditional attention to sub‑quadratic while preserving the ability to switch seamlessly between full and sparse modes.
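The routing step described above can be sketched in a few lines. The following is a minimal NumPy illustration of parameter-free top-k gating, scoring each block by the dot product between the query and the block's mean-pooled keys; it is a simplified single-query sketch, not Moonshot's implementation, and it ignores causal masking, batching, and the softmax over selected keys:

```python
import numpy as np

def moba_topk_gating(q, K, block_size, k):
    """Parameter-free gating sketch: score each context block by the dot
    product between the query q and the block's mean-pooled keys, then
    route the query to the top-k blocks. Hypothetical simplification."""
    n, d = K.shape
    n_blocks = n // block_size
    # Mean-pool keys within each block -> one representative vector per block.
    block_means = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    scores = block_means @ q                 # affinity of the query to each block
    top_blocks = np.argsort(scores)[-k:]     # indices of the k highest-scoring blocks
    return np.sort(top_blocks)

# Toy usage: 64 keys, blocks of 16 tokens, route the query to the top-2 blocks.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(64, 8))
selected = moba_topk_gating(q, K, block_size=16, k=2)
print(selected)
```

Attention is then computed only against the keys and values inside the selected blocks, which is what makes the overall cost sub-quadratic.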
The paper details the architecture, including block partitioning, the routing strategy, and the parameter-free gating mechanism. Scaling-law experiments and ablations show that MoBA matches full-attention performance at up to 75% sparsity, with the loss gap staying within 1e-3 across various model sizes.
MoBA’s scalability was verified by extending the sequence length from 8k to 32k tokens, with only a slight loss increase. Ablation studies highlighted the importance of fine-grained block sizes: coarser block partitions increased loss by roughly 1e-2.
Hybrid training strategies were explored: a two‑stage approach uses MoBA for 90% of tokens and switches to full attention for the remaining 10%, achieving loss comparable to pure full attention. A layered hybrid—replacing the top transformer layers with full attention while keeping lower layers as MoBA—mitigated sparse‑gradient issues observed during supervised fine‑tuning.
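The layered hybrid can be expressed as a simple per-layer schedule: lower layers run MoBA's block-sparse attention while the top few layers run full attention. The helper below is hypothetical (the function name and string labels are illustrative, not from the MoBA codebase):

```python
def layer_attention_schedule(n_layers, n_full_top):
    """Layer-wise hybrid sketch: the bottom layers use MoBA block-sparse
    attention, and the top n_full_top layers use full attention. This
    mirrors the layered-hybrid strategy described in the paper."""
    return [
        "full" if i >= n_layers - n_full_top else "moba"
        for i in range(n_layers)
    ]

# A 32-layer model with the top 3 layers kept as full attention.
schedule = layer_attention_schedule(n_layers=32, n_full_top=3)
```

Keeping the final layers dense is what mitigates the sparse-gradient problem: gradients flowing into the output head always pass through at least a few full-attention layers.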
Using the Llama‑3.1 8B base, Moonshot built Llama‑8B1M‑MoBA, extending context up to 1 M tokens with 95.31% attention sparsity, keeping the last three layers as full attention. Evaluation on benchmarks such as RULER showed MoBA’s performance nearly identical to full‑attention models, even under high sparsity.
Efficiency gains are substantial: processing 1 M tokens is 6.5× faster than full attention, and at 10 M tokens MoBA achieves a 16× speedup over standard FlashAttention, reducing computational complexity from quadratic to sub‑quadratic.
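A back-of-the-envelope cost model shows why block-sparse routing is sub-quadratic. The model below is an assumption for illustration only: it counts score computations per layer and ignores kernel, memory-bandwidth, and FlashAttention effects, so it will not reproduce the reported 6.5× and 16× wall-clock numbers:

```python
def attention_score_ratio(n, block_size, k):
    """Rough cost-model sketch (assumed, not measured): full attention
    computes ~n*n query-key scores per layer, while MoBA computes
    ~n * (k * block_size) scores inside the selected blocks plus
    ~n * (n / block_size) cheap gating scores against block means."""
    full = n * n
    moba = n * (k * block_size + n // block_size)
    return full / moba

# Illustrative parameters (block size and k are hypothetical choices).
ratio = attention_score_ratio(n=1_000_000, block_size=4096, k=12)
```

Because the selected-block term grows linearly in n while full attention grows quadratically, the ratio keeps widening as the context extends toward 10M tokens.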
For full details, refer to the MoBA technical report and the accompanying GitHub repository.
Architecture Digest