Solo Development of GQLA: Challenging DeepSeek’s MLA and DSA

This article presents GQLA, a single‑author variant of MLA that eliminates three hardware‑related drawbacks of MLA, demonstrates how it achieves balanced compute‑memory performance on both high‑end H100 and more modest H20 GPUs, and details conversion methods (TransGQLA) and sparse extensions with concrete benchmark results.

Machine Learning Algorithms & Natural Language Processing

Motivation and Scope

The work refines an earlier idea to address the limitations of Multi‑head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA). The goal is to match MLA’s performance on compute‑rich NVIDIA H100 GPUs while providing a balanced compute‑memory profile on compute‑constrained H20 GPUs.

Computation Intensity Analysis

LLM pre‑training and pre‑fill are FLOP‑limited, whereas autoregressive decoding is limited by KV‑cache memory traffic. MLA offers two mathematically equivalent execution paths:

Training/prefill: latent vectors are restored to per‑head K and V, enabling a compute‑bound multi‑head‑attention‑like path that reduces FLOPs.

Decoding: the K/V projection matrices are absorbed into the query and output projections, yielding an MQA‑like path that minimizes HBM traffic.
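The equivalence of the two paths follows from associativity of matrix products. The sketch below checks it for a single head and a single cached token, using toy dimensions. The names `W_uk`, `q`, and `c` are illustrative rather than taken from any codebase, and RoPE and the value/output side are left out for brevity.

```python
import random

random.seed(0)
d_head, d_latent = 4, 6  # toy sizes, not the real model dimensions


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of matrix M as a list of rows."""
    return [list(col) for col in zip(*M)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical per-head key up-projection of shape (d_head, d_latent).
W_uk = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
q = [random.gauss(0, 1) for _ in range(d_head)]    # query of one head
c = [random.gauss(0, 1) for _ in range(d_latent)]  # cached latent of one token

# Training/prefill path: restore the per-head key, then score against it.
k = matvec(W_uk, c)
score_restore = dot(q, k)

# Decoding path: absorb W_uk into the query, then score against the latent.
q_absorbed = matvec(transpose(W_uk), q)
score_absorb = dot(q_absorbed, c)

print(abs(score_restore - score_absorb) < 1e-9)
```

The restore path does its extra work per cached token, while the absorb path does it once per query, which is why the latter suits decoding over a long cache.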

On NVIDIA H100, the BF16 roofline ridge point is ≈295 FLOPs/byte. With a typical configuration (128, 128, 512, 64), read here as 128 query heads, per‑head dimension 128, latent dimension 512, and shared RoPE dimension 64, and with single‑token decoding, MLA’s absorb‑MQA path sits just below this ridge, so it comes close to saturating both memory bandwidth and compute. Enabling Multi‑Token Prediction (MTP) with two query tokens per step doubles the FLOPs/byte, pushing the workload above the ridge and making MLA‑absorb compute‑bound, which drastically reduces the ideal MTP throughput gain.
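The position relative to the ridge can be reproduced with a back-of-the-envelope count. The sketch below rests on several assumptions that the text does not state: H100 SXM peaks of 989 TFLOPS dense BF16 and 3.35 TB/s HBM bandwidth, a latent dimension of 512 with a 64-dimensional shared RoPE key, 2 FLOPs per multiply-accumulate, and a context long enough that only KV-cache reads matter. The function name is hypothetical.

```python
# Assumed H100 SXM peaks (dense BF16 and HBM3); these are not given in the text.
H100_PEAK_FLOPS = 989e12
H100_BANDWIDTH = 3.35e12  # bytes per second


def mla_absorb_intensity(n_heads=128, d_latent=512, d_rope=64,
                         s_q=1, bytes_per_elem=2):
    """FLOPs per byte of KV cache read, counted per cached token.

    s_q is the number of query tokens handled per decode step
    (1 for plain decoding, 2 for one extra MTP token).
    """
    # Scores: every head dots its absorbed query with latent + RoPE key.
    qk_flops = 2 * n_heads * (d_latent + d_rope)
    # Output: every head accumulates a weighted sum of the latent.
    pv_flops = 2 * n_heads * d_latent
    # One latent vector plus one RoPE key is read per cached token.
    cache_bytes = (d_latent + d_rope) * bytes_per_elem
    return s_q * (qk_flops + pv_flops) / cache_bytes


h100_ridge = H100_PEAK_FLOPS / H100_BANDWIDTH
single_token = mla_absorb_intensity(s_q=1)
two_tokens = mla_absorb_intensity(s_q=2)

print(f"H100 ridge:          {h100_ridge:.1f} FLOPs/byte")
print(f"MLA absorb, s_q = 1: {single_token:.1f} FLOPs/byte")
print(f"MLA absorb, s_q = 2: {two_tokens:.1f} FLOPs/byte")
```

Under these assumptions single-token decoding lands at roughly 242 FLOPs/byte, below the ≈295 ridge, while two query tokens per step land at roughly 484 FLOPs/byte, above it.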

Hardware‑Related Drawbacks of MLA

Hardware coupling: MLA is tuned for H100’s favorable compute‑to‑bandwidth ratio. On H20 (≈4 TB/s bandwidth, 148 TFLOPS BF16) the ridge point drops to ≈37 FLOPs/byte, so MLA’s compute intensity lies far above the ridge and decoding becomes compute‑bound.

Tensor‑parallelism unfriendly: The absorb‑MQA path shares a single KV cache across all heads, preventing tensor parallelism along the head/group axis; each card must duplicate the KV cache.

MTP unfriendly: Enabling MTP on H100 pushes MLA above the ridge, turning it compute‑bound; on H20, MTP yields zero throughput gain.
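The MTP claims can be checked with a simple roofline model in which one decode step takes the larger of its compute time and its memory time. This is a rough sketch, not a measurement: the H100 peaks (989 TFLOPS, 3.35 TB/s) are assumed, the MLA intensity is derived from an assumed latent dimension of 512 and RoPE dimension of 64, and every drafted token is assumed to be accepted.

```python
def step_time(flops, bytes_moved, peak_flops, bandwidth):
    """Roofline lower bound on the time of one decode step."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def ideal_mtp_gain(intensity_1, peak_flops, bandwidth, s_q=2):
    """Ideal tokens/s gain of s_q-token decoding over single-token decoding.

    intensity_1 is the FLOPs/byte at s_q = 1. Cache traffic is assumed
    not to grow with s_q, so bytes are normalised to 1.
    """
    bytes_moved = 1.0
    t_single = step_time(intensity_1, bytes_moved, peak_flops, bandwidth)
    t_multi = step_time(s_q * intensity_1, bytes_moved, peak_flops, bandwidth)
    return s_q * t_single / t_multi


# MLA absorb path: 128 heads, latent 512, RoPE 64, BF16 cache (assumed reading).
mla_intensity = 2 * 128 * ((512 + 64) + 512) / ((512 + 64) * 2)

gain_h100 = ideal_mtp_gain(mla_intensity, peak_flops=989e12, bandwidth=3.35e12)
gain_h20 = ideal_mtp_gain(mla_intensity, peak_flops=148e12, bandwidth=4.0e12)

print(f"ideal MTP gain on H100: {gain_h100:.2f}x")
print(f"ideal MTP gain on H20:  {gain_h20:.2f}x")
```

In this model the H100 gain is about 1.22× rather than 2×, because the second query token pushes the step across the ridge, and the H20 gain is 1.00×, because the step is already compute-bound before MTP is enabled.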

Group‑Query Latent Attention (GQLA)

GQLA modifies MLA by indexing the up‑projection by group rather than by individual query head. This yields two mathematically equivalent decoding paths:

MQA‑absorb path (shared with MLA): caches the latent vector together with the shared RoPE key, so each token stores one latent entry that all heads read; the up‑projection is absorbed, preserving MLA’s high performance on H100.

GQA path (unique to GQLA): expands the cache per group, matching a standard GQA model plus a few extra shared RoPE elements; this path supports tensor parallelism and does not require per‑step latent expansion.
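The difference between the two cache layouts is easiest to see as a per-token element count. The sketch below assumes a latent dimension of 512, a per-head dimension of 128, a 64-dimensional shared RoPE key, and 8 KV groups. It also assumes that under tensor parallelism the groups are sharded across ranks while the RoPE key is replicated on each rank; the text does not spell out that detail. Both function names are hypothetical.

```python
def absorb_cache_elems(d_latent=512, d_rope=64):
    """Elements cached per token on the MQA-absorb path.

    The latent is shared by all heads, so every tensor-parallel rank
    needs the full entry regardless of the TP degree.
    """
    return d_latent + d_rope


def gqa_cache_elems(n_groups=8, d_head=128, d_rope=64, tp=1):
    """Elements cached per token and per rank on the GQA path.

    Groups are split across tp ranks; the shared RoPE key is assumed
    to be replicated on every rank.
    """
    if n_groups % tp != 0:
        raise ValueError("tp must divide the number of KV groups")
    groups_per_rank = n_groups // tp
    # Each group stores one key and one value of size d_head.
    return groups_per_rank * 2 * d_head + d_rope


for tp in (1, 2, 4, 8):
    print(f"TP={tp}: absorb {absorb_cache_elems()} elems/token/rank, "
          f"GQA {gqa_cache_elems(tp=tp)} elems/token/rank")
```

Without tensor parallelism the GQA layout is larger (2112 versus 576 elements per token), but its per-rank footprint shrinks as the TP degree grows, whereas the absorb layout stays at 576 on every rank.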

MLA Drawback                | GQLA Solution
---------------------------|-----------------------------------------------
1. Hardware coupling      | Deploy MQA‑absorb on H100, GQA on H20
2. TP unfriendly          | GQA path natively supports group‑axis TP (g=8)
3. MTP unfriendly         | g=8, s_q=2 pushes compute intensity to ~38.8 FLOPs/byte, near H20 ridge; MTP yields ~2× speedup on GQA
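The ~38.8 FLOPs/byte figure in the last row can be reproduced under a simple counting model. The sketch assumes 128 query heads, 8 KV groups, per-head dimension 128, a 64-dimensional shared RoPE key, a BF16 cache, 2 FLOPs per multiply-accumulate, and a context long enough that only KV-cache reads matter; `s_q` is taken to mean query tokens per decode step. The function name is hypothetical.

```python
H20_PEAK_FLOPS = 148e12  # BF16, as quoted in the text
H20_BANDWIDTH = 4.0e12   # bytes per second, as quoted in the text


def gqa_path_intensity(n_heads=128, n_groups=8, d_head=128, d_rope=64,
                       s_q=1, bytes_per_elem=2):
    """FLOPs per byte of KV cache read on the GQA path, per cached token."""
    # Scores: every query head dots with its group's key plus the RoPE key.
    qk_flops = 2 * n_heads * (d_head + d_rope)
    # Output: every query head accumulates its group's value.
    pv_flops = 2 * n_heads * d_head
    # Per token: one key and one value per group, plus one shared RoPE key.
    cache_bytes = (n_groups * 2 * d_head + d_rope) * bytes_per_elem
    return s_q * (qk_flops + pv_flops) / cache_bytes


h20_ridge = H20_PEAK_FLOPS / H20_BANDWIDTH
intensity_mtp = gqa_path_intensity(s_q=2)

# Ideal tokens/s gain when the s_q = 1 step is memory-bound:
# s_q tokens per step, slowed down only if the step crosses the ridge.
ideal_gain = 2 * min(1.0, h20_ridge / intensity_mtp)

print(f"H20 ridge:           {h20_ridge:.1f} FLOPs/byte")
print(f"GQA path, s_q = 2:   {intensity_mtp:.1f} FLOPs/byte")
print(f"ideal MTP gain:      {ideal_gain:.2f}x")
```

With these inputs the GQA path at `s_q = 2` sits at about 38.8 FLOPs/byte against a ridge of 37, and the ideal MTP gain comes out near 1.9×, consistent with the "~2×" in the table.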

Optimal Configurations

H100 deployment: use the MQA‑absorb path, disable MTP, keep the workload below the ridge point; single‑step attention matches MLA’s optimal point.

H20 deployment: use the GQA path, enable MTP, place the workload at the ridge point, achieving simultaneous bandwidth and compute saturation.

Both deployments share the same weight set (128 query heads, 8 KV groups) and a single MTP head, requiring no retraining or custom kernels (MQA‑absorb reuses the MLA‑absorb kernel; GQA reuses the existing GQA kernel).

TransGQLA: Converting Existing Models without Full Pre‑training

Training a GQLA model from scratch is costly. TransGQLA extends TransMLA with a single targeted modification: the up‑projection is indexed by group instead of by head, which preserves both the MQA‑absorb and GQA paths. Example: compressing the 20 KV heads of GLM‑4.7 into 4 groups (20:4) lowers the average benchmark score by about 4.7 points (69.14 → 64.45) without additional training. A similar conversion works for LLaMA‑3‑8B, matching TransMLA’s performance.

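Model          | Training tokens | Avg.  | MMLU  | ARC   | PIQA  | HellaSwag | OpenBookQA | WinoGrande
---------------|-----------------|-------|-------|-------|-------|-----------|------------|-----------
GLM‑4.7        | 20T             | 69.14 | 73.82 | 69.0  | 88.0  | 70.0      | 38.0       | 76.0
MLA→GQLA       | 0               | 64.45 | 60.82 | 63.86 | 78.18 | 73.04     | 41.60      | 69.22
LLaMA‑3‑8B     | 15T             | 63.84 | 46.20 | 65.75 | 80.47 | 76.20     | 45.60      | 68.82
GQA→GQLA       | 0               | 54.13 | 36.38 | 52.84 | 73.83 | 64.34     | 37.00      | 60.38
GQA→MLA(sft)   | 30B             | 63.39 | 46.18 | 66.30 | 80.30 | 76.33     | 45.00      | 66.22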
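The averages and the size of each drop can be re-derived directly from the table. The short script below uses only the six benchmark columns as printed and reports each converted model's change in both points and relative terms; the row labels are spelled in ASCII for convenience.

```python
# Benchmark scores copied from the table (MMLU, ARC, PIQA, HS, OBQA, WG).
scores = {
    "GLM-4.7":       [73.82, 69.0, 88.0, 70.0, 38.0, 76.0],
    "MLA->GQLA":     [60.82, 63.86, 78.18, 73.04, 41.60, 69.22],
    "LLaMA-3-8B":    [46.20, 65.75, 80.47, 76.20, 45.60, 68.82],
    "GQA->GQLA":     [36.38, 52.84, 73.83, 64.34, 37.00, 60.38],
    "GQA->MLA(sft)": [46.18, 66.30, 80.30, 76.33, 45.00, 66.22],
}

# Unweighted mean over the six benchmarks, rounded like the Avg. column.
avg = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

# Each converted model paired with the base model it was derived from.
pairs = [
    ("GLM-4.7", "MLA->GQLA"),
    ("LLaMA-3-8B", "GQA->GQLA"),
    ("LLaMA-3-8B", "GQA->MLA(sft)"),
]

drops = {}
for base, converted in pairs:
    points = round(avg[base] - avg[converted], 2)
    relative = 100.0 * points / avg[base]
    drops[converted] = (points, relative)
    print(f"{converted:14s} avg {avg[converted]:.2f}  "
          f"drop {points:.2f} points ({relative:.1f}% relative to {base})")
```

The recomputed means agree with the Avg. column, and the GLM‑4.7 conversion shows a drop of 4.69 points, which is about 6.8% in relative terms.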

Sparse GQLA

DSA adds token‑wise top‑k sparsity to MLA, but sparse MLA remains locked to the MQA‑absorb path and incurs high compute on H20 because each shared KV head requires at least 16 query heads to fill the MMA tile. GQLA’s GQA path aligns with the MMA tile, allowing sparse GQLA to train and decode on H20 via a sparse GQA path while retaining sparse MQA‑absorb on H100.
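For readers unfamiliar with token-wise top-k sparsity, the sketch below shows the general idea for one query: a separate score per cached token selects which k tokens take part in attention, and softmax is computed over that subset only. This is a generic illustration, not DSA's actual indexer or kernel. The `index_scores` input and the function name are hypothetical, and how those scores are produced is outside the sketch.

```python
import math


def topk_sparse_attention(q, keys, values, index_scores, k):
    """Attend only to the k cached tokens with the highest index_scores.

    q: query vector; keys/values: one vector per cached token;
    index_scores: one selection score per cached token (assumed given).
    Returns the attention output and the sorted list of selected positions.
    """
    order = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = order[:k]

    # Scaled dot-product logits over the selected tokens only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in selected]

    # Numerically stable softmax over the selected subset.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, sorted(selected)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
index_scores = [0.1, 0.9, 0.5]
q = [1.0, 0.0]

out_top1, picked_top1 = topk_sparse_attention(q, keys, values, index_scores, k=1)
out_top2, picked_top2 = topk_sparse_attention(q, keys, values, index_scores, k=2)
print(picked_top1, out_top1)
print(picked_top2, out_top2)
```

Because only k cached entries are read per query, both cache traffic and FLOPs scale with k instead of with the full context length. Which dense path the sparse variant builds on is a separate choice.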

Conclusion

The three inter‑related hardware drawbacks of MLA are (i) suitability only for H100‑class GPUs, (ii) lack of tensor‑parallelism support in the absorb path, and (iii) loss of MTP benefits on both H100 and H20. GQLA, by indexing the up‑projection by group, provides mathematically equivalent MQA‑absorb and GQA paths that together resolve all three issues. The same weight set serves both optimal roofline points on H100 and H20, requiring no retraining or custom kernels. Sparse GQLA extends these benefits to sparse attention scenarios, and TransGQLA offers a low‑cost conversion pipeline from existing MLA/GQA checkpoints.

References

Article link: https://huggingface.co/papers/2605.15250
Code link: https://github.com/MuLabPKU/TransArch