
Why External Retrieval in RAG Is Redundant: Insights from NVIDIA’s INTRA Paper

The INTRA paper argues that a decoder’s cross‑attention can serve as an internal retrieval mechanism, removing the need for a separate retriever. It reports state‑of‑the‑art multihop QA performance with only about 164 K trainable parameters and pre‑encoded representations shared between retrieval and generation.

PaperAgent

May 26, 2026

RAG’s Overlooked Problem

Traditional RAG pipelines pair a retriever (e.g., BM25, BGE, ColBERT), which fetches documents from a corpus, with a generator (an LLM), which re‑encodes the retrieved text before answering. The two components operate in different representation spaces, and this causes a retriever‑generator mismatch.

INTRA’s Core Idea

INTRA treats attention as retrieval: the decoder’s cross‑attention queries the entire pre‑encoded corpus, and a learned retrieval token scores each chunk with a MaxSim‑style similarity. The top‑n chunks are then passed to generation as those same pre‑encoded states, with no re‑encoding, which removes the need for an external retriever.
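The scoring and selection step can be sketched in a few lines. This is a minimal illustration assuming the ColBERT-style form of MaxSim (for each query-side token, take its best match among a chunk's states, then sum); the exact scoring INTRA uses may differ, and the names `maxsim_score` and `top_n_chunks` and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], chunk_states: List[Vector]) -> float:
    """Sum, over query-side tokens, of the best dot product against any chunk state."""
    return sum(max(dot(q, k) for k in chunk_states) for q in query_tokens)


def top_n_chunks(query_tokens: List[Vector], corpus: List[List[Vector]], n: int) -> List[int]:
    """Indices of the n highest-scoring chunks, best first."""
    scores = [maxsim_score(query_tokens, chunk) for chunk in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Toy data: two retrieval-token vectors and three pre-encoded chunks (assumed values).
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = [
    [[1.0, 0.0], [0.0, 0.0]],  # matches only the first query vector
    [[1.0, 0.0], [0.0, 1.0]],  # matches both query vectors
    [[0.2, 0.2], [0.0, 0.0]],  # weak match to both
]
selected = top_n_chunks(query, corpus, n=2)
```

In this toy setup the chunk that covers both query vectors ranks first. The selected indices would then point back into the same stored states for generation.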

Attention = Retrieval

Both attention and retrieval are query‑conditioned matching over candidate states; mathematically they are the same operation instantiated differently. INTRA implements this by pre‑encoding all corpus chunks {k₁,…,kₘ} once, adding learnable retrieval tokens to the decoder, and using cross‑attention to score chunks.
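To make the correspondence concrete, here is a sketch in standard scaled dot-product notation rather than the paper's own; the symbols \(q\), \(v_i\), \(d\), \(s_i\), and \(\alpha_i\) are assumptions introduced for illustration. For a query \(q\) and chunk keys \(k_1,\dots,k_m\), attention computes

\[
s_i = \frac{q^\top k_i}{\sqrt{d}}, \qquad
\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{m}\exp(s_j)}, \qquad
\text{output} = \sum_{i=1}^{m} \alpha_i v_i ,
\]

while top-n retrieval keeps the \(n\) indices with the largest \(s_i\). Since \(\alpha_i > \alpha_j\) exactly when \(s_i > s_j\), ranking chunks by attention weight gives the same order as ranking by score. Attention takes a soft weighted sum over all candidates; retrieval makes a hard selection from the same scores.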

Reverse‑QWK (RQWK) Engineering Trick

Standard Transformers use a distinct key projection matrix \(W_{K,l}\) per layer, so storing projected keys directly would cost O(L×M) for L layers and M pre‑encoded corpus states. INTRA moves the key projection to the query side: a shared normalized key \(\bar{K}=\text{RMSNorm}(K)\) is stored once, and each layer computes \(\tilde{q}_l = (q_l W_{K,l}^T) \odot \gamma_{K,l}\). The attention scores are mathematically unchanged, but every layer now reads the same stored encoding, which fully unifies retrieval and generation.
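The equivalence can be checked numerically. The sketch below assumes the usual key path is RMSNorm, then an elementwise gain \(\gamma_{K,l}\), then the projection \(W_{K,l}\), which is what the formula above implies but is not spelled out in this summary. All function names, shapes, and random values are hypothetical.

```python
import math
import random


def rms_normalize(x, eps=1e-6):
    """RMSNorm without the learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(x, W):
    """Row vector x (length d) times W (d x h) -> length h."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def project_transposed(q, W):
    """Row vector q (length h) times W^T (h x d) -> length d."""
    return [dot(q, row) for row in W]


def standard_score(q, state, W, gamma):
    """Key-side path: normalize, apply gain, project, then dot with the query."""
    k_bar = rms_normalize(state)
    key = project([g * v for g, v in zip(gamma, k_bar)], W)
    return dot(q, key)


def reversed_score(q, k_bar, W, gamma):
    """Query-side path: fold W^T and the gain into the query, dot with the shared key."""
    q_tilde = [g * v for g, v in zip(gamma, project_transposed(q, W))]
    return dot(q_tilde, k_bar)


random.seed(0)
d, h = 8, 4  # model width and head width (toy sizes)
state = [random.gauss(0, 1) for _ in range(d)]
W = [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)]
gamma = [random.gauss(1, 0.1) for _ in range(d)]
q = [random.gauss(0, 1) for _ in range(h)]

k_bar = rms_normalize(state)  # stored once, reusable by every layer
difference = abs(standard_score(q, state, W, gamma) - reversed_score(q, k_bar, W, gamma))
```

The two scores agree up to floating-point rounding. Only `k_bar` depends on the corpus, so per-layer quantities (`W`, `gamma`) never need to be applied to stored states.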

Training and Inference

Only 164 K parameters are trained: the retrieval token embeddings (≈164 K) and a small set of layer‑wise aggregation weights (272). The encoder and decoder are frozen. Training optimises a soft cross‑entropy on oracle evidence chunks, teaching the retrieval token to place probability mass on the correct evidence.
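A soft cross-entropy over chunk scores can be sketched as follows. The assumption here, not confirmed by this summary, is that the target distribution spreads probability uniformly over the oracle evidence chunks and that chunk scores are normalized with a softmax; the function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of chunk scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def soft_cross_entropy(scores, oracle_indices):
    """Cross-entropy against a target that is uniform over the oracle chunks."""
    if not oracle_indices:
        raise ValueError("at least one oracle chunk is required")
    target = 1.0 / len(oracle_indices)
    log_probs = log_softmax(scores)
    return -sum(target * log_probs[i] for i in oracle_indices)


# Four chunks, of which chunks 0 and 1 are the oracle evidence (toy values).
loss_uniform = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], [0, 1])
loss_good = soft_cross_entropy([5.0, 5.0, 0.0, 0.0], [0, 1])
loss_bad = soft_cross_entropy([0.0, 0.0, 5.0, 5.0], [0, 1])
```

The loss falls as more score mass lands on the oracle chunks, which is the behaviour the training objective is meant to encourage in the retrieval token.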

Because the pre‑encoded states are reused across queries, inference requires only two decoder forward passes and no re‑encoding, dramatically reducing latency when the corpus is static.

Empirical Results on Multihop QA

INTRA was evaluated on four Wikipedia QA benchmarks. On the multihop datasets HotPotQA, 2WikiMultihopQA, and MuSiQue, INTRA’s complete evidence recall rate surpasses nine baselines (sparse, dense, re‑ranking, hybrid, and ColBERT‑style MaxSim). The advantage stems from the decoder’s attention weights directly reflecting the information needed for answer generation, which is especially beneficial for assembling multiple evidence pieces.

On single‑hop Natural Questions the gain is modest, as only one supporting paragraph is required.

End‑to‑End QA Quality

Using the same T5Gemma2 decoder, INTRA achieves the highest Exact Match (EM) scores: HotPotQA 41.3, 2Wiki 31.6, MuSiQue 15.8, outperforming BM25, BGE‑large, Qwen3‑Embedding, Hybrid RAG, and others.

When stronger generators (Qwen2.5‑7B, Qwen2.5‑72B) replace T5Gemma2, absolute EM improves, but the “Gap Closure” metric, which measures how much of the gap between random chunks and oracle evidence INTRA closes, decreases. This indicates a misalignment between the new generator’s attention and INTRA’s retrieval signal.
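Gap Closure is described here only in words. One plausible reading, offered as an assumption rather than the paper's definition, is the fraction of the random-to-oracle EM gap that the retrieval method recovers. The function name and the numbers in the example are hypothetical and are not results from the paper.

```python
def gap_closure(em_method: float, em_random: float, em_oracle: float) -> float:
    """Fraction of the random-to-oracle EM gap recovered (assumed definition)."""
    gap = em_oracle - em_random
    if gap <= 0:
        raise ValueError("oracle EM must exceed random-chunk EM")
    return (em_method - em_random) / gap


# Hypothetical EM scores, for illustration only.
small_generator = gap_closure(em_method=40.0, em_random=20.0, em_oracle=60.0)
large_generator = gap_closure(em_method=50.0, em_random=30.0, em_oracle=80.0)
```

Under this reading, absolute EM can rise (40 to 50 in the made-up numbers) while the ratio drops, because the oracle ceiling rises faster than the retrieval-driven score.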

Conclusion

The paper argues that the fundamental issue is not RAG’s overall quality but the separation of retrieval and generation into distinct representation spaces. By unifying them through attention‑based internal retrieval, the retriever‑generator mismatch disappears, leading to more precise evidence assembly for multihop reasoning.

Current limitations include reliance on a relatively small encoder‑decoder model (T5Gemma2 4B); larger decoder‑only models remain stronger. Nonetheless, the approach suggests that future large encoder‑decoder architectures could amplify INTRA’s benefits.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models
https://arxiv.org/pdf/2605.05806
