Why External Retrieval in RAG Is Redundant: Insights from NVIDIA’s INTRA Paper
The INTRA paper shows that using a decoder’s cross‑attention as an internal retrieval mechanism eliminates the need for a separate retriever, achieving state‑of‑the‑art multihop QA performance with only 164 K trainable parameters and shared pre‑encoded representations.
RAG’s Overlooked Problem
Traditional RAG pipelines pair a retriever (e.g., BM25, BGE, ColBERT), which fetches documents from a corpus, with a generator (an LLM), which re‑encodes the retrieved text before answering. The two components operate in different representation spaces, so the retriever's relevance scores are computed without reference to the states the generator actually attends to. This is the retriever‑generator mismatch.
INTRA’s Core Idea
INTRA treats attention as retrieval: the decoder’s cross‑attention queries the entire pre‑encoded corpus, scoring each chunk with a learned retrieval token using a MaxSim‑style similarity. The top‑n chunks are then fed back as the same pre‑encoded states for generation, removing the need for an external retriever.
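The scoring step described above can be sketched with a ColBERT-style MaxSim rule: for each retrieval-token query vector, take the best-matching token state in a chunk, then sum over query vectors. This is a minimal illustration, not the paper's implementation; the function names, array shapes, and the choice of a plain dot product are assumptions.

```python
import numpy as np

def maxsim_scores(retrieval_queries, chunk_states):
    """Score every chunk with a MaxSim-style rule (illustrative sketch).

    retrieval_queries: (R, d) array, one row per retrieval-token query vector.
    chunk_states: list of (T_i, d) arrays, the pre-encoded token states of each chunk.
    Returns an array with one score per chunk.
    """
    scores = []
    for states in chunk_states:
        sim = retrieval_queries @ states.T        # (R, T_i) query-token similarities
        scores.append(sim.max(axis=1).sum())      # best token per query, summed over queries
    return np.array(scores)

def top_n_chunks(scores, n):
    """Indices of the n highest-scoring chunks, best first (ties keep corpus order)."""
    return np.argsort(-scores, kind="stable")[:n]

# Toy usage with hand-written 2-d states (hypothetical values).
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
corpus = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.5, 0.5]]),
    np.array([[2.0, 0.0]]),
]
scores = maxsim_scores(queries, corpus)
selected = top_n_chunks(scores, n=2)
```

The states of the selected chunks would then be passed to the decoder as cross-attention inputs, which is where the "same pre-encoded states" reuse comes from.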
Attention = Retrieval
Both attention and retrieval are query‑conditioned matching over candidate states; mathematically they are the same operation instantiated differently. INTRA implements this by pre‑encoding all corpus chunks {k₁,…,kₘ} once, adding learnable retrieval tokens to the decoder, and using cross‑attention to score chunks.
Reverse‑QWK (RQWK) Engineering Trick
Standard Transformers use a distinct key projection matrix \(W_{K,l}\) in each layer, so caching projected keys for the whole corpus would require O(L×M) storage (one copy of the M corpus chunks for each of the L layers). INTRA moves the key projection to the query side: a shared normalized key \(\bar{K}=\text{RMSNorm}(K)\) is stored once, and each layer computes \(\tilde{q}_l = (q_l W_{K,l}^T) \odot \gamma_{K,l}\). The attention logits are unchanged, because \(q_l \big((\bar{K} \odot \gamma_{K,l}) W_{K,l}\big)^T = \tilde{q}_l \bar{K}^T\). Every layer therefore reads the same stored encoding, fully unifying retrieval and generation.
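The equivalence can be checked numerically. The sketch below assumes a row-vector convention in which a layer's keys are formed as `(K_bar * gamma) @ W_K`, which is the form implied by the query-side formula above; the dimensions and variable names are illustrative only.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Gain-free RMSNorm over the last axis."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def standard_logits(q, K_bar, W_K, gamma):
    """Project keys per layer, then take dot products with the query."""
    keys = (K_bar * gamma) @ W_K          # (n_tokens, d_head), one copy per layer
    return keys @ q                       # (n_tokens,)

def reversed_logits(q, K_bar, W_K, gamma):
    """Move the key projection and gain onto the query; keys stay shared."""
    q_tilde = (q @ W_K.T) * gamma         # (d_model,)
    return K_bar @ q_tilde                # (n_tokens,)

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 8, 5      # toy sizes, assumed for illustration
K_bar = rms_norm(rng.normal(size=(n_tokens, d_model)))  # stored once for all layers
q = rng.normal(size=d_head)               # one decoder query for one head
W_K = rng.normal(size=(d_model, d_head))  # this layer's key projection
gamma = rng.normal(size=d_model)          # this layer's RMSNorm gain

same = np.allclose(standard_logits(q, K_bar, W_K, gamma),
                   reversed_logits(q, K_bar, W_K, gamma))
```

One consequence visible in the shapes: the reversed query lives in the model dimension rather than the head dimension, so each dot product is wider. The trick trades some query-side compute for storing the corpus only once.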
Training and Inference
Only 164 K parameters are trained: the retrieval token embeddings (≈164 K) and a small set of layer‑wise aggregation weights (272). The encoder and decoder are frozen. Training optimises a soft cross‑entropy on oracle evidence chunks, teaching the retrieval token to place probability mass on the correct evidence.
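A soft cross-entropy objective of this kind can be sketched as follows. The assumptions here are that the target distribution is uniform over the oracle evidence chunks and that chunk scores are turned into probabilities with a softmax; the paper may weight or normalize differently.

```python
import numpy as np

def soft_cross_entropy(chunk_scores, oracle_mask):
    """Cross-entropy between softmax(chunk_scores) and a soft target.

    chunk_scores: (M,) retrieval scores, one per corpus chunk.
    oracle_mask: (M,) 0/1 flags marking oracle evidence chunks (at least one 1).
    The target spreads probability mass evenly over the oracle chunks (an assumption).
    """
    scores = np.asarray(chunk_scores, dtype=float)
    mask = np.asarray(oracle_mask, dtype=float)
    target = mask / mask.sum()
    shifted = scores - scores.max()                      # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

# Hypothetical scores: the loss falls as mass moves onto the two oracle chunks.
mask = [1, 1, 0, 0]
loss_uninformed = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], mask)
loss_better = soft_cross_entropy([3.0, 3.0, 0.0, 0.0], mask)
```

With several oracle chunks per question, a soft target avoids forcing the model to pick one piece of evidence over another, which matters for multihop questions.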
Because the pre‑encoded states are reused across queries, inference requires only two decoder forward passes and no re‑encoding, dramatically reducing latency when the corpus is static.
Empirical Results on Multihop QA
INTRA was evaluated on four Wikipedia QA benchmarks. On the multihop datasets HotPotQA, 2WikiMultihopQA, and MuSiQue, INTRA’s complete evidence recall rate surpasses nine baselines (sparse, dense, re‑ranking, hybrid, and ColBERT‑style MaxSim). The advantage stems from the decoder’s attention weights directly reflecting the information needed for answer generation, which is especially beneficial for assembling multiple evidence pieces.
On single‑hop Natural Questions the gain is modest, as only one supporting paragraph is required.
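One plausible reading of "complete evidence recall" is the fraction of questions for which every gold evidence chunk appears among the retrieved chunks. The sketch below implements that reading; the function name and the use of chunk IDs are assumptions, and the IDs in the usage lines are made up.

```python
def complete_evidence_recall(retrieved, gold):
    """Fraction of questions whose gold evidence is fully covered.

    retrieved: list of lists, retrieved chunk IDs per question.
    gold: list of lists, gold evidence chunk IDs per question.
    A question counts as a hit only if ALL its gold chunks were retrieved.
    """
    if len(retrieved) != len(gold) or not gold:
        raise ValueError("need one non-empty, equally long list per argument")
    hits = sum(1 for r, g in zip(retrieved, gold) if set(g) <= set(r))
    return hits / len(gold)

# Hypothetical chunk IDs for three questions.
retrieved = [[1, 2, 3], [4, 5], [7]]
gold = [[1, 3], [4, 6], [7]]
rate = complete_evidence_recall(retrieved, gold)
```

Under this definition a two-hop question with only one of its two passages retrieved scores zero, which is why the metric separates methods more sharply on multihop data than on single-hop data.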
End‑to‑End QA Quality
Using the same T5Gemma2 decoder, INTRA achieves the highest Exact Match (EM) scores: HotPotQA 41.3, 2Wiki 31.6, MuSiQue 15.8, outperforming BM25, BGE‑large, Qwen3‑Embedding, Hybrid RAG, and others.
When stronger generators (Qwen2.5‑7B, Qwen2.5‑72B) replace T5Gemma2, absolute EM improves, but the “Gap Closure” metric, which measures how much of the gap between random chunks and oracle evidence INTRA closes, decreases. This indicates a misalignment between the generator’s attention and INTRA’s retrieval signal.
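Read literally, a gap-closure metric normalizes a method's score between the random-chunk floor and the oracle-evidence ceiling. The formula below is an assumption based on that description, not a definition taken from the paper, and the numbers in the usage line are placeholders rather than reported results.

```python
def gap_closure(em_method, em_random, em_oracle):
    """Share of the random-to-oracle EM gap that a method closes.

    Returns 0.0 when the method is no better than random chunks
    and 1.0 when it matches oracle evidence.
    """
    gap = em_oracle - em_random
    if gap <= 0:
        raise ValueError("oracle EM must exceed random-chunk EM")
    return (em_method - em_random) / gap

# Placeholder values only, chosen to make the arithmetic easy to follow.
example = gap_closure(em_method=30.0, em_random=10.0, em_oracle=50.0)
```

Because the denominator depends on the generator, a stronger generator can raise absolute EM while the normalized value falls, which is the pattern the article describes.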
Conclusion
The paper argues that the fundamental issue is not RAG’s overall quality but the separation of retrieval and generation into distinct representation spaces. By unifying them through attention‑based internal retrieval, the retriever‑generator mismatch disappears, leading to more precise evidence assembly for multihop reasoning.
Current limitations include reliance on a relatively small encoder‑decoder model (T5Gemma2 4B); larger decoder‑only models remain stronger. Nonetheless, the approach suggests that future large encoder‑decoder architectures could amplify INTRA’s benefits.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
https://arxiv.org/pdf/2605.05806