AgenticRAG Delivers 5.9× Recall Boost in Enterprise Retrieval – Real‑World Pre‑Production Results
The article analyzes Microsoft’s AgenticRAG, a tool‑based RAG framework that lets LLMs control retrieval, showing up to a 5.9× recall improvement over standard methods, reduced need for fine‑tuning, and practical design insights from pre‑production deployment.
Fundamental Issue of Standard RAG
Standard RAG assumes retrieval is completed before the LLM starts reasoning, so the model receives a fixed set of documents and cannot ask to search again or explore further. This works for simple fact queries but fails for enterprise queries that are highly contextual and whose answers are scattered across many long documents.
Giving Retrieval Control to the LLM
AgenticRAG introduces a lightweight agent framework with four tools—search, find, open, summarize—allowing the LLM to decide what to retrieve, where to look, and when to compress context. The system runs in a bounded iterative loop (default max 15 rounds) and stops when the model outputs an answer or the iteration limit is reached.
Four Tools
search – delegates to an enterprise search stack (e.g., Azure AI Search), issues up to five parallel query rewrites, returns snippets with metadata and a unique reference ID.
find – precise in‑document search given a reference ID and keyword patterns, returns up to two matching paragraphs (≈11 K tokens).
open – scrollable window reading, returns a fixed 1 800‑line window with line numbers so the model can jump to any part of a long document.
summarize – triggered when the token budget (≈128 K) is exhausted; the model marks reference IDs to keep, and the system discards unreferenced tool outputs, preserving useful context.
Inference Loop
Each round the LLM sees the conversation history and tool schema, chooses either to call a tool (appending the result) or to produce the final answer. Termination occurs when the model outputs a response or the maximum number of iterations is reached.
Method Details
Using Search Results
The search tool returns only snippets; the model must decide which documents merit deeper inspection using either find (when the target content is known) or open (when the location is known).
Parallel Query Rewrites
Up to five query rewrites can be issued in a single tool call. Ablation shows negligible impact on recall (44.84 % vs 49.59 %) but reduces average tool calls from 6.79 to 4.79 (‑29 %).
Context Management
Each tool call can load about 11 K tokens. When the 128 K window is 90 % full, an internal warning is issued; at 100 % the summarize tool is forced.
Strategy Differences Between Claude and GPT‑5‑mini
Claude Sonnet 4.5 adopts a “exploit” strategy: fewer search calls (2.51 vs 3.39), more document openings, and three‑times more semantic find usage, achieving higher recall. GPT‑5‑mini favors “explore”: more searches, fewer openings, and a broader query coverage. In the BRIGHT long‑document benchmark, Claude leads in 7 of 8 domains, improving recall@1 by 6.1 percentage points.
Results: Where the 5.9× Gain Comes From
BRIGHT Long‑Document Retrieval
BM25 baseline recall@1: 11.4 %
Qwen embeddings: 27.8 %
Voyage embeddings: 24.5 %
ReDI (inference‑enhanced): 26.0 %
AgenticRAG + GPT‑5‑mini: 43.5 % (5.2× improvement)
AgenticRAG + Claude Sonnet 4.5: 49.6 % (5.9× improvement)
Claude Sonnet 4.5 exceeds the best embedding baseline by 21.8 percentage points, with gains over 30 pp in economics, earth science, and robotics.
Ablation: Single Search vs Full Agent Tools
Single search (raw enterprise stack): recall@1 = 8.41 %
Adding the full agent toolset: 49.59 % (Claude) / 43.49 % (GPT‑5‑mini)
Overall lift: 5.9× (Claude) / 5.2× (GPT‑5‑mini)
The quality gap of the underlying search stack becomes negligible once the agent framework is applied; no new embedding model or re‑ranking training is required.
Enterprise QA (WixQA)
GPT‑5‑mini + AgenticRAG achieves a factuality score of 0.96, a 13 % relative gain over the best baseline (E5 embedding, 0.85). On a simulated query set the gain rises to 22 % (0.94 vs 0.77).
Financial Report QA (FinanceBench)
Across 84 long‑form reports (average 143 pages, 117 K tokens), GPT‑5‑mini + AgenticRAG reaches 92 % accuracy, only 2 pp below the oracle upper bound of 94 %.
Token Cost
BRIGHT queries consume on average 52.3 K tokens, 2.6× the 20.4 K of a single‑search run, but deliver a 5.9× recall boost, yielding a favorable cost‑performance trade‑off. Average tool calls per query stay between 4.48 and 4.79, well below the 15‑round limit.
Key Design Takeaways (Author’s View)
Show document metadata (title, filename, type) with search results to help the model avoid duplicate retrieval.
Expose line numbers so the model can anchor and jump precisely with open.
Retain reference IDs after summarization so the model can continue deep investigation.
Use hybrid routing: cheap, fast RAG for simple queries; AgenticRAG for complex, high‑precision queries.
The paper is already in pre‑production evaluation within Microsoft Copilot Studio, suggesting a short gap between academic results and product deployment.
论文标题: AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
论文链接: https://arxiv.org/abs/2605.05538v1Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
