Artificial Intelligence 12 min read

AgenticRAG Delivers 5.9× Recall Boost in Enterprise Retrieval – Real‑World Pre‑Production Results

The article analyzes Microsoft’s AgenticRAG, a tool‑based RAG framework that lets LLMs control retrieval, showing up to a 5.9× recall improvement over standard methods, reduced need for fine‑tuning, and practical design insights from pre‑production deployment.

PaperAgent

May 28, 2026

AgenticRAG Delivers 5.9× Recall Boost in Enterprise Retrieval – Real‑World Pre‑Production Results

Fundamental Issue of Standard RAG

Standard RAG assumes retrieval is completed before the LLM starts reasoning, so the model receives a fixed set of documents and cannot ask to search again or explore further. This works for simple fact queries but fails for enterprise queries that are highly contextual and whose answers are scattered across many long documents.

Giving Retrieval Control to the LLM

AgenticRAG introduces a lightweight agent framework with four tools—search, find, open, summarize—allowing the LLM to decide what to retrieve, where to look, and when to compress context. The system runs in a bounded iterative loop (default max 15 rounds) and stops when the model outputs an answer or the iteration limit is reached.

Four Tools

search – delegates to an enterprise search stack (e.g., Azure AI Search), issues up to five parallel query rewrites, returns snippets with metadata and a unique reference ID.

find – precise in‑document search given a reference ID and keyword patterns, returns up to two matching paragraphs (≈11 K tokens).

open – scrollable window reading, returns a fixed 1 800‑line window with line numbers so the model can jump to any part of a long document.

summarize – triggered when the token budget (≈128 K) is exhausted; the model marks reference IDs to keep, and the system discards unreferenced tool outputs, preserving useful context.

Inference Loop

Each round the LLM sees the conversation history and tool schema, chooses either to call a tool (appending the result) or to produce the final answer. Termination occurs when the model outputs a response or the maximum number of iterations is reached.

Method Details

Using Search Results

The search tool returns only snippets; the model must decide which documents merit deeper inspection using either find (when the target content is known) or open (when the location is known).

Parallel Query Rewrites

Up to five query rewrites can be issued in a single tool call. Ablation shows negligible impact on recall (44.84 % vs 49.59 %) but reduces average tool calls from 6.79 to 4.79 (‑29 %).

Context Management

Each tool call can load about 11 K tokens. When the 128 K window is 90 % full, an internal warning is issued; at 100 % the summarize tool is forced.

Strategy Differences Between Claude and GPT‑5‑mini

Claude Sonnet 4.5 adopts a “exploit” strategy: fewer search calls (2.51 vs 3.39), more document openings, and three‑times more semantic find usage, achieving higher recall. GPT‑5‑mini favors “explore”: more searches, fewer openings, and a broader query coverage. In the BRIGHT long‑document benchmark, Claude leads in 7 of 8 domains, improving recall@1 by 6.1 percentage points.

Results: Where the 5.9× Gain Comes From

BRIGHT Long‑Document Retrieval

BM25 baseline recall@1: 11.4 %

Qwen embeddings: 27.8 %

Voyage embeddings: 24.5 %

ReDI (inference‑enhanced): 26.0 %

AgenticRAG + GPT‑5‑mini: 43.5 % (5.2× improvement)

AgenticRAG + Claude Sonnet 4.5: 49.6 % (5.9× improvement)

Claude Sonnet 4.5 exceeds the best embedding baseline by 21.8 percentage points, with gains over 30 pp in economics, earth science, and robotics.

Ablation: Single Search vs Full Agent Tools

Single search (raw enterprise stack): recall@1 = 8.41 %

Adding the full agent toolset: 49.59 % (Claude) / 43.49 % (GPT‑5‑mini)

Overall lift: 5.9× (Claude) / 5.2× (GPT‑5‑mini)

The quality gap of the underlying search stack becomes negligible once the agent framework is applied; no new embedding model or re‑ranking training is required.

Enterprise QA (WixQA)

GPT‑5‑mini + AgenticRAG achieves a factuality score of 0.96, a 13 % relative gain over the best baseline (E5 embedding, 0.85). On a simulated query set the gain rises to 22 % (0.94 vs 0.77).

Financial Report QA (FinanceBench)

Across 84 long‑form reports (average 143 pages, 117 K tokens), GPT‑5‑mini + AgenticRAG reaches 92 % accuracy, only 2 pp below the oracle upper bound of 94 %.

Token Cost

BRIGHT queries consume on average 52.3 K tokens, 2.6× the 20.4 K of a single‑search run, but deliver a 5.9× recall boost, yielding a favorable cost‑performance trade‑off. Average tool calls per query stay between 4.48 and 4.79, well below the 15‑round limit.

Key Design Takeaways (Author’s View)

Show document metadata (title, filename, type) with search results to help the model avoid duplicate retrieval.

Expose line numbers so the model can anchor and jump precisely with open.

Retain reference IDs after summarization so the model can continue deep investigation.

Use hybrid routing: cheap, fast RAG for simple queries; AgenticRAG for complex, high‑precision queries.

The paper is already in pre‑production evaluation within Microsoft Copilot Studio, suggesting a short gap between academic results and product deployment.

论文标题: AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
论文链接: https://arxiv.org/abs/2605.05538v1

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM recall Retrieval-Augmented Generation Claude enterprise search AgenticRAG GPT-5-mini

Written by

PaperAgent

Daily updates, analyzing cutting-edge AI research papers

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.