Self‑Improving Multi‑Agent RAG System: Architecture, Evaluation, and Human‑Reviewed Prompt Loop

An end‑to‑end multi‑agent Retrieval‑Augmented Generation platform is presented, featuring compositional reasoning, systematic multi‑dimensional evaluation, and a controlled prompt‑improvement loop that automatically identifies weak prompt dimensions, proposes diffs, and requires human approval before deployment, with full observability via SSE and persisted logs.

DeepHub IMBA

May 18, 2026

Retrieval‑Augmented Generation (RAG) has become the mainstream way to connect large language models (LLMs) with external knowledge, but a single‑agent RAG pipeline mixes retrieval quality, reasoning depth, and answer synthesis into one opaque forward call, making evaluation, auditing, and systematic improvement difficult.

This article describes a self‑improving multi‑agent RAG system that closes the evaluation loop: it automatically locates underperforming prompt dimensions, generates targeted rewrite proposals, and subjects them to a human‑approved regression check before deployment. All agent activity is streamed to the client via Server‑Sent Events (SSE) and fully persisted for reproducibility and audit.

System Architecture

The system consists of five Docker services orchestrated with Docker Compose: a FastAPI application server, an ARQ background worker, PostgreSQL 16 with the pgvector extension, Redis, and an Adminer log browser.

Client (HTTP)
│  POST /query/stream
▼
FastAPI (port 9000)
│  enqueue ARQ job
▼
ARQ Worker (async)
    ├── Orchestrator → routing plan (LLM‑generated)
    │       ├── Decomposition Agent (sub‑task DAG)
    │       ├── RAG Agent (multi‑hop pgvector)
    │       ├── Critique Agent (per‑claim confidence scoring)
    │       └── Synthesis Agent (provenance‑mapped final answer)
    ├── Redis Streams → SSE (XADD / XREAD)
    └── PostgreSQL → jobs, eval_runs, agent_logs, prompt_versions

Agents communicate through a JSON‑Schema‑defined SharedContext object; they never call each other directly. The orchestrator mediates every hand‑off, eliminating implicit coupling and making routing plans fully auditable.
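The mediated hand-off described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the `SharedContext` fields, the `run_plan` helper, and the two toy agents are hypothetical names chosen for the example, and a plain dataclass stands in for the JSON-Schema-defined object.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class SharedContext:
    """Hypothetical stand-in for the JSON-Schema-defined context object."""
    query: str
    outputs: dict[str, Any] = field(default_factory=dict)  # agent_id -> output
    handoffs: list[str] = field(default_factory=list)      # audit trail of invocations


Agent = Callable[[SharedContext], Any]


def run_plan(ctx: SharedContext, plan: list[str], agents: dict[str, Agent]) -> SharedContext:
    """Invoke agents in plan order; agents read and write only through ctx."""
    for agent_id in plan:
        ctx.handoffs.append(agent_id)                  # every hand-off is recorded
        ctx.outputs[agent_id] = agents[agent_id](ctx)  # no agent-to-agent calls
    return ctx


# Toy agents (assumptions): each one sees only the shared context.
agents = {
    "rag": lambda ctx: [f"chunk about {ctx.query}"],
    "synthesis": lambda ctx: " ".join(ctx.outputs["rag"]),
}
ctx = run_plan(SharedContext(query="pgvector"), ["rag", "synthesis"], agents)
```

Because the synthesis agent reads `ctx.outputs["rag"]` rather than calling the RAG agent, the `handoffs` list is a complete record of what ran and in which order.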

Agent Design

Orchestrator receives the raw query and a global token budget, makes a single LLM call, and outputs a structured JSON routing plan that specifies which agents to invoke, their order, token allocation, DAG dependencies, and error‑handling strategies (retry, skip, abort). The orchestrator does not answer queries itself, which keeps planning separate from answer generation.
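A routing plan of this kind might be parsed and checked as in the sketch below. The JSON field names (`steps`, `agent`, `token_budget`, `depends_on`, `on_error`) and the `parse_plan` helper are assumptions made for illustration; the article does not show the actual schema.

```python
import json
from dataclasses import dataclass

ON_ERROR = {"retry", "skip", "abort"}  # strategies named in the article


@dataclass(frozen=True)
class Step:
    agent: str
    token_budget: int
    depends_on: tuple[str, ...]
    on_error: str


def parse_plan(raw: str, global_budget: int) -> list[Step]:
    """Parse an LLM-produced plan and reject it if it is inconsistent."""
    steps = [
        Step(s["agent"], int(s["token_budget"]), tuple(s.get("depends_on", [])), s["on_error"])
        for s in json.loads(raw)["steps"]
    ]
    known = {s.agent for s in steps}
    for s in steps:
        if s.on_error not in ON_ERROR:
            raise ValueError(f"{s.agent}: unknown error strategy {s.on_error!r}")
        missing = set(s.depends_on) - known
        if missing:
            raise ValueError(f"{s.agent}: depends on unplanned agents {sorted(missing)}")
    if sum(s.token_budget for s in steps) > global_budget:
        raise ValueError("token allocations exceed the global budget")
    return steps


raw_plan = json.dumps({"steps": [
    {"agent": "rag", "token_budget": 1500, "depends_on": [], "on_error": "retry"},
    {"agent": "synthesis", "token_budget": 1000, "depends_on": ["rag"], "on_error": "abort"},
]})
plan = parse_plan(raw_plan, global_budget=4000)
```

Validating the plan before execution means a malformed LLM response fails fast instead of surfacing as a confusing error halfway through a job.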

RAG Agent performs at least two retrieval hops on a pgvector‑indexed store of 1536‑dimensional embeddings (text-embedding-3-small). The first hop searches with the embedding of the original query; the second hop formulates a follow‑up query from the first‑hop results and searches again. Only chunks that contribute to the final answer are cited, with source chunk IDs and supporting sentences recorded. If fewer than two hops are executed, a PolicyViolation event is logged so that shortcut retrieval is visible in the audit trail.
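The two-hop policy can be sketched independently of the vector store. In the example below, `embed`, `search`, and `make_followup` are injected callables standing in for the embedding model, the pgvector query, and the follow-up query generation; all three names, and the shape of the logged event, are assumptions.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def multi_hop_retrieve(
    query: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector], list[str]],
    make_followup: Callable[[str, list[str]], str],
    events: list[dict],
    max_hops: int = 2,
    min_hops: int = 2,
) -> list[list[str]]:
    """Run up to max_hops searches; log a PolicyViolation if fewer than min_hops ran."""
    hops: list[list[str]] = []
    current = query
    for _ in range(max_hops):
        chunks = search(embed(current))
        hops.append(chunks)
        if not chunks:  # nothing to build a follow-up query from
            break
        current = make_followup(query, chunks)
    if len(hops) < min_hops:
        events.append({"type": "PolicyViolation", "reason": "too_few_hops", "hops": len(hops)})
    return hops


# Toy stand-ins (assumptions): a constant embedding and a fixed search result.
ok_events: list[dict] = []
ok_hops = multi_hop_retrieve(
    "what is pgvector", lambda text: [1.0], lambda vec: ["chunk-1"],
    lambda q, chunks: q + " " + chunks[0], ok_events,
)

# An empty first hop ends retrieval early, which is reported as a violation.
bad_events: list[dict] = []
bad_hops = multi_hop_retrieve(
    "what is pgvector", lambda text: [1.0], lambda vec: [],
    lambda q, chunks: q, bad_events,
)
```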

Decomposition Agent handles ambiguous or multi‑part queries by producing a typed sub‑task DAG: a set of named sub‑tasks with typed outputs and explicit dependencies. The DAG is topologically sorted for cycle detection; independent tasks run in parallel, dependent tasks wait for their predecessors.
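The scheduling behaviour described here maps closely onto Python's standard `graphlib` module. The sketch below is one possible way to do it, not necessarily how the article's system does; the sub-task names are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group sub-tasks into batches; tasks within a batch have no mutual dependencies.

    deps maps each sub-task name to the set of sub-tasks it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # cycle detection happens here
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose predecessors are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical sub-tasks: two independent lookups feed one comparison.
batches = execution_batches({
    "define_term": set(),
    "find_history": set(),
    "compare": {"define_term", "find_history"},
})

try:
    execution_batches({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Each inner list could be dispatched concurrently (for example with `asyncio.gather`), while the outer list preserves dependency order.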

Critique Agent reviews each predecessor’s output stored in SharedContext, assigns a confidence score (0–1) to every claim, and marks disputed text spans with explicit character ranges (start < end). A Pydantic validator rejects any mark whose interval is empty, forcing precise, falsifiable critiques.
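The span constraint can be illustrated with a small validated record. The article uses a Pydantic validator; the sketch below uses a standard-library dataclass instead so that it runs without dependencies, and the `SpanMark` class and its field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanMark:
    """A critique mark on one claim (hypothetical shape)."""
    claim_id: str
    confidence: float  # 0.0 to 1.0
    start: int         # inclusive character offset
    end: int           # exclusive character offset

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")
        if not 0 <= self.start < self.end:
            raise ValueError("span must be a non-empty interval with start < end")


text = "Gradient descent always finds the global minimum."
mark = SpanMark(claim_id="c1", confidence=0.2, start=17, end=48)
disputed = text[mark.start:mark.end]

try:
    SpanMark(claim_id="c2", confidence=0.5, start=10, end=10)  # empty interval
    rejected = False
except ValueError:
    rejected = True
```

Requiring a concrete character range means every critique can be checked against the exact text it disputes.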

Synthesis Agent merges all outputs, uses the Critique marks for contradiction resolution, and produces a provenance‑mapped final answer that links each sentence back to the originating agent and, when applicable, to the source document chunk. Four contradiction‑resolution strategies are supported (illustrated in the accompanying diagram).

Context‑Window Management

A ContextBudgetManager uses the tiktoken cl100k_base tokenizer to track cumulative token consumption per agent and per round. Before an agent appends to its context, it checks the remaining budget. If the assembled context exceeds the declared budget, a compression agent keeps structured data lossless and summarizes free‑form dialogue, and a PolicyViolation event records the overflow amount.
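A reduced version of the budget check might look like this. It is a sketch under stated assumptions: it tracks usage per agent only (not per round), omits the compression step, and counts tokens with a whitespace split so that it runs without dependencies. In the described system the counter would be based on tiktoken, for example `len(tiktoken.get_encoding("cl100k_base").encode(text))`. The method and field names are hypothetical.

```python
from typing import Callable


class ContextBudgetManager:
    """Tracks cumulative token use per agent and records overflows."""

    def __init__(
        self,
        budgets: dict[str, int],
        count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    ) -> None:
        self.budgets = budgets
        self.count_tokens = count_tokens  # swap in a tiktoken-based counter in practice
        self.used = {agent_id: 0 for agent_id in budgets}
        self.violations: list[dict] = []

    def remaining(self, agent_id: str) -> int:
        return self.budgets[agent_id] - self.used[agent_id]

    def try_append(self, agent_id: str, text: str) -> bool:
        """Consume budget if the text fits; otherwise log the overflow and refuse."""
        cost = self.count_tokens(text)
        overflow = cost - self.remaining(agent_id)
        if overflow > 0:
            self.violations.append(
                {"type": "PolicyViolation", "agent_id": agent_id, "overflow_tokens": overflow}
            )
            return False
        self.used[agent_id] += cost
        return True


manager = ContextBudgetManager({"rag": 5})
first = manager.try_append("rag", "one two three")         # 3 tokens, fits
second = manager.try_append("rag", "four five six seven")  # 4 tokens, 2 over budget
```

A caller that receives `False` would hand the context to the compression step before trying again.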

Tool Calls

Four tools are registered with the orchestrator, each defining a failure contract (accepted / rejected). Every tool invocation logs the input hash, output hash, latency, and the agent’s binary accept/reject decision. If the result is insufficient, the agent may retry the tool up to two times, with each retry recorded as a separate tool_retry event.
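The logging and retry behaviour can be sketched as a small wrapper. The helper names, the log record layout, and the use of SHA-256 over canonical JSON for the hashes are assumptions; the article states only which fields are logged.

```python
import hashlib
import json
import time
from typing import Any, Callable


def _digest(value: Any) -> str:
    """Stable hash of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True, default=str).encode()).hexdigest()


def call_with_retries(
    tool: Callable[[Any], Any],
    payload: Any,
    accept: Callable[[Any], bool],
    log: list[dict],
    max_retries: int = 2,
) -> Any:
    """Call the tool once, then retry up to max_retries times while results are rejected."""
    result = None
    for attempt in range(max_retries + 1):
        started = time.perf_counter()
        result = tool(payload)
        decision = "accepted" if accept(result) else "rejected"
        log.append({
            "event": "tool_call" if attempt == 0 else "tool_retry",
            "input_hash": _digest(payload),
            "output_hash": _digest(result),
            "latency_ms": (time.perf_counter() - started) * 1000,
            "decision": decision,
        })
        if decision == "accepted":
            break
    return result


# Toy tool (assumption): returns no hits on the first call and one hit afterwards.
calls = {"n": 0}

def flaky_search(payload: dict) -> dict:
    calls["n"] += 1
    return {"hits": calls["n"] - 1}

tool_log: list[dict] = []
tool_result = call_with_retries(flaky_search, {"q": "pgvector"}, lambda r: r["hits"] > 0, tool_log)
```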

Evaluation Pipeline

The evaluation framework runs 15 test cases through the full pipeline, divided into three categories: baseline (5 cases with known answers), ambiguous (5 cases with vague or multi‑part queries), and adversarial (5 cases with prompt injection, confident false premises, or designed Critique‑Synthesis conflicts). Each case is scored independently on eight dimensions, producing a [0, 1] score and a textual justification. Scoring logic resides in eval_harness.py; no third‑party framework is used.

Reference run (Docker rebuild, cache disabled, eval_run_id: de1dd5ab) yielded an overall score of 0.8083. The worst dimension was wrong_premise_handling with a score of 0.000, indicating the Synthesis Agent accepted confidently false premises without challenge.
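Locating the weakest dimension from per-case scores can be sketched as a simple aggregation. The record layout, the use of a plain mean, and the `citation_accuracy` identifier are assumptions, and the scores below are illustrative values rather than data from the reference run.

```python
from collections import defaultdict
from statistics import mean


def worst_dimension(records: list[dict]) -> tuple[tuple[str, str], float]:
    """Return the (dimension, agent_id) pair with the lowest mean score."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for record in records:
        grouped[(record["dimension"], record["agent_id"])].append(record["score"])
    averages = {key: mean(scores) for key, scores in grouped.items()}
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


# Illustrative records only (not taken from the reference run).
records = [
    {"case_id": "adv-1", "dimension": "wrong_premise_handling", "agent_id": "synthesis", "score": 0.0},
    {"case_id": "adv-2", "dimension": "wrong_premise_handling", "agent_id": "synthesis", "score": 0.0},
    {"case_id": "base-1", "dimension": "citation_accuracy", "agent_id": "rag", "score": 0.5},
    {"case_id": "base-2", "dimension": "citation_accuracy", "agent_id": "rag", "score": 1.0},
]
worst_key, worst_score = worst_dimension(records)
```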

Self‑Improving Prompt Loop

The loop consists of five timestamped stages persisted to the database:

POST /eval/run
    │  failure_analyzer identifies worst (dimension, agent_id)
    ▼
    prompt_rewrite_proposer generates structured diff + justification
    ▼  stored as prompt_rewrite (status: pending)
POST /eval/prompt-rewrite/{id}/approve
    ├── regression check: for each dimension where baseline ≥ 0.70
    │   if candidate_score < baseline_score - 0.05 → 409 REGRESSION_DETECTED
    ├── force: true → override regression block
    └── approved → status: approved
POST /eval/retry-failed?rewrite_id={id}
    ▼
    targeted_eval re‑runs only previously failed cases
    ▼
    delta_scores JSONB stored in rewrite_approvals

Regression blocking uses the most recent earlier evaluation run as a baseline (SQL query shown below). If a dimension that scored ≥ 0.70 in the baseline drops by more than 0.05, the approval request is rejected with 409 REGRESSION_DETECTED unless it includes force: true, which is recorded as an explicit human decision.

SELECT id FROM eval_runs
WHERE id != CAST(:id AS UUID)
  AND triggered_at < (
      SELECT triggered_at FROM eval_runs WHERE id = CAST(:id AS UUID)
  )
ORDER BY triggered_at DESC LIMIT 1
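The approval rule from the flow above (block when a dimension with baseline ≥ 0.70 has candidate_score < baseline_score - 0.05, unless force is set) can be sketched as a pure function. The function names and the shape of the returned dictionary are assumptions; only the thresholds, the 409 status, and the REGRESSION_DETECTED code come from the article. Dimensions missing from the candidate run are skipped here, which is also an assumption.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    floor: float = 0.70,
    tolerance: float = 0.05,
) -> list[str]:
    """Dimensions that passed in the baseline and dropped by more than the tolerance."""
    return sorted(
        dim for dim, base in baseline.items()
        if base >= floor and dim in candidate and candidate[dim] < base - tolerance
    )


def approve(
    baseline: Optional[dict[str, float]],
    candidate: dict[str, float],
    force: bool = False,
) -> dict:
    """Return an approval decision; with no baseline there is nothing to regress against."""
    regressions = find_regressions(baseline, candidate) if baseline else []
    if regressions and not force:
        return {"status_code": 409, "error": "REGRESSION_DETECTED", "dimensions": regressions}
    return {"status_code": 200, "status": "approved", "forced": bool(regressions), "dimensions": regressions}


baseline_scores = {"citation_accuracy": 0.80, "wrong_premise_handling": 0.00}
candidate_scores = {"citation_accuracy": 0.70, "wrong_premise_handling": 0.60}

blocked = approve(baseline_scores, candidate_scores)
forced = approve(baseline_scores, candidate_scores, force=True)
first_run = approve(None, candidate_scores)
```

In this toy data the improvement in wrong_premise_handling does not offset the drop in citation_accuracy, so the rewrite is blocked until a reviewer forces it, and the forced approval is marked as such.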

In the end‑to‑end example query “What is machine learning?” the orchestrator skipped the Critique Agent (low ambiguity), executed Decomposition → RAG → Synthesis, streamed 283 SSE events, and logged nine records across budget_declared, budget_consumed, and routing event types. The evaluation identified the synthesis agent’s wrong_premise_handling as the worst dimension, prompting a rewrite that was approved by a human reviewer (no regression baseline existed, so it passed unconditionally). A targeted re‑evaluation of the six failing cases confirmed the improvement.

Observability and Streaming

All agent activity is written to Redis Streams (XADD/XREAD) and forwarded to clients via SSE. The SSE schema includes events such as agent_token, tool_call, and job_complete, each carrying timestamps, agent IDs, token counts, latency, and optional policy_violation fields. Because events are persisted before streaming, workers can restart without losing history, and late‑joining clients can replay the full sequence from the stream start.
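Replay for late-joining clients can be illustrated without a running Redis instance. In the sketch below an in-memory list stands in for the Redis Stream, entry IDs follow the Redis `milliseconds-sequence` form, and each entry is rendered in the standard SSE wire format (`id`, `event`, and `data` fields followed by a blank line). The tuple layout and helper names are assumptions.

```python
import json
from typing import Iterator


def format_sse(event_id: str, event_type: str, payload: dict) -> str:
    """Render one event as a Server-Sent Events frame."""
    return f"id: {event_id}\nevent: {event_type}\ndata: {json.dumps(payload)}\n\n"


def _id_key(event_id: str) -> tuple[int, int]:
    millis, seq = event_id.split("-")
    return int(millis), int(seq)


def replay(stream: list[tuple[str, str, dict]], last_id: str = "0-0") -> Iterator[str]:
    """Yield SSE frames for every entry after last_id; "0-0" replays from the start."""
    for event_id, event_type, payload in stream:
        if _id_key(event_id) > _id_key(last_id):
            yield format_sse(event_id, event_type, payload)


# In-memory stand-in for a persisted stream (assumption).
stream = [
    ("1-0", "agent_token", {"agent_id": "rag", "token": "Hi"}),
    ("1-1", "tool_call", {"agent_id": "rag", "latency_ms": 12}),
    ("2-0", "job_complete", {"status": "ok"}),
]
full_replay = list(replay(stream))
resumed = list(replay(stream, last_id="1-1"))
```

Against a real stream, the same resume point would be passed as the starting ID of an XREAD call, and a reconnecting browser client can supply it through the SSE Last-Event-ID header.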

Known Limitations and Future Work

Wrong premise handling is the weakest dimension; the Synthesis Agent does not challenge false premises unless the Critique Agent flags them, and the Critique Agent currently only reviews other agents’ outputs, not the original query.

Citation accuracy (0.667) shows that the RAG Agent sometimes cites relevant but non‑supporting chunks; adding a re‑ranking step before citation could improve this.

The evaluation set is static and small (15 cases), which limits coverage of real‑world query distributions.

No authentication or rate limiting; all endpoints are publicly accessible.

High‑value extensions include feeding low‑confidence production queries back into the evaluation suite to grow a living test set, and implementing cross‑dimensional regression analysis so that improvements in one dimension do not inadvertently degrade another.

Conclusion

The presented multi‑agent RAG system separates retrieval, decomposition, critique, and synthesis into dedicated agents coordinated by an orchestrator. It implements a closed‑loop self‑improvement process in which evaluation pinpoints weak prompt dimensions, a meta‑agent proposes diffs, and a regression‑blocking approval step stops changes that regress previously passing dimensions unless a human reviewer explicitly overrides the block. Observability comes from SSE streaming backed by Redis Streams and from logs persisted in PostgreSQL, enabling reproducible audits and debugging.

Design choices such as ARQ over Celery, pgvector over a separate vector store, and Redis Streams over in‑process queues reflect a preference for minimal service count while preserving production‑grade properties like persistence, observability, and reproducibility.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
