Anthropic’s Multi‑Agent Research System: Architecture, Lessons & 90% Performance Boost
Anthropic’s detailed post explains how its new Research feature uses a multi‑agent architecture with a lead coordinator and parallel sub‑agents, covering design principles, prompt engineering tricks, evaluation methods, production reliability challenges, and the substantial performance gains achieved over single‑agent baselines.
Anthropic published a practical experience article titled “How we built our multi‑agent research system,” sharing engineering challenges and lessons learned while developing the “Research” feature, a multi‑agent system.
Claude now has research capabilities: it can search the web, Google Workspace, and any integrated tools to complete complex tasks.
The system’s journey from prototype to production yielded key insights about system architecture, tool design, and prompt engineering. A lead agent acts as a project manager, planning and decomposing tasks, while multiple sub‑agents perform information retrieval and analysis in parallel, mimicking how a team of experts collaborates.
1. Advantages of Multi‑Agent Systems
Research tasks are open‑ended and unpredictable, requiring flexible, iterative approaches that single‑pass pipelines cannot provide. Parallel sub‑agents compress large corpora by exploring different aspects simultaneously, giving each agent its own focused context and reducing path dependency.
Performance tests showed that a multi‑agent setup (Claude Opus 4 as lead with Claude Sonnet 4 sub‑agents) outperformed a single Claude Opus 4 agent by 90.2% on tasks such as identifying all board members of S&P 500 IT companies.
Token usage is far higher in multi‑agent runs: a single agent uses roughly 4× the tokens of a chat interaction, and the full multi‑agent system roughly 15×, so the task must be valuable enough to justify the cost.
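Those multipliers imply a simple break‑even calculation. A minimal sketch, using the 4× and 15× figures from the post; the baseline token count and price are hypothetical placeholders:

```python
# Rough cost comparison using the multipliers reported in the post:
# an agent run uses ~4x the tokens of a chat turn, and the full
# multi-agent system ~15x. Baseline tokens and price are hypothetical.
CHAT_TOKENS = 10_000      # assumed tokens in a single chat interaction
PRICE_PER_MTOK = 15.0     # hypothetical blended $ per million tokens

def run_cost(multiplier: float) -> float:
    """Dollar cost of one run at the given token multiplier."""
    return CHAT_TOKENS * multiplier * PRICE_PER_MTOK / 1_000_000

chat = run_cost(1)    # single chat turn
agent = run_cost(4)   # single-agent run (~4x)
multi = run_cost(15)  # multi-agent run (~15x)
```

A multi‑agent run costs 15× a chat turn here, which is why the post reserves the architecture for high‑value tasks.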
2. Research Architecture Overview
The system uses an orchestrator‑worker pattern: the lead agent coordinates the process and delegates to parallel specialized sub‑agents.
When a user submits a query, the lead agent analyzes it, creates a strategy, and spawns sub‑agents to explore different facets. Sub‑agents iteratively use search tools, filter results, and return concise information to the lead, which compiles the final answer.
Unlike static RAG, this architecture performs multi‑step, dynamic searches that adapt to new findings.
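The difference from static RAG is the loop: each search result informs the next query. A sketch of that loop, where `search` and `assess` are injected so the stubs below can stand in for a real search tool and model call:

```python
# Sketch of a sub-agent's multi-step search loop, in contrast to a
# single-pass RAG retrieval. `search` and `assess` are hypothetical
# stand-ins for a search tool and a model call.
def iterative_research(question, search, assess, max_steps=5):
    query, findings = question, []
    for _ in range(max_steps):
        findings += search(query)             # one search-tool call
        verdict = assess(question, findings)  # model judges the evidence so far
        if verdict["sufficient"]:
            break
        query = verdict["refined_query"]      # adapt the next query to findings
    return findings

# Stub tool and judge, purely for illustration.
stub_search = lambda q: [f"result for {q}"]
stub_assess = lambda q, f: {"sufficient": len(f) >= 2, "refined_query": "follow-up"}
findings = iterative_research("initial question", stub_search, stub_assess)
```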
The lead agent aggregates sub‑agent outputs, decides if further research is needed, and finally passes the compiled findings to a citation agent that adds proper references before returning the result to the user.
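The overall flow can be sketched as an orchestrator‑worker loop. This is a minimal illustration under the assumption that each role (planning, sub‑agent research, follow‑up decisions, citation) is a separate model call; all helper names here are hypothetical:

```python
# Minimal orchestrator-worker sketch of the flow described above: the
# lead step decomposes the query, sub-agent tasks run in parallel, and
# a final citation pass annotates the report.
from concurrent.futures import ThreadPoolExecutor

def research(query, plan, run_subagent, follow_up, add_citations):
    subtasks, findings = plan(query), []
    while subtasks:
        with ThreadPoolExecutor() as pool:     # sub-agents explore facets in parallel
            findings += list(pool.map(run_subagent, subtasks))
        subtasks = follow_up(query, findings)  # lead decides whether more is needed
    return add_citations(findings)             # citation agent adds references

# Stub demo of the control flow.
report = research(
    "Who leads the top cloud providers?",
    plan=lambda q: ["AWS leadership", "Azure leadership"],
    run_subagent=lambda t: f"findings: {t}",
    follow_up=lambda q, f: [],                 # one round suffices in this demo
    add_citations=lambda f: " [1] ".join(f) + " [2]",
)
```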
3. Prompt Engineering and Evaluation
Key prompt principles include:
Think like your agent.
Teach the orchestrator how to delegate.
Scale workload to query complexity.
Design and choose tools carefully.
Enable agents to improve themselves.
Start broad, then narrow.
Guide the reasoning process.
Parallel tool calls boost speed and performance.
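The "scale workload to query complexity" principle can be made concrete by giving the orchestrator an explicit effort budget per complexity tier. The tiers and numbers below are illustrative, not Anthropic's exact values:

```python
# One way to encode "scale workload to query complexity" as explicit
# delegation guidance for the lead agent. Budgets are illustrative.
def effort_budget(complexity: str) -> dict:
    budgets = {
        "simple":   {"subagents": 1,  "tool_calls_each": 5},   # fact lookup
        "moderate": {"subagents": 3,  "tool_calls_each": 12},  # comparisons
        "complex":  {"subagents": 10, "tool_calls_each": 15},  # open-ended research
    }
    return budgets[complexity]
```

Making the budget explicit keeps the lead agent from over‑spawning sub‑agents on trivial queries or under‑resourcing hard ones.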
Evaluation must be flexible: multi‑agent systems can reach the correct result via many different paths. Small‑scale sample tests (≈20 queries) are effective for early iteration, while LLM‑as‑judge scoring provides scalable assessment of factual accuracy, citation quality, completeness, source quality, and tool efficiency.
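An LLM‑as‑judge scorer along those rubric dimensions might look like the sketch below, where `call_model` is a hypothetical wrapper around any LLM API that returns the judge model's text completion:

```python
# LLM-as-judge sketch scoring along the rubric dimensions named above.
import json

RUBRIC = ["factual_accuracy", "citation_quality", "completeness",
          "source_quality", "tool_efficiency"]

def judge(question, answer, call_model):
    prompt = (
        "Score the answer on each criterion from 0.0 to 1.0.\n"
        f"Criteria: {', '.join(RUBRIC)}\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a JSON object mapping each criterion to a score."
    )
    scores = json.loads(call_model(prompt))      # judge model returns JSON
    return sum(scores[k] for k in RUBRIC) / len(RUBRIC)

# Stubbed judge response, for illustration only.
overall = judge("q", "a", lambda p: json.dumps({k: 0.8 for k in RUBRIC}))
```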
4. Production Reliability and Engineering Challenges
Agents are stateful and their errors compound, making debugging harder than in traditional software. Robust observability, checkpointing, and “rainbow deployments” (gradual traffic shifting) are essential to avoid disruptions.
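The checkpointing idea is that a crash resumes mid‑task rather than discarding all prior work. A minimal sketch, with a JSON file standing in for a real durable store:

```python
# Checkpointing sketch: persist agent state after each step so a
# failure resumes from the last completed step.
import json, os, tempfile

def run_with_checkpoints(steps, path):
    state = {"done": 0, "results": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)        # resume from the saved checkpoint
    for i in range(state["done"], len(steps)):
        state["results"].append(steps[i]())
        state["done"] = i + 1
        with open(path, "w") as f:
            json.dump(state, f)         # durable after every step
    return state["results"]

# Demo: the second call finds the checkpoint and re-runs nothing.
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_with_checkpoints([lambda: "step-1", lambda: "step-2"], ckpt)
resumed = run_with_checkpoints([lambda: "never-runs"] * 2, ckpt)
```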
Synchronous execution creates bottlenecks; moving to asynchronous parallelism can further improve throughput but adds coordination complexity.
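The throughput gain from asynchrony comes from overlapping I/O waits. A sketch with `asyncio`, where the sleep stands in for a model or tool round‑trip:

```python
# Sync vs async sub-agent execution: asyncio.gather lets independent
# sub-agent calls overlap their I/O waits instead of queuing.
import asyncio

async def sub_agent(task: str) -> str:
    await asyncio.sleep(0.1)  # stands in for a model/tool round-trip
    return f"findings for {task}"

async def run_parallel(tasks):
    # Wall time ~= the slowest task, not the sum of all tasks.
    return await asyncio.gather(*(sub_agent(t) for t in tasks))

results = asyncio.run(run_parallel(["funding", "competitors", "hiring"]))
```

The coordination complexity the post mentions shows up once results arrive out of order or a sub‑agent fails while its siblings succeed.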
Conclusion
Building reliable AI agents requires extensive engineering, thorough testing, careful prompt and tool design, strong operational practices, and close collaboration between research, product, and engineering teams. Despite higher token costs, multi‑agent systems have proven valuable for open‑ended research tasks, delivering faster, more comprehensive results and enabling users to uncover insights they could not find alone.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.