Harvard’s AutoScientists Lets AI Agents Self‑Organize Research Teams and Outperform Traditional AI Agents

AutoScientists, a Harvard‑built system in which nine AI agents self‑organize through a shared state without a central commander, achieves a 74.4% average leaderboard percentile on BioML‑Bench, reaches a GPT training target in 1.9× fewer experiments, and raises Spearman correlation on a ProteinGym binding assay by 12.5%. Ablation studies show that each of its four core mechanisms contributes to these results.

SuanNi

Scientific research is rarely linear; it resembles a branching forest where multiple paths must be explored simultaneously. Harvard’s AutoScientists system creates a group of AI agents that read a shared state, decide which directions to pursue, form teams, discuss proposals, run experiments, and reconvene to update the shared state—without any central commander.

A Team Without a Commander

Traditional AI research agents fall into two camps: single‑agent approaches like AIDE and Autoresearch that iterate along one search path, and multi‑agent systems that still rely on a planner or voting mechanism to assign tasks. In long‑term scientific experiments, valuable directions evolve with results, making a fixed planner ineffective. AutoScientists replaces the commander with a decentralized process: nine agents read a shared state, independently choose promising directions, self‑organize into teams, and execute experiments in parallel.

The system alternates between two phases. During the discussion phase, all agents analyze the task, propose candidate directions, critique each other, and form several teams, each responsible for one direction. In the execution phase, teams run experiments in parallel and write results back to the shared state. When a direction stalls, agents return to discussion, where teams may split, merge, or open new routes.
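The team-forming step of the discussion phase can be pictured with a minimal sketch. It assumes each agent has already settled on a direction; the agent names and the `form_teams` helper are hypothetical and are not taken from AutoScientists itself.

```python
from collections import defaultdict


def form_teams(choices: dict[str, str]) -> dict[str, list[str]]:
    """Group agents into teams by the direction each one picked on its own.

    `choices` maps a (hypothetical) agent name to its chosen direction.
    No planner assigns anything: a team is simply the set of agents
    that converged on the same direction.
    """
    teams = defaultdict(list)
    for agent in sorted(choices):  # sorted only to make the output deterministic
        teams[choices[agent]].append(agent)
    return dict(teams)


if __name__ == "__main__":
    # Illustrative picks; the direction names echo the GPT training track.
    picks = {
        "agent_1": "architecture",
        "agent_2": "optimizer",
        "agent_3": "architecture",
        "agent_4": "learning-rate scheduling",
    }
    for direction, members in form_teams(picks).items():
        print(direction, members)
```

Splitting or merging teams after a stall would then amount to agents changing their entry in `choices` and calling the function again.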

The shared state has four layers: Champion (records the current best model and reproducibility instructions), Experiment Log (captures results, metric changes, and training details), Forum (a structured discussion board where agents debate proposals and share analyses), and per‑team queues plus a Dead‑end Registry (records failed directions for other teams to read).
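One way to picture these layers is as a small data structure. The sketch below is an assumption about shape only: the class names, field names, and types are hypothetical, and the article does not describe how AutoScientists actually stores its state.

```python
from dataclasses import dataclass, field


@dataclass
class Champion:
    """Current best model plus what is needed to reproduce it."""
    program: str             # hypothetical identifier of the best program
    metric: float            # its score on the task metric
    repro_instructions: str  # how to reproduce the result


@dataclass
class SharedState:
    """Hypothetical container mirroring the layers described above."""
    champion: Champion
    experiment_log: list[dict] = field(default_factory=list)  # results, metric changes, training details
    forum: list[dict] = field(default_factory=list)           # proposals, critiques, analyses
    team_queues: dict[str, list[str]] = field(default_factory=dict)  # pending tasks per team
    dead_ends: set[str] = field(default_factory=set)          # failed directions

    def record_dead_end(self, direction: str) -> None:
        """Mark a direction as failed so other teams can skip it."""
        self.dead_ends.add(direction)

    def is_open(self, direction: str) -> bool:
        """True if no team has registered this direction as a dead end."""
        return direction not in self.dead_ends
```

Keeping the Dead‑end Registry as a set makes the "has anyone already ruled this out?" check a constant-time lookup, which is the property the registry needs when several teams consult it.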

Each agent runs on a heartbeat cycle: read the shared state, act according to its role, and write back results.
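The heartbeat can be sketched as a single function. This is an illustrative sketch, not the system's code: the shared state is reduced to a plain dictionary, and `heartbeat` and `analyst_action` are hypothetical names.

```python
import copy
from typing import Callable


def heartbeat(state: dict, role_action: Callable[[dict], dict]) -> dict:
    """Run one cycle: read the shared state, act by role, write back.

    `state` is a plain dict standing in for the shared state; `role_action`
    is the role-specific behaviour (analyst or experimenter).
    """
    snapshot = copy.deepcopy(state)             # read: the agent works on a private copy
    result = role_action(snapshot)              # act according to the agent's role
    state.setdefault("log", []).append(result)  # write the outcome back
    return state


def analyst_action(snapshot: dict) -> dict:
    """Toy analyst: report how many log entries it saw on this cycle."""
    return {"role": "analyst", "entries_seen": len(snapshot.get("log", []))}


if __name__ == "__main__":
    shared = {"log": []}
    for _ in range(3):
        heartbeat(shared, analyst_action)
    print(shared["log"])
```

Because every agent repeats the same read, act, write loop against the same state, coordination comes from what is written there rather than from messages sent by a controller.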

There are two agent roles. Analyst agents maintain search knowledge, review the experiment log for unexplored directions, and submit proposals prioritized by observed effect size; directions that persistently show low impact are down‑weighted. Experiment agents pull tasks from the queue, apply code changes to the current champion program, run training, and record outcomes. Because evaluation metrics can fluctuate, an improvement that falls within the noise band is confirmed with a second random seed before the candidate is promoted to champion.
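The promotion rule can be sketched as follows. The article states only that improvements within a noise band are re-checked with a second seed; the acceptance criterion used here (the second run must also beat the champion), the higher-is-better convention, and all names are assumptions made for illustration.

```python
from typing import Callable


def should_promote(
    candidate_score: float,
    champion_score: float,
    noise_band: float,
    rerun: Callable[[int], float],
    second_seed: int = 1,
) -> bool:
    """Decide whether a candidate replaces the champion (higher is better).

    `rerun(seed)` is a hypothetical callback that retrains and re-evaluates
    the candidate with the given random seed and returns its score.
    """
    gain = candidate_score - champion_score
    if gain <= 0:
        return False  # no improvement at all
    if gain > noise_band:
        return True   # clearly outside the noise band
    # Improvement is within the noise band: confirm with a second seed.
    # Assumed criterion: the second run must also beat the champion.
    return rerun(second_seed) > champion_score


if __name__ == "__main__":
    # A small gain that fails to hold up under a second seed is rejected.
    print(should_promote(0.81, 0.80, 0.02, rerun=lambda seed: 0.79))
```

For a lower-is-better metric such as bits-per-byte, the same logic applies with the signs of the comparisons flipped.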

Leading on Three Tracks

AutoScientists was evaluated on three distinct scientific tracks and outperformed the previous strongest AI agents on all of them. On BioML‑Bench (24 end‑to‑end biomedical tasks), it achieved a 74.4% average leaderboard percentile, an 8.33‑point gain over Autoresearch’s 66.07%. The biggest jump was in drug discovery, rising from 47.91% to 64.52%.

In GPT training optimization (GPT‑nanochat), AutoScientists reached the target bits‑per‑byte loss in 34 experiments, 1.9× faster than Autoresearch’s 65 experiments. The speedup stemmed from three parallel teams focusing on architecture, learning‑rate scheduling, and the optimizer, which let the search advance in several directions at once.

For ProteinGym fitness prediction, starting from the strong supervised baseline Kermut, AutoScientists devised a three‑Gaussian‑process ensemble that combined Kermut’s structural kernel, extended zero‑shot features, diversity‑driven greedy feature selection, and quantile‑transformed targets. On the ACE2‑Spike binding assay, Spearman correlation rose from 0.747 to 0.840 (+12.5%). The same formulation, frozen and applied to all 217 ProteinGym assays, lifted average Spearman from 0.657 to 0.700 (+6.5%).
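The headline figures for the three tracks can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the rounded values quoted in this article, so the recomputed percentages can differ from the reported ones in the last digit.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute difference, for metrics already expressed in percentage points."""
    return after - before


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (after - before) / before * 100.0


def speedup(baseline_runs: int, new_runs: int) -> float:
    """How many times fewer experiments were needed to reach the target."""
    return baseline_runs / new_runs


if __name__ == "__main__":
    # BioML-Bench: average leaderboard percentile, Autoresearch vs AutoScientists
    print(f"BioML-Bench gain: {point_gain(66.07, 74.4):.2f} points")
    # GPT-nanochat: experiments needed to reach the target loss
    print(f"GPT training speedup: {speedup(65, 34):.2f}x")
    # ProteinGym: Spearman on the ACE2-Spike assay and across all 217 assays
    print(f"ACE2-Spike relative gain: {relative_gain_pct(0.747, 0.840):.2f}%")
    print(f"All-assay relative gain: {relative_gain_pct(0.657, 0.700):.2f}%")
```

Note that the BioML‑Bench gain is a difference in percentile points, while the ProteinGym gains are relative improvements over the Kermut baseline, so the two kinds of numbers are not directly comparable.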

Every Component Matters

Ablation experiments show that removing any of the four core mechanisms degrades performance, though the impact varies by task. Without the Analyst, TDC‑hERG AUROC drops from 0.867 to 0.738 (leaderboard percentile from 85.7% to 14.3%). Without cross‑agent feedback, plasma‑protein Pearson correlation falls from 0.873 to 0.714. Without self‑organization, GPT training bits‑per‑byte worsens from 0.9777 to 0.9833 (lower is better). Removing the shared state causes the cell‑communication odds ratio to fall from 0.924 to 0.435. The components complement each other: the Analyst improves proposal quality, feedback fills information gaps, self‑organization adapts to shifting search directions, and shared records prevent duplicated effort.

Figure 5 (illustrated in the original article) visualizes emergent collaborative behaviors during long‑cycle searches, such as diverse agent proposals, identification of saturated directions, cross‑team hypothesis transfers, and dead‑end exits, with real agent discussion excerpts.

Limitations

AutoScientists consumes more LLM tokens than Autoresearch because multiple agents reason, discuss, and reorganize teams concurrently. Its design goal is not to reduce API calls but to achieve better search quality under a fixed compute budget.

In the BioML‑Bench evaluation, each task received a single H100 GPU, limiting parallel GPU‑intensive experiments; thus AutoScientists’ parallelism was not fully exploited. Additionally, the number of agents is fixed before execution; future work may adapt team size dynamically based on task difficulty.

Overall, AutoScientists demonstrates that a decentralized, self‑organizing AI agent team can outperform both single‑agent and centrally‑coordinated approaches across diverse scientific domains.
