Porting Karpathy’s AutoResearch to Software Development: Explosive Results

The project adapts Karpathy’s AutoResearch method to software development by using multi‑agent cross‑review, a five‑dimensional weighted scoring system, and feedback‑driven iteration. The result is fully automated issue handling, testing, and PR creation; in the reported example, one issue was resolved in about ten minutes with a 9.0/10 code‑quality score.

High Availability Architecture

This article describes how the AutoResearch methodology originally created by Andrej Karpathy for AI research was transplanted into the software development domain, resulting in a fully automated development pipeline.

Core Idea

The system treats a GitHub issue as the sole human input. A program.md file defines the rules, goals, and constraints. The pipeline then automatically performs issue identification, code generation, testing, review, and merging, requiring human intervention only in rare edge cases.

Three Key Improvements

Multi‑Agent Cross Review: Instead of a single agent self‑reviewing, two agents (Codex and Claude) alternate as implementer and reviewer, exposing blind spots and improving code quality.

Five‑Dimensional Weighted Scoring: Quality is quantified across correctness (35%), tests (25%), code style (20%), security (10%) and performance (10%). A total score ≥ 9.0/10 triggers automatic commit, PR creation, and merge.

Feedback‑Driven Iteration: Review feedback is fed into the next iteration’s prompt, allowing targeted fixes instead of blind retries.
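The feedback‑driven step can be pictured as plain prompt assembly. The sketch below is illustrative only: the function name `build_prompt`, its parameters, and the section labels are assumptions, not the project's actual prompt format.

```python
def build_prompt(issue_text, rules, feedback=()):
    """Assemble the implementer prompt for one round (hypothetical layout).

    `rules` stands in for the contents of program.md; `feedback` is the list
    of reviewer comments from the previous round (empty on the first round).
    """
    sections = [
        "Rules (from program.md):\n" + rules.strip(),
        "Issue:\n" + issue_text.strip(),
    ]
    # Drop blank comments so the agent only sees actionable items.
    items = [item.strip() for item in feedback if item.strip()]
    if items:
        bullets = "\n".join(f"- {item}" for item in items)
        sections.append("Reviewer feedback from the previous round:\n" + bullets)
    return "\n\n".join(sections)
```

On the first round the prompt holds only the rules and the issue; later rounds append the reviewer's comments, which is what turns a blind retry into a targeted fix.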

System Architecture

[Figure: architecture diagram]

Optimization Loop – Four Phases

Phase 1 – Environment Setup: Install dependencies, fetch the issue, create a branch.

Phase 2 – Core Iteration: Agents alternate (odd rounds: Codex implements → Claude reviews; even rounds: Claude implements → Codex reviews), run tests, compute the weighted score, and either proceed to the next round or move to Phase 3.

Phase 3 – Automatic Submission: When the score reaches the threshold, the system commits, pushes, creates a PR, and merges it.

Phase 4 – Archiving: All logs, test results, and iteration details are written to results.tsv and per‑issue log files for traceability.
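The core of Phases 2 and 4 might be sketched as follows. This is a minimal illustration under stated assumptions: `run_round` is a hypothetical callback standing in for the real agent and test invocations, and the TSV column names are invented for the example; the article does not document the actual layout of results.tsv.

```python
import csv
import io

THRESHOLD = 9.0  # pass mark stated in the article


def roles_for_round(round_no):
    """Odd rounds: Codex implements, Claude reviews; even rounds swap."""
    return ("codex", "claude") if round_no % 2 == 1 else ("claude", "codex")


def run_iterations(run_round, max_rounds=10):
    """Run rounds until the score reaches THRESHOLD or max_rounds is hit.

    run_round(round_no, implementer, reviewer, feedback) is a hypothetical
    callback returning (score, new_feedback) for that round.
    """
    history = []
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        implementer, reviewer = roles_for_round(round_no)
        score, feedback = run_round(round_no, implementer, reviewer, feedback)
        history.append({"round": round_no, "implementer": implementer,
                        "reviewer": reviewer, "score": score})
        if score >= THRESHOLD:
            break  # Phase 3 (commit, push, PR, merge) would start here
    return history


def to_tsv(history):
    """Render the iteration history as TSV text (assumed column layout)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["round", "implementer", "reviewer", "score"],
        delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(history)
    return buf.getvalue()
```

Passing the agent call in as a callback keeps the alternation and stopping logic testable without contacting either agent.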

Scoring Details

The scoring table (shown below) maps issue‑level findings to a 0‑10 scale; the weighted sum determines whether the iteration passes.

[Figure: scoring matrix]
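As a rough sketch, the weighted sum can be computed as shown here. The weights and the 9.0 threshold come from the article; the dictionary keys, the function names, and the per‑dimension scores in the usage note are illustrative assumptions.

```python
# Weights as stated in the article; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "tests": 0.25,
    "style": 0.20,
    "security": 0.10,
    "performance": 0.10,
}
THRESHOLD = 9.0


def weighted_score(scores):
    """Combine per-dimension scores (each 0-10) into one 0-10 total."""
    missing = sorted(WEIGHTS.keys() - scores.keys())
    if missing:
        raise ValueError("missing dimensions: " + ", ".join(missing))
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} score must be within 0-10")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


def passes(scores):
    """True when the weighted total reaches the merge threshold."""
    return weighted_score(scores) >= THRESHOLD
```

With made‑up scores of correctness 10, tests 9, and 8 for the remaining three dimensions, the total is 3.5 + 2.25 + 1.6 + 0.8 + 0.8 = 8.95, so the round would not pass even though correctness is perfect.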

Example Iterations

Iteration 1 (Codex): Score 1.0 – implementation incomplete.
Iteration 2 (Claude): Score 5.0 – timeout control added, but still missing parts.
Iteration 3 (Codex): Score 9.0 – all criteria met, PR auto‑merged.

For Issue #21 the whole process took about ten minutes and three iterations, achieving a final score of 9.0/10.

Termination Conditions

Score ≥ 9.0 → commit and merge.

Score < 9.0 → feedback drives the next round.

Test failure → feedback "test failed" triggers a corrective round.

Three consecutive agent failures → abort.

API errors → exponential back‑off with jitter (max 60 s, 10 retries).
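The rules above could be sketched roughly as follows. Several details are assumptions rather than facts from the article: the 1‑second base delay, the "full jitter" variant (a uniform random fraction of the capped delay), the reading of "max 60 s" as a per‑wait cap, and the order in which `next_action` checks its conditions.

```python
import random
import time


def next_action(score, tests_passed, consecutive_agent_failures):
    """Map one round's outcome to the next step (check order is assumed)."""
    if consecutive_agent_failures >= 3:
        return "abort"
    if not tests_passed:
        return "retry: test failed"
    if score >= 9.0:
        return "commit and merge"
    return "retry: apply review feedback"


def call_with_retries(fn, max_retries=10, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on any exception with capped exponential back-off.

    The wait before retry k is rng() * min(cap, base * 2**k), i.e. full
    jitter. After max_retries failed retries the last exception is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Injecting `sleep` and `rng` as parameters is a design choice for testability: a test can record the delays instead of actually waiting.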

Quick Start

# Check prerequisites
gh auth status          # GitHub CLI
which acpx              # Agent control tool
go version              # Go runtime

# Run the pipeline on issue 21 (max 10 iterations)
./run.sh 21 10

The script automatically checks the environment, fetches the issue, creates a branch, runs the alternating Codex/Claude cycles, and merges when the quality threshold is met.
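The environment check at the start of the run can be approximated with a PATH lookup. This is a sketch, not the contents of run.sh: the function name is hypothetical, and it only confirms that the three tools named above are installed, not that `gh` is authenticated.

```python
import shutil

# Tools named in the Quick Start section.
REQUIRED_TOOLS = ("gh", "acpx", "go")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

A caller would stop before fetching the issue if `missing_tools()` returns a non‑empty list, printing the missing names so the user knows what to install.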

Best Practices

Start with small, well‑scoped issues.

Keep program.md up‑to‑date to reflect new constraints.

Monitor the scoring trend in log.md to ensure steady improvement.

Leverage the multi‑agent adversarial setup to catch blind spots.

Rely on the built‑in exponential back‑off for unstable API calls.

Design Inspiration

The project combines three open‑source components:

karpathy/autoresearch – core loop with quantitative improvement criteria.

acpx – command‑line tool that lets Claude and Codex act as agents.

smallnest/imclaw – the target repository where the automated changes are applied.

By combining these tools, the system shows that AI‑driven autonomous development can work on a real‑world Go project, at least for small, well‑scoped issues such as Issue #21.
