Porting Karpathy’s AutoResearch to Software Development: Explosive Results
The project adapts Karpathy’s AutoResearch method to software development by using multi‑agent cross‑review, a five‑dimensional weighted scoring system, and feedback‑driven iteration. In the documented run on Issue #21, the pipeline handled the issue, ran the tests, and created and merged a PR in about ten minutes over three iterations, finishing with a code‑quality score of 9.0/10.
This article describes how the AutoResearch methodology originally created by Andrej Karpathy for AI research was transplanted into the software development domain, resulting in a fully automated development pipeline.
Core Idea
The system treats a GitHub issue as the sole human input. A program.md file defines the rules, goals, and constraints. The pipeline then automatically performs issue identification, code generation, testing, review, and merging, requiring human intervention only in rare edge cases.
Three Key Improvements
Multi‑Agent Cross Review: Instead of a single agent self‑reviewing, two agents (Codex and Claude) alternate as implementer and reviewer, exposing blind spots and improving code quality.
Five‑Dimensional Weighted Scoring: Quality is quantified across correctness (35%), tests (25%), code style (20%), security (10%) and performance (10%). A total score ≥ 9.0/10 triggers automatic commit, PR creation, and merge.
Feedback‑Driven Iteration: Review feedback is fed into the next iteration’s prompt, allowing targeted fixes instead of blind retries.
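To make feedback‑driven iteration concrete, here is a minimal sketch of how a round's prompt might be assembled. The function name build_prompt, the section headings, and the closing instruction are all hypothetical; the article does not show the project's actual prompt layout.

```python
from typing import Optional


def build_prompt(issue_text: str, rules: str, feedback: Optional[str] = None) -> str:
    """Assemble the implementer prompt for one round (hypothetical layout)."""
    sections = [
        "## Rules (from program.md)\n" + rules.strip(),
        "## Issue\n" + issue_text.strip(),
    ]
    if feedback:
        # Only rounds after the first carry reviewer feedback forward.
        sections.append("## Reviewer feedback from the previous round\n" + feedback.strip())
        sections.append("Address every feedback item above before making other changes.")
    return "\n\n".join(sections)
```

The first round calls build_prompt without feedback; every later round passes the previous reviewer's comments, so the implementer works on named defects rather than retrying from scratch.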
System Architecture
Optimization Loop – Four Phases
Phase 1 – Environment Setup: Install dependencies, fetch the issue, create a branch.
Phase 2 – Core Iteration: Agents alternate (odd rounds: Codex implements → Claude reviews; even rounds: Claude implements → Codex reviews), run tests, compute the weighted score, and either proceed to the next round or move to Phase 3.
Phase 3 – Automatic Submission: When the score reaches the threshold, the system commits, pushes, creates a PR, and merges it.
Phase 4 – Archiving: All logs, test results, and iteration details are written to results.tsv and per‑issue log files for traceability.
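The Phase 2 control flow can be sketched roughly as follows. This is an illustration under assumptions, not the project's run.sh: run_round is a hypothetical callback standing in for "implement, test, review, score", and it is assumed to return the weighted score together with the reviewer's feedback text.

```python
from typing import Callable, List, Optional, Tuple

PASS_THRESHOLD = 9.0  # threshold stated in the article

# Hypothetical callback: (implementer, reviewer, previous feedback) -> (score, new feedback)
RoundFn = Callable[[str, str, Optional[str]], Tuple[float, str]]


def roles_for_round(round_no: int) -> Tuple[str, str]:
    """Return (implementer, reviewer); odd rounds Codex implements, even rounds Claude."""
    if round_no < 1:
        raise ValueError("rounds are numbered from 1")
    return ("codex", "claude") if round_no % 2 == 1 else ("claude", "codex")


def run_iterations(run_round: RoundFn, max_rounds: int) -> Tuple[bool, List[Tuple[int, str, float]]]:
    """Iterate until the score reaches the threshold or the round budget is spent."""
    feedback: Optional[str] = None
    history: List[Tuple[int, str, float]] = []
    for round_no in range(1, max_rounds + 1):
        implementer, reviewer = roles_for_round(round_no)
        score, feedback = run_round(implementer, reviewer, feedback)
        history.append((round_no, implementer, score))
        if score >= PASS_THRESHOLD:
            return True, history  # hand over to Phase 3 (commit, PR, merge)
    return False, history
```

Passing the round logic in as a callback keeps the loop testable without calling any real agent; a stub that replays a fixed list of scores is enough to exercise the alternation and the stop condition.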
Scoring Details
The scoring table maps issue‑level findings to a 0‑10 scale for each of the five dimensions; the weighted sum of those dimension scores determines whether the iteration passes.
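As a rough illustration of the weighted sum, the sketch below combines five per‑dimension scores using the weights listed in the article. The dictionary keys and function names are assumptions made for this example, and how each dimension's 0‑10 score is derived from review findings is not modeled here.

```python
# Weights in percent, as listed in the article; integer weights keep the arithmetic exact.
WEIGHTS = {
    "correctness": 35,
    "tests": 25,
    "style": 20,
    "security": 10,
    "performance": 10,
}
PASS_THRESHOLD = 9.0


def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (each 0-10) into a single 0-10 total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} score must be between 0 and 10")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def passes(scores: dict) -> bool:
    """True when the weighted total reaches the auto-merge threshold."""
    return weighted_score(scores) >= PASS_THRESHOLD
```

Because correctness and tests together carry 60% of the weight, a weak result in either one is hard to offset with the remaining dimensions.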
Example Iterations
Iteration 1 (Codex): Score 1.0 – implementation incomplete.
Iteration 2 (Claude): Score 5.0 – timeout control added, but still missing parts.
Iteration 3 (Codex): Score 9.0 – all criteria met, PR auto‑merged.
For Issue #21, the whole process took about ten minutes and three iterations, achieving a final score of 9.0/10.
Termination Conditions
Score ≥ 9.0 → commit and merge.
Score < 9.0 → feedback drives the next round.
Test failure → feedback "test failed" triggers a corrective round.
Three consecutive agent failures → abort.
API errors → exponential back‑off with jitter (max 60 s, 10 retries).
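The retry rule for API errors can be sketched as below. The 60‑second cap and the limit of 10 retries come from the list above; the 1‑second base delay, the "full jitter" strategy (a uniformly random delay within the current window), and the function names are assumptions, since the article does not say which variant the project uses.

```python
import random
import time
from typing import Callable, Iterator


def backoff_delays(max_retries: int = 10, base: float = 1.0, cap: float = 60.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one delay per retry from an exponentially growing window capped at `cap`."""
    for attempt in range(max_retries):
        window = min(cap, base * 2 ** attempt)
        yield rng() * window  # full jitter: anywhere between 0 and the window


def call_with_retries(fn: Callable[[], object], max_retries: int = 10,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Call fn, retrying on any exception until the retry budget is exhausted."""
    delays = backoff_delays(max_retries=max_retries)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

Jitter spreads retries out in time so that concurrent callers do not hit the API again in lockstep; a real pipeline would likely retry only on errors known to be transient rather than on every exception.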
Quick Start
# Check prerequisites
gh auth status # GitHub CLI is installed and authenticated
which acpx # Agent control tool is on PATH
go version # Go runtime is installed
# Run the pipeline on issue 21 (max 10 iterations)
./run.sh 21 10
The script automatically checks the environment, fetches the issue, creates a branch, runs the alternating Codex/Claude cycles, and merges when the quality threshold is met.
Best Practices
Start with small, well‑scoped issues.
Keep program.md up‑to‑date to reflect new constraints.
Monitor the scoring trend in log.md to ensure steady improvement.
Leverage the multi‑agent adversarial setup to catch blind spots.
Rely on the built‑in exponential back‑off for unstable API calls.
Design Inspiration
The project combines three open‑source components:
karpathy/autoresearch – core loop with quantitative improvement criteria.
acpx – command‑line tool that lets Claude and Codex act as agents.
smallnest/imclaw – the target repository where the automated changes are applied.
By unifying these tools, the system demonstrates that AI‑driven autonomous development is feasible for real‑world Go projects.
High Availability Architecture
Official account for High Availability Architecture.
