From /goal to Long‑Running Asynchronous Agents: Making AI Sustainably Deliver Complex Tasks
By experimenting with OpenAI’s /goal feature, the author shows how to turn ad‑hoc AI prompts into a structured, long‑running loop that records progress in Git, README and test artifacts, enabling agents to handle complex engineering tasks across multiple sessions with clear checkpoints and human‑in‑the‑loop control.
Introduction: Three /goal Experiments
Using the /goal command in OpenAI Codex, the author attempted three tasks (Office, IDA, and Contra) and observed that the agent could not solve a large problem in a single conversation; it needed a loop that grounds the problem in engineering reality.
These experiments demonstrated that complex tasks must be broken into a system that can be handed off, verified, and corrected across multiple agent runs.
Loop: Regaining Control from Determinism
An agent confined to a single chat window is limited by that window. Addy Osmani defines a long-running agent as one that can continue for hours or days, crossing context windows while producing structured artifacts. The key terms are “recovery” and “structured artifacts”, because models forget, sessions end, sandboxes reset, and commands fail.
The model itself may be short‑lived, but the workspace must persist: tasks, plans, validation results, failure logs, and Git history remain so the next agent round does not have to guess the current state.
The Ralph Loop exemplifies the minimal version: read the PRD, pick the next unfinished task, call a coding agent, run tests, update a progress file, commit, and start the next round. The agent does not need to stay online; the work state only needs to be recorded in a form the next round can pick up.
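The round structure described above can be sketched as follows. This is a minimal illustration, not Ralph's actual implementation: the `Task`, `RoundDeps`, `runRound`, and `runLoop` names are hypothetical, and the agent call, test run, and commit are injected as plain functions so the loop logic stays visible.

```typescript
// Hypothetical types; none of these names come from Ralph itself.
interface Task {
  id: string;
  done: boolean;
}

interface RoundDeps {
  runAgent: (task: Task) => void;    // stand-in for calling a coding agent
  runTests: () => boolean;           // stand-in for the project's test command
  commit: (message: string) => void; // stand-in for a Git commit
}

interface RoundResult {
  taskId: string | null;
  passed: boolean;
}

// One round: pick the next unfinished task, run the agent, verify, record.
function runRound(tasks: Task[], progress: string[], deps: RoundDeps): RoundResult {
  const next = tasks.find((t) => !t.done);
  if (!next) return { taskId: null, passed: true }; // nothing left to do

  deps.runAgent(next);
  const passed = deps.runTests();

  // The progress log is written whether or not the round succeeded,
  // so the next round never has to guess what happened.
  progress.push(`${next.id}: ${passed ? "passed" : "failed"}`);
  if (passed) {
    next.done = true;
    deps.commit(`complete ${next.id}`);
  }
  return { taskId: next.id, passed };
}

// Each iteration starts from the recorded state, not from agent memory.
// maxRounds is an assumed safety limit so a stuck task cannot loop forever.
function runLoop(tasks: Task[], progress: string[], deps: RoundDeps, maxRounds: number): number {
  let rounds = 0;
  while (rounds < maxRounds && tasks.some((t) => !t.done)) {
    runRound(tasks, progress, deps);
    rounds++;
  }
  return rounds;
}
```

A failed round leaves the task unfinished but still appends a line to the progress log, which is the property the loop depends on: the next round retries from evidence rather than from recollection.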
The often‑underestimated component is Memory: README, PLAN, progress files, test reports, and Git logs are the real enablers for long‑running tasks.
Designing Long‑Running Goals
Goals are grouped by their acceptance criteria. The first class mimics Codex’s Office capability: a Node.js wrapper (packages/office) that uses .NET 9 browser-wasm to read DOCX, PPTX, and XLSX files and exposes those functions through an external API.
The second class replicates the IDA demo with Next.js, breaking “like IDA” into verifiable targets: a lib/pe.ts that analyses real samples/kernel32.dll, an /api/analyze endpoint for GET/POST, and a UI component IdaWorkbench.tsx that shows functions, hex view, imports, exports, strings, segment map, and output. Validation is performed by a suite of npm run test commands comparing details, text, functions, right‑views, coverage, and UI.
The third class is a playful “Contra” task: building an NES emulator from scratch to run the game Contra. The bigger the name of the goal, the more specific its stop condition must be.
Artifacts: Git, Reports and Failure Records
Long‑running agents inevitably fail. They may take wrong paths, misjudge completion, or introduce new errors while fixing old ones. The crucial question is whether anything is left after a failure.
Git acts as a black box recorder, not just version control: it records what changed, which round introduced risk, which verification passed, which rollback occurred, and which commit can serve as a recovery point. Anthropic’s harness article also stresses that incremental work, progress logs, and Git commits reduce the cost of re-guessing state for the next round.
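As a rough illustration of the "recovery point" idea, the sketch below assumes each round leaves a small record next to its commit. The `RoundRecord` shape and `findRecoveryPoint` function are hypothetical and not part of Git or any tool mentioned in the article.

```typescript
// Hypothetical record of one agent round; field names are illustrative.
interface RoundRecord {
  commit: string;      // commit hash produced by the round
  verified: boolean;   // did the round's verification pass?
  rolledBack: boolean; // was the round later reverted?
}

// Walk the history from newest to oldest and return the most recent
// commit that passed verification and was not rolled back.
function findRecoveryPoint(history: RoundRecord[]): string | null {
  const newestFirst = history.slice().reverse(); // copy, so the input is untouched
  for (const round of newestFirst) {
    if (round.verified && !round.rolledBack) return round.commit;
  }
  return null; // no safe point recorded yet
}
```

The point of the design is that the answer comes from recorded evidence: a round that cannot name a verified commit has nothing safe to return to.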
Reports serve a similar purpose. The IDA demo README lists both passed and failed items, informing the next agent that the 14 lines of RtlVirtualUnwind are aligned (passed) while pixel parity is still pending (failed).
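A status report of this kind can be kept as structured data and rendered into the README. The sketch below is one possible shape under stated assumptions: `ReportItem`, `renderReport`, and `nextOpenItem` are hypothetical names, and only the two example entries come from the article.

```typescript
type Status = "passed" | "failed";

// Hypothetical shape for one line of a round report.
interface ReportItem {
  name: string;
  status: Status;
  note: string;
}

// Render passed and failed items as two plain-text sections.
function renderReport(items: ReportItem[]): string {
  const section = (status: Status, title: string): string[] =>
    [title].concat(
      items.filter((i) => i.status === status).map((i) => `- ${i.name}: ${i.note}`)
    );
  return section("passed", "Passed").concat(section("failed", "Failed")).join("\n");
}

// The next agent starts from the first item that is still failing.
function nextOpenItem(items: ReportItem[]): ReportItem | null {
  const open = items.filter((i) => i.status === "failed");
  return open.length > 0 ? open[0] : null;
}

// The two entries mentioned in the IDA demo README.
const idaReport: ReportItem[] = [
  { name: "RtlVirtualUnwind", status: "passed", note: "14 lines aligned" },
  { name: "pixel parity", status: "failed", note: "still pending" },
];
```

Listing failures alongside successes is what lets the next round skip re-checking `RtlVirtualUnwind` and go straight to pixel parity.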
These artifacts belong in the /insights layer, providing a post‑mortem for each round: extracting failure paths, evidence, implicit rules, and next steps to thicken project memory.
Complex Tasks Require Human Intervention
Long‑running agents must operate within a constrained loop that includes planning, verification, permission boundaries, logging, checkpoints, and points where humans can intervene.
OpenAI’s long-horizon Codex loop consists of: plan, edit code, run tools (tests, build, lint), observe, repair, update docs and status, repeat. Each step generates external evidence (command output, diffs, test reports, screenshots, README status), and that evidence counts for more than the model’s own confidence.
When designing complex tasks, the author asks three practical questions:
Can the goal be split into 3‑7 checkpoints?
Does each checkpoint have a verification command or observable artifact?
Which files form the public boundary that must not be broken just to make a local check pass?
Missing any of these leads to the next agent spending extra time re‑entering the problem or, worse, re‑proving already proven facts.
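The three questions can be turned into a quick check run before a goal is handed to an agent. This is a sketch under assumptions: the `GoalPlan` and `Checkpoint` shapes and the `reviewPlan` function are hypothetical, not a format that Codex or any other tool defines.

```typescript
// Hypothetical plan format; not defined by Codex or any other tool.
interface Checkpoint {
  name: string;
  verify?: string;   // a verification command, if one exists
  artifact?: string; // or an observable artifact, such as a report file
}

interface GoalPlan {
  checkpoints: Checkpoint[];
  publicBoundary: string[]; // files that must not be broken
}

// Return a list of problems; an empty list means all three questions are answered.
function reviewPlan(plan: GoalPlan): string[] {
  const problems: string[] = [];

  // Question 1: can the goal be split into 3-7 checkpoints?
  const count = plan.checkpoints.length;
  if (count < 3 || count > 7) {
    problems.push(`expected 3-7 checkpoints, got ${count}`);
  }

  // Question 2: does each checkpoint have a command or an artifact?
  for (const cp of plan.checkpoints) {
    if (!cp.verify && !cp.artifact) {
      problems.push(`checkpoint "${cp.name}" has no verification command or artifact`);
    }
  }

  // Question 3: is the public boundary written down?
  if (plan.publicBoundary.length === 0) {
    problems.push("no public boundary files listed");
  }
  return problems;
}
```

For the IDA goal, a plan might list checkpoints for lib/pe.ts, the /api/analyze endpoint, and IdaWorkbench.tsx, each with its own test command; which files count as the public boundary is a decision the author of the plan has to make explicitly.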
Future: Managing an Asynchronous Engineering Team
Comparing the Ralph Loop, Claude Code, and Codex reveals a common direction: the Ralph Loop proves the minimal loop (task list, progress file, Git, fresh context); Claude Code brings loops, hooks, skills, and insights into everyday developer tools; Codex’s /goal attaches a durable objective to a thread, allowing agents to progress across turns, with pause, resume, and clear controls.
The shared management challenge is coordinating multiple asynchronous executors on the same set of engineering evidence.
The ultimate takeaway: do not try to make the agent remember everything; instead, design the workspace so the next round can pick up where the previous one left off.
https://developers.openai.com/codex/use-cases/follow-goals
https://developers.openai.com/blog/run-long-horizon-tasks-with-codex
https://addyosmani.com/blog/long-running-agents/
https://github.com/snarktank/ralph
https://code.claude.com/docs/en/commands
https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.