Why Code Is the Core of Agent Harness: Deep Insights from UIUC, Meta, and Stanford

Recent coding agents such as Claude Code and Codex expose a challenge that goes deeper than generating correct code: agents must sustain long‑term tasks by continuously planning, executing, testing, and updating code. This makes code the executable, inspectable, stateful medium at the center of the agent harness.

Machine Heart

Overview

Over the past two years, large‑language‑model (LLM) coding assistants have become commonplace, and their success has traditionally been measured by whether the generated code compiles and passes tests. Systems such as Claude Code and Codex push the problem deeper: a truly capable coding agent must operate over long time windows, reading repositories, planning, modifying files, running commands, inspecting errors, fixing failures, and maintaining context across many feedback cycles.

The underlying execution system that enables such long‑term reliability is called the Agent Harness. Most discussions focus on the external components that surround the model: tools, APIs, sandboxes, memory, permission boundaries, validators, and feedback loops. This survey asks a more fundamental question: when an agent is placed in a long‑term task environment, what concrete object ties together reasoning, action, feedback, verification, and collaboration? The answer, as argued by researchers from UIUC, Meta, and Stanford, is code.

Why Code Serves as the Core Medium

Code possesses three properties that natural language lacks: it is executable, inspectable, and stateful. Executable code turns model intent into real operations such as shell commands, patches, or test scripts. Inspectable code generates objective feedback (compile errors, runtime exceptions, test results, logs, and traces) that the system can observe without relying on the model’s self‑explanations. Stateful code persists progress in repositories, file systems, configuration files, commit histories, and skill libraries, allowing the agent to remember what has been done and where to continue.
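The three properties can be illustrated with a minimal sketch. The helper names run_snippet and record, and the history.json file, are hypothetical and not taken from the survey; a real harness would also sandbox execution rather than call the interpreter directly.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_snippet(source: str, workdir: Path) -> dict:
    """Executable + inspectable: run a snippet and return what the system can observe."""
    script = workdir / "snippet.py"
    script.write_text(source)
    proc = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    # Exit code and streams are objective feedback, independent of any model narration.
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


def record(state_file: Path, entry: dict) -> list:
    """Stateful: append one feedback entry to a JSON history file and return the history."""
    history = json.loads(state_file.read_text()) if state_file.exists() else []
    history.append(entry)
    state_file.write_text(json.dumps(history, indent=2))
    return history


def demo() -> list:
    """Run a failing and a passing snippet, persisting the feedback from both."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        state_file = workdir / "history.json"
        record(state_file, run_snippet("print(1 / 0)", workdir))
        return record(state_file, run_snippet("print(1 / 1)", workdir))
```

Calling demo() yields a two-entry history: the first entry carries a non-zero exit code and a ZeroDivisionError traceback, and the second carries the successful output. Because the history lives in a file, a later step could resume from it without asking the model what happened.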

Consequently, the survey positions code as the most stable and manipulable state carrier within a harness, shifting the focus from merely listing tools to centering the execution loop around code.

Code as the Interface Layer

At the first layer, the Harness Interface, code bridges the model and the external world:

Reasoning becomes executable: Where Chain‑of‑Thought keeps intermediate reasoning in natural language, approaches such as Program‑of‑Thought (PoT) and PAL translate it into programs whose execution produces the answer; proof assistants like Lean and Coq turn reasoning into machine‑checkable proofs.

Actions become concrete: Agents like Claude Code modify files, run tests, inspect errors, and generate patches; works such as SayCan, Code as Policies, and Voyager map language goals to skill calls, control scripts, or reusable functions.

Environment modeling becomes code‑driven: Execution environments (e.g., SWE‑bench, AgentBench) expose repository state, test outcomes, logs, and simulators as structured inputs that the agent can reason over.

After integrating code, reasoning is no longer pure text, actions are no longer mere promises, and environments are no longer just descriptions—they all become executable, checkable, and updatable states.
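As a rough illustration of reasoning that becomes executable, the sketch below runs a program of the kind a PoT or PAL style prompt might elicit. The word problem, the MODEL_PROGRAM string, and the run_reasoning_program helper are all hypothetical stand-ins; no model is called.

```python
def run_reasoning_program(program: str) -> object:
    """Execute a model-written program and return the value it binds to `answer`."""
    namespace: dict = {}
    # Assumption: the program is trusted. A real harness would run it in a sandbox.
    exec(program, namespace)
    return namespace["answer"]


# Hypothetical model output for the question:
# "Pens are sold in bundles of 3 for 2 dollars. How much do 12 pens cost?"
MODEL_PROGRAM = """
pens = 12
bundle_size = 3
bundle_price = 2
answer = (pens // bundle_size) * bundle_price
"""

result = run_reasoning_program(MODEL_PROGRAM)
print(result)  # 8
```

The arithmetic is carried out by the interpreter rather than asserted by the model, so each intermediate step is a value that can be checked or rerun.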

Managing State and Feedback with Code

Real‑world tasks rarely finish in a single step. Bug fixing may require repeated locate‑modify‑test‑rollback cycles; web automation may span multiple pages and tools; scientific experiments involve hypothesis, simulation, data analysis, and iterative adjustment. The harness therefore organizes these steps into a Plan‑Execute‑Verify loop:

Planning: Plans are materialized as Plan.md, workflow files, or executable task graphs.

Memory: Relevant repository evidence, execution logs, failure experiences, and historical patches are stored, compressed, or off‑loaded as external state.

Tool use: Agents invoke terminals, sandboxes, test frameworks, or static analyzers to affect the external world.

Systems such as SWE‑agent and OpenHands exemplify this loop by repeatedly performing write code → run → fail → fix, turning errors, test failures, and logs into feedback sensors that guide convergence.
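The write code → run → fail → fix cycle can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how SWE‑agent or OpenHands are implemented: the CANDIDATES list is a hypothetical stand-in for successive model proposals, whereas a real harness would pass each failure message back to the model to generate the next attempt.

```python
from typing import List, Optional, Tuple

# Hypothetical successive proposals for a `median` function.
CANDIDATES = [
    # Attempt 1: forgets to sort.
    "def median(xs):\n    return xs[len(xs) // 2]\n",
    # Attempt 2: sorts, but mishandles even-length input.
    "def median(xs):\n    s = sorted(xs)\n    return s[len(s) // 2]\n",
    # Attempt 3: handles both cases.
    "def median(xs):\n    s = sorted(xs)\n    m = len(s) // 2\n"
    "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2\n",
]


def verify(source: str) -> Optional[str]:
    """Run the candidate against the tests; return None on success or a failure message."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        median = namespace["median"]
        assert median([3, 1, 2]) == 2, "odd-length case failed"
        assert median([4, 1, 3, 2]) == 2.5, "even-length case failed"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(candidates: List[str], max_steps: int = 5) -> Tuple[Optional[str], list]:
    """Try candidates in order until one verifies, logging the feedback from each step."""
    log = []
    for step, source in enumerate(candidates[:max_steps], start=1):
        failure = verify(source)
        log.append((step, failure))
        if failure is None:
            return source, log
    return None, log


accepted, log = repair_loop(CANDIDATES)
```

Here the first two attempts are rejected with distinct assertion messages and the third is accepted. The log, not the model's own account, is what records why each attempt failed.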

Code as the Shared Substrate for Multi‑Agent Collaboration

When a task exceeds the capacity of a single agent, multiple agents assume roles (manager, planner, coder, tester, reviewer). The main difficulty lies in sharing a consistent world state. Pure chat‑based coordination leads to divergent understandings of code changes, test failures, and applied patches.

Code provides a stable shared substrate: repositories, tests, pull requests, CI logs, review comments, and execution traces become common read‑write artifacts. Multi‑agent systems therefore rely on executable, shared code state rather than natural‑language dialogue as their lingua franca.
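One way to picture a shared substrate is a workspace that versions every file and rejects writes based on a stale view. The sketch below is an assumption-laden toy: SharedWorkspace, StaleWriteError, and the agent names are hypothetical, and real systems would typically lean on a version control system and CI for this.

```python
from dataclasses import dataclass, field


class StaleWriteError(Exception):
    """Raised when an agent writes against a version that is no longer current."""


@dataclass
class SharedWorkspace:
    files: dict = field(default_factory=dict)     # path -> content
    versions: dict = field(default_factory=dict)  # path -> version number
    log: list = field(default_factory=list)       # (agent, path, new_version)

    def read(self, path: str) -> tuple:
        """Return (content, version); unseen paths read as empty at version 0."""
        return self.files.get(path, ""), self.versions.get(path, 0)

    def write(self, agent: str, path: str, content: str, based_on: int) -> int:
        """Apply a write only if the agent saw the latest version of the file."""
        current = self.versions.get(path, 0)
        if based_on != current:
            raise StaleWriteError(
                f"{agent} edited {path} at v{based_on}, but it is now v{current}"
            )
        self.files[path] = content
        self.versions[path] = current + 1
        self.log.append((agent, path, current + 1))
        return current + 1


def demo() -> SharedWorkspace:
    """Two agents read the same version; the second writer must re-read before writing."""
    ws = SharedWorkspace()
    _, seen_by_coder = ws.read("app.py")
    _, seen_by_reviewer = ws.read("app.py")
    ws.write("coder", "app.py", "x = 1\n", seen_by_coder)
    try:
        ws.write("reviewer", "app.py", "x = 2\n", seen_by_reviewer)
    except StaleWriteError:
        content, latest = ws.read("app.py")
        ws.write("reviewer", "app.py", content + "y = 2\n", latest)
    return ws
```

The reviewer's first write is refused because the file changed underneath it, so the divergence surfaces as an explicit error in shared state rather than as two agents silently holding different beliefs about app.py.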

Beyond Coding Agents

While the code‑centric harness first emerged in software engineering, the same principles extend to GUI/OS agents (using DOM trees, accessibility trees, Playwright scripts), robotic agents (transforming language intents into skill libraries and control scripts), and scientific discovery pipelines (organizing hypotheses, simulations, and data analysis as code‑driven workflows). The authors anticipate that future agents, even those not explicitly labeled “coding agents,” will operate atop a code‑centric harness.

Open Problems

Evaluating long‑term agents requires more than final‑output metrics. Open questions include:

How to design harness‑level benchmarks that assess planning, tool invocation, state transitions, and feedback usage?

How to handle incomplete feedback where passing tests do not guarantee correctness?

How to achieve regression‑free self‑evolution without introducing new failure modes?

How to resolve semantic conflicts when multiple agents share state?

How to make human‑in‑the‑loop interactions auditable, accountable, and verifiable within the system state?
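The incomplete-feedback question in the list above can be made concrete with a small example. The function, the weak test suite, and the year range are all illustrative choices; the reference behavior comes from the standard library's calendar.isleap.

```python
import calendar
from typing import Callable, Iterable, List


def is_leap_year(year: int) -> bool:
    """Deliberately incomplete: ignores the century rules."""
    return year % 4 == 0


def weak_suite(fn: Callable[[int], bool]) -> bool:
    """A plausible but thin test suite that never probes a non-leap century year."""
    return fn(2024) and not fn(2023) and fn(2000)


def counterexamples(fn: Callable[[int], bool], years: Iterable[int]) -> List[int]:
    """Compare the candidate against calendar.isleap over a range of years."""
    return [y for y in years if fn(y) != calendar.isleap(y)]


passes_tests = weak_suite(is_leap_year)
mismatches = counterexamples(is_leap_year, range(1896, 1905))
print(passes_tests, mismatches)  # True [1900]
```

The suite reports success while the function is wrong for 1900, which is the situation a harness faces whenever test results are its only verification signal.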

In summary, the paper argues that the next step for AI agents is not merely improving answer quality, but making the entire code‑driven execution pipeline more inspectable, recoverable, and governable.

Reference: "Code as Agent Harness" (arXiv:2605.18747) and accompanying GitHub repository https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers.
