Why AI Agents Fail Beyond Hallucinations
The article catalogs dozens of AI agent failure modes, from one‑shot attempts and cold‑start amnesia to hidden harness control, explains why these issues quickly overwhelm developers, and then outlines concrete mitigation strategies and their trade‑offs.
Failure Modes
One‑shotting : trying to finish an entire app in a single pass exhausts the context window and leaves a half‑baked product (Anthropic long‑running agent).
Progress‑as‑completion : mistaking any repository activity for task completion (Anthropic long‑running agent).
Cold‑start amnesia : new sessions start with no memory or playbook, forcing costly guesswork (Anthropic long‑running agent).
Ugly wish‑granting : vague prompts lead the agent to implement the request literally, often producing something worse than what the user actually wanted (author observation).
Spec‑deliverable confusion : treating temporary design artifacts as final deliverables, bundling scaffolding with the intended product (author observation).
Default‑fill slop : unspecified parts are filled with generic code, UI, or product choices learned from training data (Mario Zechner; Anthropic app harness).
Overengineering by default : agents add unnecessary abstraction layers, duplicate code, and backward‑compatibility guards because that is what they see in open‑source code (Mario Zechner).
Working‑memory rot : important information degrades as the context window grows (Random Labs Slate).
Hidden harness control : tools silently modify prompts, context, or observability without user awareness (Mario Zechner).
Lossy compaction : state compression discards crucial information unpredictably (Random Labs Slate).
Local patching : each step looks reasonable, but the overall system becomes increasingly hard to reason about (Mario Zechner).
Summary‑only handoff loss : sub‑agents hand over thin summaries instead of full state, breaking integration (Random Labs Slate).
Async reconciliation failure : parallel work creates ambiguity about which results are final and how to combine them (Random Labs Slate).
Blind N‑step execution : long‑running task blocks receive no feedback until they hit a wall (Random Labs Slate).
Plan drag : rigid plans resist adaptation when reality changes, leading to stale task trees (Random Labs Slate).
Overdecomposition : layered planner/executor/reviewer architecture works technically but adds bureaucracy, latency, and inertia (Random Labs Slate).
Validation interruption : injecting diagnostics mid‑edit confuses the model before a coherent change is formed (Mario Zechner).
False E2E completion : unit tests or curl requests pass while the real user flow remains broken (Anthropic long‑running agent).
Functional but wrong : output passes checks but feels awkward, overly complex, or deviates from task intent (author’s earlier article).
Self‑review softness : agents over‑rate their own mediocre work (Anthropic app harness).
Modality blind spots : QA tools cannot see, hear, or interact the way real users do, so they miss bugs that only show up through those modalities (Anthropic app harness).
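To make lossy compaction and working‑memory rot concrete, here is a minimal sketch of a compaction step that keeps explicitly pinned messages and the most recent turns, and replaces everything else with a visible stub. The Message class, the pinned flag, and the compact function are hypothetical names for illustration, not the API of any harness cited here. A real system would summarize rather than drop.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    pinned: bool = False  # hypothetical flag: this message must survive compaction


def compact(history: list[Message], keep_recent: int) -> list[Message]:
    """Keep pinned messages and the newest keep_recent messages; stub out the rest."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    cutoff = max(len(history) - keep_recent, 0)
    old, recent = history[:cutoff], history[cutoff:]

    kept = [m for m in old if m.pinned]
    dropped = len(old) - len(kept)

    result = list(kept)
    if dropped:
        # A real system would insert a summary here. The stub makes the loss explicit.
        result.append(Message(f"[{dropped} earlier messages dropped]"))
    result.extend(recent)
    return result


history = [
    Message("Constraint: never touch the billing schema", pinned=True),
    Message("tried approach A"),
    Message("approach A failed"),
    Message("now editing api.py"),
]
for m in compact(history, keep_recent=1):
    print(m.text)
```

The sketch shows the trade‑off rather than removing it: anything nobody thought to pin is still lost, which is the unpredictability the list above describes.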
Why It’s Painful
Two related problems make these failures especially draining. First, agents can generate code, tests, issues, and pull requests faster than humans can review them, shifting the bottleneck from typing to judgment; without human oversight, critical knowledge is lost. Second, the same flood of AI‑generated artifacts spills beyond the codebase into issue trackers, PR comments, auto‑generated documentation, and public posts, creating an "AI trash" problem that overwhelms reviewers.
The practical rule emerging from this analysis is to keep generated work within a reviewable scope, use agents where validation cost is low, and retain enough human understanding to say "no" when necessary.
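One way to read this rule is as a simple gate on each batch of generated work. The sketch below is an illustration only: the ChangeSet fields, the thresholds, and the three outcome labels are assumptions made for the example, not values taken from the article or its sources.

```python
from dataclasses import dataclass


@dataclass
class ChangeSet:
    files_changed: int
    lines_changed: int
    has_cheap_validation: bool  # e.g. a fast, trusted test suite covers the change


def review_decision(change: ChangeSet, max_files: int = 10, max_lines: int = 400) -> str:
    """Route a generated change by size and by how cheaply it can be validated."""
    # Too large to review properly: ask for smaller pieces instead of skimming.
    if change.files_changed > max_files or change.lines_changed > max_lines:
        return "split"
    # Small, but nothing cheap can confirm it: a human has to read it closely.
    if not change.has_cheap_validation:
        return "human-review"
    # Small and cheaply checkable: run the checks, then accept.
    return "accept-after-checks"


print(review_decision(ChangeSet(files_changed=3, lines_changed=120, has_cheap_validation=True)))
print(review_decision(ChangeSet(files_changed=40, lines_changed=5000, has_cheap_validation=True)))
```

The budget numbers are arbitrary. What matters is that an oversized change is rejected outright, which keeps the ability to say "no" intact.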
Fixes and Their Trade‑offs
Context reset : mitigates long‑task drift but can cause handoff artifacts to become critical state, making subsequent sessions fragile.
Compaction : helps sustain long‑running agents yet risks unpredictable loss of important state.
Feature/task lists : prevents one‑shotting and premature completion, but can harden plans, cause stale state, and turn check‑boxes into formality.
Strict task tree : enables early termination and complete decomposition, but reduces expressive power and hampers adaptation to changing reality.
Sub‑agents : provide context isolation and parallel exploration, yet produce thin summaries, limited messaging, and merging challenges.
Independent evaluator : counters self‑praise and weak review, but may still miss issues and generate "shape‑conforming" garbage.
Browser/E2E testing : catches cases where local checks pass but the real user flow fails, though tool blind spots and perception limits remain.
User‑maintained minimal framework : addresses hidden vendor behavior and shallow extensibility, but pushes safety and workflow costs back to the user.
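The feature‑list fix can be sketched as a small file that survives between sessions, so a new session does not start cold and works on one item at a time. Everything here is hypothetical: the features.json layout, the passes and evidence fields, and the helper names are assumptions for illustration, not the format used by any of the cited harnesses.

```python
import json
from pathlib import Path
from typing import Optional


def load_features(path: Path) -> list:
    """Read the persisted feature list at the start of a session."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_features(path: Path, features: list) -> None:
    """Write the feature list back so the next session can pick it up."""
    path.write_text(json.dumps(features, indent=2), encoding="utf-8")


def next_feature(features: list) -> Optional[dict]:
    """Return the first feature not yet marked as passing, or None if all pass."""
    for feature in features:
        if not feature.get("passes", False):
            return feature
    return None


def mark_passed(features: list, name: str, evidence: str) -> None:
    """Mark a feature as passing only when some verification evidence is recorded."""
    if not evidence.strip():
        raise ValueError("refusing to mark a feature as passing without evidence")
    for feature in features:
        if feature["name"] == name:
            feature["passes"] = True
            feature["evidence"] = evidence
            return
    raise KeyError(name)
```

A session would call load_features, work on the result of next_feature, and call mark_passed only after an end‑to‑end check, then save_features. The evidence field does not stop the check‑box from becoming a formality, since an agent can write weak evidence. It only makes the claim inspectable by a reviewer.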
References
Anthropic, "Effective harnesses for long‑running agents" (Nov 2025).
Anthropic, "Harness design for long‑running application development" (Mar 2026).
Random Labs, "Slate: moving beyond ReAct and RLM" (Mar 2026).
Mario Zechner, "Building Pi in a World of Slop" (AI Engineer Conference).
"Long‑Horizon Agents Are Here. Full Autopilot Isn't."
