Why AI’s “I’ve Tested It” Isn’t Enough: Implementing a Verification Gate Workflow
AI agents often claim a task is complete without providing verifiable evidence. This article introduces a Verification Gate that requires concrete command, result, coverage, and risk information, and shows how risk-based layers, hooks, and subagents keep the completion of AI-driven code changes honest and traceable.
You may have seen AI agents finish coding and reply, “Done, tests passed,” yet when you ask which tests ran, they only say the logic looks fine, providing no command logs, outputs, or scope of verification. This lack of evidence is the real danger, not the absence of testing.
A Verification Gate enforces one rule: “no retrievable evidence, not completed.” Every completion claim must be accompanied by a structured verification receipt.
Verification record example:
Verification record:
- Command: npm test -- auth.spec.ts
- Result: Passed, 12 tests passed
- Coverage: login form validation, token refresh, error prompts
- Not covered: real OAuth callback (requires staging review)
- Risk: no DB schema changes, no payment module impact
The receipt must include the command, the result, the coverage, and any unverified items. Running full CI every time is not required, but the verification strength must match the risk level of the change.
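As a minimal sketch, the required fields can be checked mechanically. The function name `check_receipt` and the matching rule (a field counts only when its label is followed by a value) are assumptions for illustration; the field labels follow the example record above.

```bash
#!/usr/bin/env bash
# Sketch: confirm that a receipt file fills in every required field.
# Field labels follow the example record; the matching rule is an assumption.

# Prints one "missing: <field>" line per absent or empty field.
# Returns 0 only when every field is present with a value.
check_receipt() {
  local file="$1" missing=0 field
  for field in "Command" "Result" "Coverage" "Not covered"; do
    # Present means: optional "- " bullet, the label, a colon, then a non-empty value.
    if ! grep -Eq "^[[:space:]]*-?[[:space:]]*${field}:[[:space:]]*[^[:space:]]" "$file"; then
      echo "missing: ${field}"
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_receipt receipt.md` (file name hypothetical) only proves the fields exist. It cannot tell whether the listed command really ran, which is why the stronger gate levels described later are still needed.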
Risk‑Based Verification Layers
Static layer: for documentation, formatting, and type or lint fixes; use a formatter, lint, and typecheck.
Unit layer: for pure functions and component utilities; use unit tests and snapshot tests.
Integration layer: for APIs, databases, and state flow; use integration tests, a local server, and SQL checks.
Experience layer: for user paths, releases, rich text, payment, and login; use browser checks, manual preview, and staging acceptance.
Choosing the appropriate layer prevents over‑testing low‑risk changes while ensuring high‑risk modifications receive thorough verification.
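One way to make the layer choice repeatable is to derive it from the changed paths. The sketch below is illustrative only: the function names and every path pattern are assumptions, and a real project would substitute its own directory layout.

```bash
#!/usr/bin/env bash
# Sketch: map changed file paths to a verification layer.
# All path patterns here are illustrative assumptions, not a project standard.

layer_for_path() {
  case "$1" in
    *.md|docs/*)                  echo "static" ;;
    *checkout*|*payment*|*login*) echo "experience" ;;
    */api/*|*/db/*|*.sql)         echo "integration" ;;
    *)                            echo "unit" ;;
  esac
}

# Prints the strongest layer required across all paths given as arguments.
required_layer() {
  local -a order=(static unit integration experience)
  local rank=0 path layer i
  for path in "$@"; do
    layer="$(layer_for_path "$path")"
    for i in "${!order[@]}"; do
      # Keep the highest-ranked layer seen so far.
      if [ "${order[$i]}" = "$layer" ] && [ "$i" -gt "$rank" ]; then
        rank="$i"
      fi
    done
  done
  echo "${order[$rank]}"
}
```

For example, `required_layer $(git diff --name-only)` would report the strongest layer for the current change set, so a diff that touches both a README and an API handler is held to the integration layer rather than the static one.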
Review‑Repair‑Validate Loop
Effective AI coding follows three steps:
Review: identify problems and evidence.
Repair: make minimal fixes.
Validate: run checks, capture feedback, and feed it back into the next repair round.
Example for a front‑end bug:
Review:
- Reproduce missing state refresh after button click.
- Locate useCheckout.ts, CheckoutButton.tsx.
Repair:
- Fix cache invalidation after mutation success.
- Avoid unrelated component refactor.
Validate:
- Run checkout‑related tests.
- Open local page and perform click path.
- Record test command and page check results.
If validation fails, the agent must treat the failure as new input rather than asserting theoretical correctness.
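The loop can be sketched as a bounded retry in which each failed validation is handed back to the repair step. `validate_loop` and `repair_step` are hypothetical names, and `repair_step` is only a placeholder for however a given setup returns the failure log to the agent.

```bash
#!/usr/bin/env bash
# Sketch: rerun a validation command for a bounded number of rounds,
# passing each failure log to a repair step.

# Hypothetical placeholder: a real setup would send "$1" (the failure log)
# back to the agent as input for the next fix.
repair_step() {
  :
}

# Usage: validate_loop <max_rounds> <command> [args...]
validate_loop() {
  local max_rounds="$1"; shift
  local round log
  log="$(mktemp)"
  for ((round = 1; round <= max_rounds; round++)); do
    if "$@" >"$log" 2>&1; then
      echo "validated in round ${round}: $*"
      rm -f "$log"
      return 0
    fi
    echo "round ${round} failed; feeding output back into repair" >&2
    repair_step "$log"
  done
  rm -f "$log"
  echo "still failing after ${max_rounds} rounds: $*" >&2
  return 1
}
```

The round limit is a design choice: without it, an agent that cannot fix the problem would loop forever instead of reporting a blocker.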
Stop Hook Enforcement
Hooks can block an agent from finishing without a verification receipt. A minimal Bash Stop hook checks for the presence of a verification record and aborts with exit code 2 if missing:
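#!/usr/bin/env bash
# Assumes the hook runner passes the path to the agent's final message as the first argument.
message_file="$1"
# Block completion when the message contains no verification record.
if ! grep -qi "verification record" "$message_file"; then
  echo "Please add a verification record: command, result, coverage, not verified." >&2
  exit 2  # exit code 2 signals that the completion is blocked
fi
Gate checks can be staged by cost: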
Text receipt: low cost, just ensure verification fields are present.
Command receipt: low cost, verify actual command and result are listed.
File receipt: medium cost, generate .agent/verification.md.
Script receipt: medium cost, ensure commands like npm test or typecheck run.
CI receipt: high cost, require PR status to be passing.
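The file and script receipts can be combined: run the command, then write what actually happened. This is a sketch under assumptions. The `.agent/verification.md` path comes from the list above, while the function name `write_receipt`, the field layout, and the 20-line output excerpt are illustrative choices.

```bash
#!/usr/bin/env bash
# Sketch: run a verification command and record the outcome in a receipt file.

# Usage: write_receipt <receipt_path> <command> [args...]
# Returns the exit status of the command that was run.
write_receipt() {
  local receipt="$1"; shift
  local output status
  # Capture combined output and the real exit status of the command.
  output="$("$@" 2>&1)" && status=0 || status=$?
  mkdir -p "$(dirname "$receipt")"
  {
    echo "Verification record:"
    echo "- Command: $*"
    if [ "$status" -eq 0 ]; then
      echo "- Result: Passed (exit 0)"
    else
      echo "- Result: Failed (exit ${status})"
    fi
    echo "- Output (last 20 lines):"
    printf '%s\n' "$output" | tail -n 20 | sed 's/^/    /'
  } >"$receipt"
  return "$status"
}
```

A call such as `write_receipt .agent/verification.md npm test -- auth.spec.ts` leaves a receipt whose result line comes from the exit status rather than from the agent's own summary.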
Permission Gate vs. Verification Gate
Permission gates answer “Can this action be performed?” while verification gates answer “Is there evidence the result meets requirements?” A 2026 paper measuring the Permission Gate reports an 81% false‑negative rate for Claude Code Auto Mode, a reminder that permission checks do not guarantee result quality.
Three‑Layer Governance
Permission Gate – allow/deny, sandbox, approval.
Isolation Gate – where the task runs (worktree, container, profile).
Verification Gate – tests, reviews, receipts, CI.
Verifier Subagent
A read‑only verifier subagent performs three actions:
Read diff and related files.
Compare against user requirements and project rules to find gaps.
Return an evidence report indicating whether the task can be completed.
Prompt template example:
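Act as a read-only verifier and check the current change.
Scope:
- Read only the git diff, related tests, and project rules.
- Do not modify files.
- Do not propose large-scale refactoring.
Checks:
- Are all user requirements covered?
- Is anything claimed as verified that was not actually verified?
- Do the verification commands match the risk surface of the change?
- Are there obvious regressions, missing tests, or documentation inconsistencies?
Output:
- verdict: pass / needs-work
- evidence: 3-5 specific pieces of evidence
- missing verification: items not verified
- required next action: what must be done next
The verifier should remain read‑only to keep responsibility boundaries clear.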
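The verifier's report can then gate completion. The sketch below assumes the report contains a line such as `- verdict: pass`, as in the template's output section. The function name `verifier_verdict` and the choice to treat anything unrecognized as invalid are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: read the verdict from a verifier report and turn it into an exit status.

# Prints pass, needs-work, or invalid; returns 0, 1, or 2 respectively.
verifier_verdict() {
  local report="$1" verdict
  # Take the first verdict line, drop the label, and trim trailing whitespace.
  verdict="$(grep -E '^[[:space:]]*-?[[:space:]]*verdict:' "$report" | head -n 1 \
    | sed -E 's/^[^:]*:[[:space:]]*//; s/[[:space:]]+$//')"
  case "$verdict" in
    pass)       echo "pass";       return 0 ;;
    needs-work) echo "needs-work"; return 1 ;;
    *)          echo "invalid";    return 2 ;;  # missing or unfilled verdict blocks completion
  esac
}
```

Treating an unrecognized value as invalid means an unfilled template line such as `verdict: pass / needs-work` cannot slip through as a pass.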
Minimal Template for Completion Contract
Add the following to AGENTS.md or team rule files:
## Completion Contract
Before claiming completion, provide a verification receipt:
- changed: what files or behavior changed
- command: exact commands run, or "not run"
- result: pass/fail/blocked with key output
- coverage: what the verification actually covers
- not verified: what remains unverified and why
- next risk: what a human should review first
Do not say tests passed unless a test command actually ran.
If verification is blocked, stop and report the blocker instead of claiming success.
Agents must then output a structured receipt (changed, verification, not verified, human review focus). This forces honest reporting and prevents “looks‑complete” false completions.
When verification cannot be performed, agents should explicitly list missing items, e.g.:
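Not verified:
- e2e tests were not run because TEST_DB_URL is missing locally.
- The WeChat draft-box cover image was not verified because the browser publishing flow only inserts body images.
- The full test suite was not run because this task only changes documentation and HTML.
The key principle: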
AI can drive work, but completion must be signed off with evidence.
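Reporting a blocked check can also be made mechanical. In this sketch, a check runs only when its prerequisite environment variable is set; otherwise it is recorded as not verified instead of being silently skipped. `TEST_DB_URL` comes from the example above, while `run_if_env`, `print_not_verified`, and the `NOT_VERIFIED` array are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: run a check only when its prerequisite exists; otherwise record the gap.

NOT_VERIFIED=()

# Usage: run_if_env <ENV_VAR_NAME> <label> <command> [args...]
run_if_env() {
  local var="$1" label="$2"; shift 2
  # ${!var:-} reads the variable whose name is stored in $var.
  if [ -z "${!var:-}" ]; then
    NOT_VERIFIED+=("${label} was not run because ${var} is not set.")
    return 0
  fi
  "$@"
}

# Prints the "Not verified" section of a receipt, or nothing if every check ran.
print_not_verified() {
  [ "${#NOT_VERIFIED[@]}" -eq 0 ] && return 0
  echo "Not verified:"
  printf -- '- %s\n' "${NOT_VERIFIED[@]}"
}
```

A skipped check returns success so the run can continue, but it always leaves an entry in the receipt, which keeps "not run" distinct from "passed".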
ArcThink
ArcThink makes complex information clearer and turns scattered ideas into valuable insights and understanding.
