Machine Heart
May 25, 2026 · Artificial Intelligence
Claude’s Pass Rate Under 4%: SaaS‑Bench Shatters the “Fully Automated Office” Dream
SaaS‑Bench evaluates AI agents on 106 long‑horizon, cross‑app tasks spanning 23 real SaaS applications. Even the strongest model tested, Claude Opus 4.7, passes fewer than four percent of the tasks. The results also expose four structural failure modes that separate benchmark scores from real office productivity.
AI agents · Benchmarking · Claude Opus
