Claude’s Pass Rate Under 4%: SaaS‑Bench Shatters the “Fully Automated Office” Dream

SaaS‑Bench evaluates AI agents on 23 real SaaS applications and 106 cross‑app, long‑horizon tasks. Even the strongest model tested, Claude Opus 4.7, fully passes fewer than four percent of them, and the results expose four structural failure modes that separate benchmark scores from true office productivity.

Machine Heart

May 25, 2026

Recent hype around AI agents promises fully automated office work, yet real‑world tasks require understanding business goals, navigating multiple applications, maintaining state, and completing hundreds of steps accurately.

SaaS‑Bench addresses this gap by containerizing 23 open‑source SaaS systems (e.g., OpenProject, Metabase, OpenEMR, Mattermost) with full front‑end, back‑end, and database logic. The suite covers six domains and populates each system with realistic business data. Using an "LLM‑generated + expert‑reviewed" pipeline, 106 tasks were created, 93.4% of which span at least two applications; half involve three apps, and 32 tasks require multimodal understanding. Typical task trajectories exceed 100 steps, with the longest over 300 steps.

Agents interact with these environments via a Browser‑Use interface and are scored on two metrics: Resolved Score (strict, all checkpoints must pass) and Checkpoint Score (weighted partial credit). The evaluation runs each model three times per task, counting a task as passed if any run succeeds.
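The relationship between the two metrics and the any-of-three pass rule can be sketched in a few lines of Python. This is only an illustration: the article does not give the exact weighting formula, so the normalized weighted sum below, along with the `Checkpoint` class and all function names, are assumptions rather than the benchmark's actual code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    weight: float   # assumed: relative importance of this checkpoint
    passed: bool    # whether the agent satisfied it in this run


def checkpoint_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted partial credit: passed weight divided by total weight (assumed formula)."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    return sum(c.weight for c in checkpoints if c.passed) / total


def resolved(checkpoints: list[Checkpoint]) -> bool:
    """Strict metric: a run resolves the task only if every checkpoint passes."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


def passed_in_any_run(runs: list[list[Checkpoint]]) -> bool:
    """Multi-run rule: the task counts as passed if at least one run resolves it."""
    return any(resolved(run) for run in runs)


# Hypothetical task with three checkpoints, attempted twice.
run_a = [Checkpoint(2.0, True), Checkpoint(1.0, False), Checkpoint(1.0, True)]
run_b = [Checkpoint(2.0, True), Checkpoint(1.0, True), Checkpoint(1.0, True)]

print(checkpoint_score(run_a), resolved(run_a))  # 0.75 False
print(passed_in_any_run([run_a, run_b]))         # True
```

The example shows why the two numbers can diverge so widely: `run_a` earns 75% partial credit yet contributes nothing to the Resolved Score.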

Results show a stark performance gap. Claude Opus 4.7 achieves the highest checkpoint score, 43.9%, but a resolved score of only 3.8%: it fully passes just 4 of 106 tasks. The text‑only models DeepSeek V4, M2.7, and GLM5.1 score lower, while Kimi K2.5 and Gemini 3.1 Pro resolve no tasks at all. Running models three times improves pass rates by roughly eight percentage points; Claude Sonnet 4.6’s checkpoint score rises from 33.9% to 52.1% in the multi‑run setting, yet its execution remains highly unstable.

Four structural failure modes explain the low scores:

Failure 1 – Length decay: Success probability falls sharply as task length grows. Even with a 95% per‑checkpoint pass rate, the chance of passing all 12 checkpoints of a task is only about 54% (0.95^12 ≈ 0.54, assuming checkpoints pass independently), and SaaS‑Bench tasks often have more checkpoints than that.

Failure 2 – Error propagation: A single mistake (e.g., creating a customer under the wrong entity) cascades, causing downstream errors that account for up to 30% of total score loss.

Failure 3 – Missing verification: Agents may believe a step succeeded (e.g., correcting a date) but fail to re‑inspect the UI, leaving the system in an incorrect state.

Failure 4 – Run‑to‑run variance: Identical initial states still produce divergent trajectories; Claude Sonnet 4.6 scores range from 0.00 to 0.68 across three runs, highlighting path‑dependence.
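The arithmetic behind Failure 1 can be reproduced with a short Python sketch. It assumes every checkpoint passes independently with the same probability, a simplification the article does not state explicitly; the function names are illustrative only.

```python
def full_pass_probability(per_checkpoint_rate: float, num_checkpoints: int) -> float:
    """Chance of passing every checkpoint, assuming independent checkpoints."""
    return per_checkpoint_rate ** num_checkpoints


def required_per_checkpoint_rate(target: float, num_checkpoints: int) -> float:
    """Per-checkpoint rate needed to fully pass a task with probability `target`."""
    return target ** (1.0 / num_checkpoints)


# The article's example: 95% per checkpoint, 12 checkpoints.
print(f"{full_pass_probability(0.95, 12):.2f}")  # prints 0.54

# How the full-pass chance shrinks as tasks get longer.
for n in (6, 12, 18):
    print(n, round(full_pass_probability(0.95, n), 3))

# Inverse view: rate needed per checkpoint for a 90% full-pass chance over 12 checkpoints.
print(required_per_checkpoint_rate(0.90, 12))
```

Read in reverse, the same model says that a 90% chance of fully passing a 12‑checkpoint task needs a per‑checkpoint rate above 99%, which is why small per‑step error rates dominate long‑horizon results.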

Correlation analysis confirms that scores decline as a task spans more applications (from 53% with one app to 20% with four), as step count increases, and as the number of checkpoints grows (from 65% with at most 6 checkpoints to 27% with 18 or more). These dimensions mirror the complexity of genuine office workflows.

The study concludes that current AI‑agent benchmarks vastly overestimate real‑world capability. Agents lack persistent‑state reasoning, closed‑loop verification, and robust error recovery. Consequently, the authors argue that SaaS designed for humans must be re‑engineered for agents, and that the observed failure patterns represent deeper limits of the present agent paradigm.
