First Enterprise IT Ops Agent Benchmark Shows Claude Leads with Just 47% Score
ITBench-AA, the first benchmark built specifically for enterprise IT operations agents, tests 59 SRE scenarios. Even the top model, Claude Opus 4.7, scores only 47%, which underlines both how hard the tasks are and how wide the cost‑effectiveness gap is between proprietary and open‑source agents.
ITBench-AA Overview
ITBench-AA is a benchmark for enterprise‑level IT operations agents, built on IBM’s ITBench dataset. It targets Site Reliability Engineering (SRE) tasks that require understanding of Kubernetes resources, Prometheus alerts, distributed tracing, and micro‑service topologies.
Dataset and Task Definition
59 SRE questions (40 public, 19 hidden to prevent memorization).
Each question provides a Kubernetes fault snapshot containing alerts, events, traces, metrics, logs, and the application topology.
The agent must output the minimal set of independent root‑cause entities (e.g., Deployment, Service, Pod, NetworkPolicy).
Scoring Method
Scoring uses all‑recall average precision. If any true root cause is missing, the question score is zero; otherwise precision = correct entities ÷ total submitted entities. The final score is the average over three repeated runs for each question.
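The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the function names are hypothetical, and averaging the per-question means into one overall score is an assumption about how the final number is aggregated.

```python
from statistics import mean

def question_score(predicted, truth):
    """Score one run: 0 if any true root cause is missing, else precision."""
    predicted, truth = set(predicted), set(truth)
    # Full recall is required: every true root cause must be in the submission.
    if not predicted or not truth <= predicted:
        return 0.0
    # Precision = correct entities / total submitted entities.
    return len(predicted & truth) / len(predicted)

def benchmark_score(runs_by_question, truth_by_question):
    """Average each question over its repeated runs, then over all questions.

    Averaging across questions is an assumption made for this sketch.
    """
    per_question = [
        mean(question_score(run, truth_by_question[q]) for run in runs)
        for q, runs in runs_by_question.items()
    ]
    return mean(per_question)
```

Under this rule, a submission that names the single true root cause plus one wrong entity scores 0.5, while a submission that misses the true root cause scores 0 no matter how many entities it lists.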
Evaluation Harness
All models run on the open‑source Stirrup harness. Each task is limited to 100 dialogue turns and executed three times to smooth randomness. Agents interact with a sandboxed file system via shell commands and submit a structured JSON diagnosis.
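The turn cap and the three repetitions can be pictured as a simple loop. The sketch below is not Stirrup's API: `agent_step` and `make_agent` are hypothetical callables standing in for whatever the harness actually drives, and only the limits (100 turns, three runs) come from the description above.

```python
from typing import Callable, List, Optional, Tuple

MAX_TURNS = 100    # per-task dialogue turn limit described above
RUNS_PER_TASK = 3  # each task is executed three times

# Hypothetical agent interface: given the turn number, return a diagnosis
# (a list of entity identifiers) once ready, or None to keep investigating.
AgentStep = Callable[[int], Optional[List[str]]]

def run_task(agent_step: AgentStep, max_turns: int = MAX_TURNS) -> Tuple[List[str], int]:
    """Drive one run until the agent submits or the turn budget is exhausted."""
    for turn in range(1, max_turns + 1):
        diagnosis = agent_step(turn)
        if diagnosis is not None:
            return diagnosis, turn
    # No submission within the budget: treat as an empty diagnosis.
    return [], max_turns

def run_repeated(make_agent: Callable[[], AgentStep], runs: int = RUNS_PER_TASK):
    """Repeat the task with a fresh agent each time to smooth out randomness."""
    return [run_task(make_agent()) for _ in range(runs)]
```

Returning the turn count alongside the diagnosis makes it straightforward to report average turns per model next to its score.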
Results
Claude Opus 4.7 (adaptive reasoning, max effort) – 47%.
GPT‑5.5 – 46%.
Qwen 3.7 Max – 42%.
Other models, open‑source entries included: GLM‑5.1 – 40%; Gemini 3.5 Flash – 40%; DeepSeek V4 Pro – 38%; Gemma 4 31B – 37%; Gemini 3.1 Pro Preview – 30%.
Dialogue Turns vs. Score
More turns do not translate into a higher score. GPT‑5.5 averages 31 turns for a 46% score, while Gemini 3.1 Pro Preview averages 83 turns yet scores only 30%. Because the strict scoring penalizes over‑reporting, every uncertain entity added to a submission lowers precision.
Cost per Question
Claude Opus 4.7 – $5.38 per question.
Gemma 4 31B – $0.14 per question (37% score).
GLM‑5.1 – $1.23 per question (40% score).
Gemini 3.5 Flash – $1.70 per question (40% score).
At large deployment scale these differences become decisive: Claude Opus 4.7 costs roughly 38 times as much per question as Gemma 4 31B.
Example Question
A public question presents a front‑end failure. The agent inspects alerts to locate the time window, uses tracing and logs to narrow to front‑end traffic, maps the affected services in the topology, and discovers a NetworkPolicy named otel-demo/NetworkPolicy/frontend-block-all-ports that blocks traffic. The correct root‑cause set contains this single NetworkPolicy.
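The root-cause identifier in this example appears to follow a namespace/Kind/name layout. The sketch below parses that layout and checks a submission against the single expected entity. The three-part format is inferred from this one identifier only, and the `Entity` type and `parse_entity` helper are hypothetical names used for illustration.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    namespace: str
    kind: str
    name: str

def parse_entity(identifier: str) -> Entity:
    """Split an identifier assumed to look like 'namespace/Kind/name'."""
    parts = identifier.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace/Kind/name, got {identifier!r}")
    return Entity(*parts)

# Ground truth for the example question: one NetworkPolicy.
truth = {parse_entity("otel-demo/NetworkPolicy/frontend-block-all-ports")}

# A submission that names exactly that entity has full recall and precision 1.
submitted = {parse_entity("otel-demo/NetworkPolicy/frontend-block-all-ports")}
full_recall = truth <= submitted
precision = len(truth & submitted) / len(submitted) if full_recall else 0.0
print(parse_entity("otel-demo/NetworkPolicy/frontend-block-all-ports"), precision)
```

Had the agent also reported the affected front‑end service as a second root cause, recall would still be complete but precision would drop to 0.5.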
Benchmark Characteristics
The scoring requires full recall: missing any true root cause yields zero, while false positives lower precision. Running every model on the same Stirrup harness, with the same turn limit and number of runs, keeps the results reproducible and the comparison fair.
Cost‑Performance Trade‑off
Gemma 4 31B achieves 37% at $0.14 per question, outperforming Gemini 3.1 Pro Preview (30% at $1.23). Claude Opus 4.7 leads in score, but at $5.38 per question it costs roughly four times as much as GLM‑5.1 ($1.23) and about 38 times as much as Gemma 4 31B.
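The trade‑off can be made concrete with a little arithmetic on the figures listed under Cost per Question. The sketch below uses only the scores and prices quoted in this article. "Cost per score point" is an illustrative metric chosen here, not one the benchmark defines.

```python
# (score in percent, cost in USD per question), as quoted in this article
results = {
    "Claude Opus 4.7": (47, 5.38),
    "GLM-5.1": (40, 1.23),
    "Gemini 3.5 Flash": (40, 1.70),
    "Gemma 4 31B": (37, 0.14),
}

def cost_per_point(model: str) -> float:
    """USD spent per question for each percentage point of score."""
    score, cost = results[model]
    return cost / score

def cost_ratio(model_a: str, model_b: str) -> float:
    """How many times more model_a costs per question than model_b."""
    return results[model_a][1] / results[model_b][1]

# Rank models from most to least cost-efficient under this metric.
ranking = sorted(results, key=cost_per_point)
for model in ranking:
    print(f"{model}: ${cost_per_point(model):.4f} per score point")

print(f"Opus vs GLM-5.1: {cost_ratio('Claude Opus 4.7', 'GLM-5.1'):.1f}x")
print(f"Opus vs Gemma 4 31B: {cost_ratio('Claude Opus 4.7', 'Gemma 4 31B'):.1f}x")
```

On these numbers Gemma 4 31B is the most cost‑efficient entry by a wide margin, while Claude Opus 4.7 buys its 7 to 10 point lead over the cheaper models at a several‑fold higher price.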
Future Extensions
ITBench-AA will add FinOps (cloud cost optimization) and CISO (security compliance) scenarios, covering the core spectrum of enterprise IT management.
References
arXiv: https://arxiv.org/abs/2502.05352
GitHub repository: https://github.com/itbench-hub/ITBench
Dataset page: https://huggingface.co/datasets/ArtificialAnalysis/ITBench-AA
Evaluation page: https://artificialanalysis.ai/evaluations/itbench-aa
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
