First Enterprise IT Ops Agent Benchmark Shows Claude Leads with Just 47% Score

The ITBench-AA benchmark, the first evaluation specifically for enterprise IT operations agents, tests 59 SRE scenarios and reveals that even top models like Claude Opus 4.7 achieve only a 47% score, highlighting both the difficulty of the tasks and the cost‑effectiveness gap between proprietary and open‑source agents.

SuanNi

ITBench-AA Overview

ITBench-AA is a benchmark for enterprise‑level IT operations agents, built on IBM’s ITBench dataset. It targets Site Reliability Engineering (SRE) tasks that require understanding of Kubernetes resources, Prometheus alerts, distributed tracing, and micro‑service topologies.

Dataset and Task Definition

59 SRE questions (40 public, 19 hidden to prevent memorization).

Each question provides a Kubernetes fault snapshot containing alerts, events, traces, metrics, logs, and the application topology.

The agent must output the minimal set of independent root‑cause entities (e.g., Deployment, Service, Pod, NetworkPolicy).

Scoring Method

Scoring uses all‑recall average precision. If any true root cause is missing, the question score is zero; otherwise precision = correct entities ÷ total submitted entities. The final score is the average over three repeated runs for each question.
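The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the function names are hypothetical, duplicate submissions are assumed to collapse into one entity, and the overall score is assumed to be the mean of the per-question averages.

```python
from statistics import mean


def question_score(predicted, truth):
    """Score one run: zero unless every true root cause was submitted."""
    predicted, truth = set(predicted), set(truth)
    # All-recall gate: any missing true root cause zeroes the run.
    if not predicted or not truth <= predicted:
        return 0.0
    # Precision = correct entities / total submitted entities.
    return len(predicted & truth) / len(predicted)


def benchmark_score(runs_by_question, truth_by_question):
    """Average the repeated runs per question, then average across questions."""
    per_question = [
        mean(question_score(run, truth_by_question[q]) for run in runs)
        for q, runs in runs_by_question.items()
    ]
    return mean(per_question)


truth = {"q1": ["otel-demo/NetworkPolicy/frontend-block-all-ports"]}
runs = {
    "q1": [
        # Exact answer: scores 1.0
        ["otel-demo/NetworkPolicy/frontend-block-all-ports"],
        # One extra (hypothetical) entity: scores 0.5
        ["otel-demo/NetworkPolicy/frontend-block-all-ports",
         "otel-demo/Deployment/frontend"],
        # Root cause missing: scores 0.0
        ["otel-demo/Deployment/frontend"],
    ]
}
print(benchmark_score(runs, truth))  # 0.5
```

The example shows why over-reporting is costly: a single uncertain extra entity halves the score of an otherwise correct run.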

Evaluation Harness

All models run on the open‑source Stirrup harness. Each task is limited to 100 dialogue turns and executed three times to smooth randomness. Agents interact with a sandboxed file system via shell commands and submit a structured JSON diagnosis.

Results

Claude Opus 4.7 (adaptive reasoning, max effort) – 47%.

GPT‑5.5 – 46%.

Qwen 3.7 Max – 42%.

Open‑source: GLM‑5.1 – 40%; DeepSeek V4 Pro – 38%; Gemma 4 31B – 37%.

Other proprietary models: Gemini 3.5 Flash – 40%; Gemini 3.1 Pro Preview – 30%.

Dialogue Turns vs. Score

GPT‑5.5 averages 31 turns for a 46% score, while Gemini 3.1 Pro Preview averages 83 turns yet scores only 30%. The strict scoring penalizes over‑reporting, so extra uncertain entities reduce precision.

Cost per Question

Claude Opus 4.7 – $5.38 per question.

Gemma 4 31B – $0.14 per question (37% score).

GLM‑5.1 – $1.23 per question (40% score).

Gemini 3.5 Flash – $1.70 per question (40% score).

At these prices, Claude Opus 4.7 costs about 38 times as much per question as Gemma 4 31B, a gap that becomes decisive for large‑scale deployments.
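To make the trade-off concrete, the figures above can be turned into a rough cost-effectiveness ranking. The sketch below is illustrative only: "dollars per score point" is a crude metric chosen for this example rather than one the benchmark reports, and the 10,000-question workload is a hypothetical volume.

```python
# (score in percent, USD per question), taken from the figures listed above.
RESULTS = {
    "Claude Opus 4.7": (47, 5.38),
    "GLM-5.1": (40, 1.23),
    "Gemini 3.5 Flash": (40, 1.70),
    "Gemma 4 31B": (37, 0.14),
}


def cost_per_point(score_pct, usd_per_question):
    """USD spent per question for each percentage point of benchmark score."""
    return usd_per_question / score_pct


def projected_cost(usd_per_question, questions):
    """Total spend for a workload of the given size (hypothetical volume)."""
    return usd_per_question * questions


# Rank models from cheapest to most expensive per score point.
ranked = sorted(RESULTS, key=lambda m: cost_per_point(*RESULTS[m]))

for model in ranked:
    score, usd = RESULTS[model]
    print(
        f"{model:18s} {score}%  ${usd:.2f}/question  "
        f"${cost_per_point(score, usd):.4f}/point  "
        f"${projected_cost(usd, 10_000):,.0f} per 10k questions"
    )
```

Under this metric Gemma 4 31B ranks first and Claude Opus 4.7 last, although the ranking ignores whether a ten-point accuracy difference matters more than cost for a given deployment.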

Example Question

A public question presents a front‑end failure. The agent inspects alerts to locate the time window, uses tracing and logs to narrow to front‑end traffic, maps the affected services in the topology, and discovers a NetworkPolicy named otel-demo/NetworkPolicy/frontend-block-all-ports that blocks traffic. The correct root‑cause set contains this single NetworkPolicy.
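The final submission for this question could look roughly like the sketch below. The article does not give the harness's actual JSON schema, so the `root_causes` key and the helper names are assumptions. The `namespace/Kind/name` identifier layout is inferred from the single NetworkPolicy example above.

```python
import json
from typing import NamedTuple


class Entity(NamedTuple):
    namespace: str
    kind: str
    name: str


def parse_entity(identifier: str) -> Entity:
    """Split an identifier assumed to have the form namespace/Kind/name."""
    parts = identifier.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace/Kind/name, got {identifier!r}")
    return Entity(*parts)


def build_diagnosis(identifiers):
    """Validate and de-duplicate entities, then serialize them as JSON.

    The "root_causes" key is a hypothetical field name for illustration.
    """
    unique = sorted(set(identifiers))
    for identifier in unique:
        parse_entity(identifier)  # raises ValueError on malformed input
    return json.dumps({"root_causes": unique}, indent=2)


print(build_diagnosis(["otel-demo/NetworkPolicy/frontend-block-all-ports"]))
```

Because false positives lower precision, de-duplicating and validating before submission keeps the answer to the single entity the agent is confident about.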

Benchmark Characteristics

The scoring requires full recall: missing any true root cause yields zero, while false positives lower precision. Running every model on the same Stirrup harness, with the same turn limit and three repeated runs, keeps results comparable and reproducible.

Cost‑Performance Trade‑off

Gemma 4 31B achieves 37% at $0.14 per question, outperforming Gemini 3.1 Pro Preview (30% at $1.23) on both score and cost. Claude Opus 4.7 leads in score, but at $5.38 per question it costs more than four times that $1.23 price.

Future Extensions

ITBench-AA will add FinOps (cloud cost optimization) and CISO (security compliance) scenarios, covering the core spectrum of enterprise IT management.

References

arXiv: https://arxiv.org/abs/2502.05352

GitHub repository: https://github.com/itbench-hub/ITBench

Dataset page: https://huggingface.co/datasets/ArtificialAnalysis/ITBench-AA

Evaluation page: https://artificialanalysis.ai/evaluations/itbench-aa
