ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
The first benchmark for agentic enterprise IT tasks (SRE) reveals that frontier models, including GPT-5.5 and Claude Opus 4.7, score below 50% when diagnosing Kubernetes incidents, highlighting a significant gap between AI capabilities and real-world IT operations.
Key Points
- ITBench-AA, the first enterprise-focused agentic benchmark, has launched with an initial focus on Site Reliability Engineering (SRE) tasks.
- All frontier models (both closed and open-source) score below 50% on diagnosing Kubernetes incidents, indicating the task's high difficulty.
- Task complexity is high: models must use shell commands to investigate snapshots containing logs, events, and topology to identify root-cause entities.
- More investigation does not mean higher accuracy: Gemini 3.1 Pro averaged 83 turns yet scored only 30%, likely because over-analysis produced false positives.
- This benchmark provides the first standardized tool for enterprises to evaluate AI's practical capability in critical IT operations.
Analysis
Why This Benchmark Matters Now
Over the past two years, AI Agents have taken the industry by storm, promising to handle everything from coding and analysis to customer service. However, one critical domain has lacked a clear yardstick: enterprise IT operations. Imagine a core application in a large corporation suddenly slowing down or crashing. Site Reliability Engineers (SREs) must act like detectives, sifting through thousands of log lines, monitoring metrics, and system topology maps to pinpoint whether the cause is a server misconfiguration or an exhausted microservice connection pool. The task is complex and demands high accuracy: a minor misstep in diagnosis can lead to incorrect fixes and potentially catastrophic failures. Yet previous AI evaluations have often focused on relatively "clean" programming or Q&A tasks. ITBench-AA fills this gap by placing AI Agents, for the first time, in simulated but realistic enterprise IT failure scenarios, testing whether they can actually "do the job."
What Does This Benchmark Test and How?
ITBench-AA currently focuses on SRE tasks, specifically diagnosing Kubernetes (a mainstream container orchestration system) incidents. It provides 59 different "incident scene snapshots," each simulating a real failure scenario such as resource quota exhaustion, network partitions, or service connection pool depletion. The AI Agent (running in a unified sandbox environment called Stirrup) must act like a human engineer, using shell commands to inspect logs, trace dependencies, and analyze event sequences, ultimately submitting a diagnostic report identifying the "root-cause entities" (e.g., a specific Deployment or Pod).
The scoring mechanism is stringent, using "average precision at full recall." In simple terms, the AI must identify all true root-cause entities without missing any. If it misses even one, it scores zero for that task. If it identifies all correctly, its score is then calculated based on the precision of its submission—how many of the entities it flagged were actually "culprits" (true positives) versus false alarms. This mechanism mirrors the reality of enterprise operations: overlooking a single critical fault point can cause the entire remediation to fail. Ultimately, all frontier models scored below 50%, with Claude Opus 4.7 leading at 47% and GPT-5.5 at 46%. This demonstrates that even the most powerful AIs struggle with tasks requiring deep reasoning, multi-step investigation, and precise causal attribution.
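The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration of the simplified description in this article (zero if any true root cause is missed, otherwise precision), not the benchmark's official implementation; the function names and the entity names in the usage example are hypothetical.

```python
def score_task(submitted: set[str], true_root_causes: set[str]) -> float:
    """Score one incident following the simplified rule described in the text.

    Assumption: entities are compared as plain strings; the real benchmark
    may match and rank entities differently.
    """
    if not true_root_causes:
        raise ValueError("each task needs at least one true root-cause entity")
    # Full recall is required: missing even one true root cause scores zero.
    if not true_root_causes <= submitted:
        return 0.0
    # All true entities were found, so true positives == len(true_root_causes).
    return len(true_root_causes) / len(submitted)


def benchmark_score(task_scores: list[float]) -> float:
    """Average the per-task scores into a single benchmark score."""
    return sum(task_scores) / len(task_scores)


# Hypothetical entity names, for illustration only.
truth = {"deployment/checkout"}
exact = score_task({"deployment/checkout"}, truth)                    # 1.0
noisy = score_task({"deployment/checkout", "pod/cart-7f9"}, truth)    # 0.5
missed = score_task({"pod/cart-7f9"}, truth)                          # 0.0
print(benchmark_score([exact, noisy, missed]))                        # 0.5
```

Under this rule, a single false alarm added to a correct answer halves the score for that task, and a single missed entity zeroes it, which is why submitting extra suspects is costly.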
Trend Insights: Three Deeper Trends Revealed
First, the gap between "lab capabilities" and "battlefield capabilities." Models score highly on standardized programming tests (like Terminal-Bench) but fall short on ITBench-AA. This indicates that handling structured, well-bounded programming problems is fundamentally different from tackling chaotic, multi-variable real-world problems that demand causal reasoning. Enterprise IT operations are a prime example of the latter. For AI to become a true productivity tool, it must bridge this gap.
Second, "brute-force investigation" does not equal "intelligent diagnosis." An interesting finding is that taking more "turns" (investigation steps) does not translate into a higher score. Gemini 3.1 Pro averaged 83 turns but scored only 30%, while GPT-5.5 used 31 turns and scored 46%. This suggests that, as with human experts, efficient diagnosis relies on precise reasoning and hypothesis validation, not aimless information gathering. Models that over-investigate tend to mistake irrelevant "upstream fault-injection mechanisms" or "co-occurring symptoms" for root causes, generating many false positives. The core capability of future AI ops agents may therefore lie in "reasoning quality" rather than "action quantity."
Third, open-source models are competitive in specific domains. In the open-source camp, GLM-5.1 (Reasoning version) achieved 40%, on par with Gemini 3.5 Flash (high-effort) and ahead of the potentially larger Gemini 3.1 Pro Preview. This suggests that in vertical domains, open-source models with careful tuning or stronger reasoning abilities can compete head-to-head with closed-source giants, offering enterprises more diverse and potentially more controllable options.
Practical Value: What Does This Mean for IT Practitioners and AI Developers?
For enterprise IT leaders and SRE teams, ITBench-AA provides a sobering perspective: at this stage, do not expect a general-purpose AI Agent to fully replace human engineers for fault diagnosis. However, it can serve as a powerful "driver-assist" tool for preliminary log screening and pattern recognition, or as a simulation environment for training new hires. When procuring or evaluating AI ops tools, asking how they score on rigorous tests like ITBench-AA is far more reliable than relying on vendor marketing materials.
For AI developers and researchers, this benchmark points to the next frontier: how to enable Agents to better perform causal reasoning, multi-source information fusion (logs, metrics, topology), and precise entity attribution. Simultaneously, it provides a valuable, near-real testing environment for iterating and improving models. The unified Stirrup testing framework ensures that different models are evaluated under the same standard, promoting fairness in assessment.
Counter-Intuitive Insight: The Most Expensive or Longest-Thinking Model Isn't Always the Best
We often assume that more powerful models (or those given more "thinking time") should perform better. The ITBench-AA results challenge this intuition. Gemini 3.1 Pro, a preview model, conducted the longest investigation (83 turns) but scored only 30%. This suggests that in complex problem-solving, "less but refined" thinking can be more effective than "more but scattered" attempts. In an ops context, AI "hallucinations" or "over-association" show up directly as false positives, which can be highly detrimental. Therefore, when evaluating Agents in the future, "diagnostic efficiency" and "robustness against interference" may become metrics as important as accuracy.
Analysis generated by BitByAI