ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
ITBench-AA, the first benchmark for agentic enterprise IT tasks in site reliability engineering (SRE), shows that frontier models, including GPT-5.5 and Claude Opus 4.7, score below 50% when diagnosing Kubernetes incidents, pointing to a significant gap between current AI capabilities and the demands of real-world IT operations.
Hugging Face Blog · May 28, 2026