ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
IBM and Artificial Analysis have released ITBench-AA, the first benchmark for agentic enterprise IT tasks. Top models such as GPT-5.5 and Claude Opus 4.7 score below 50% on Kubernetes incident diagnosis, which points to a significant gap between current AI agents and complex, real-world enterprise scenarios.
Hugging Face Blog