Tag: Fault Diagnosis (故障诊断)

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

IBM and Artificial Analysis have released the first benchmark for agentic enterprise IT tasks. Top models such as GPT-5.5 and Claude Opus 4.7 score below 50% on Kubernetes incident diagnosis, which points to a significant gap in what AI can do in complex, real-world enterprise scenarios.

Hugging Face Blog

Tag: Fault Diagnosis (故障诊断) (1 article)
