Tag: Benchmarking (7 articles)

AI evals are becoming the new compute bottleneck

AI evaluation costs are rising sharply: a single agent benchmark run can cost tens of thousands of dollars, and the inherent complexity of these evaluations makes them hard to compress, creating a new compute bottleneck for AI development.

Hugging Face Blog · Apr 30, 2026

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM and Hugging Face introduce the VAKRA benchmark, which shows that current AI agents perform poorly on complex multi-step tasks. Key failure modes include tool-chain planning, parameter passing, and error recovery.

Hugging Face Blog · Apr 15, 2026

LlamaIndex Newsletter 2026-04-14

LlamaIndex launches ParseBench, the first OCR benchmark for AI agents, and demonstrates breakthroughs in structured document understanding and multimodal reasoning, signaling a shift from text extraction to deep semantic comprehension.

LlamaIndex Blog

LlamaIndex Newsletter 2026-04-21

LlamaIndex launches ParseBench, the first document OCR benchmark for AI agents, alongside new parsing tools and benchmark results, marking a shift towards quantifiable document intelligence.

LlamaIndex Blog