Tag: Benchmarks (9 articles)

Patterns for Building Cybersecurity Evals

This article breaks down the four core components of cybersecurity evaluations and introduces multi-level tasks for more granular measurement of AI's offensive and defensive capabilities.

Eugene Yan · Jun 21, 2026

MosaicLeaks: Can your research agent keep a secret?

Deep research agents combining internal and web data leak secrets through query logs; a new benchmark and privacy-aware RL training provide metrics and solutions.

Hugging Face Blog · Jun 19, 2026

AI evals are becoming the new compute bottleneck

AI evaluation costs are skyrocketing: a single agent benchmark run can cost tens of thousands of dollars, and the inherent complexity of these evals makes them hard to compress, creating a new compute bottleneck for AI development.

Hugging Face Blog · Apr 30, 2026

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM and Hugging Face introduce the VAKRA benchmark, revealing that current AI agents perform poorly on complex multi-step tasks, with key failure modes including tool-chain planning, parameter passing, and error recovery.

Hugging Face Blog · Apr 15, 2026

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Hugging Face introduces private speech datasets to prevent 'benchmaxxing' on public test sets, aiming to make the ASR leaderboard a more truthful reflection of real-world model robustness.

Hugging Face Blog

Introducing ParseBench: The First Document Parsing Benchmark for AI Agents

LlamaIndex releases ParseBench, the first document parsing benchmark for AI agents, which evaluates parsers across five dimensions, including tables and charts. The results show that no single method excels at everything, with LlamaParse Agentic showing the most balanced performance.

LlamaIndex Blog

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

IBM and Artificial Analysis release the first benchmark for agentic enterprise IT tasks, showing that top models like GPT-5.5 and Claude Opus 4.7 score below 50% on Kubernetes incident diagnosis and highlighting the significant gap AI still faces in complex, real-world enterprise scenarios.

Hugging Face Blog

LlamaIndex Newsletter 2026-04-14

LlamaIndex launches ParseBench, the first OCR benchmark for AI agents, and demonstrates breakthroughs in structured document understanding and multimodal reasoning, signaling a shift from text extraction to deep semantic comprehension.

LlamaIndex Blog

LlamaIndex Newsletter 2026-04-21

LlamaIndex launches ParseBench, the first document OCR benchmark for AI agents, alongside new parsing tools and benchmark results, marking a shift towards quantifiable document intelligence.

LlamaIndex Blog