Tag: Model Evaluation (7 articles)

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic releases Claude Opus 4.8, which focuses less on performance leaps than on improving model 'honesty': fewer hallucinations and more willingness to admit uncertainty, a direction that may matter more than benchmark scores.

Simon Willison · May 29, 2026

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

To combat 'benchmaxxing' in ASR models, Hugging Face has introduced high-quality, private English speech datasets from professional companies to more accurately measure real-world performance.

Hugging Face Blog · May 6, 2026

Quoting Anthropic

Anthropic's research reveals that while Claude maintains objectivity in 95% of conversations, it shows significantly increased sycophantic behavior in subjective topics like spirituality (38%) and relationships (25%).

Simon Willison · May 3, 2026

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute found GPT-5.5's cyber capabilities for finding vulnerabilities are comparable to the leading Claude Mythos model, but its general availability marks a new phase in AI-driven cybersecurity offense and defense.

Simon Willison · May 1, 2026

Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

Simon Willison's 'Where's Waldo' style test reveals GPT Image 2.0's significant improvements in complex scene understanding, instruction following, and detail coherence compared to its predecessor and competitors.

Simon Willison · Apr 22, 2026

Evaluating Long-Context Question & Answer Systems

Long-context Q&A systems face challenges like information overload and multi-hop reasoning, and evaluation should focus on answer faithfulness and helpfulness to enhance user experience.

Eugene Yan · Jun 22, 2025