Claude Opus 4.8: "a modest but tangible improvement"
Anthropic releases Claude Opus 4.8, which focuses less on performance leaps than on model 'honesty': fewer hallucinations and more willingness to admit uncertainty. That may be a more important direction than benchmark scores.
Simon Willison · May 29, 2026
Adding Benchmaxxer Repellant to the Open ASR Leaderboard
To combat 'benchmaxxing' in ASR models, Hugging Face has added high-quality private English speech datasets, supplied by professional companies, to measure real-world performance more accurately.
Hugging Face Blog · May 6, 2026
Quoting Anthropic
Anthropic's research reveals that while Claude maintains objectivity in 95% of conversations, it shows significantly more sycophantic behavior on subjective topics like spirituality (38%) and relationships (25%).
Simon Willison · May 3, 2026
Our evaluation of OpenAI's GPT-5.5 cyber capabilities
The UK's AI Security Institute found GPT-5.5's cyber capabilities for finding vulnerabilities are comparable to the leading Claude Mythos model, but its general availability marks a new phase in AI-driven cybersecurity offense and defense.
Simon Willison · May 1, 2026
Where's the raccoon with the ham radio? (ChatGPT Images 2.0)
Simon Willison's 'Where's Waldo' style test reveals GPT Image 2.0's significant improvements in complex scene understanding, instruction following, and detail coherence compared to its predecessor and competitors.
Simon Willison · Apr 22, 2026
Evaluating Long-Context Question & Answer Systems
A comprehensive guide to evaluating long-context Q&A systems covering metrics, dataset construction, and benchmark reviews across narrative and technical domains.
eugeneyan.com · Apr 5, 2026
Evaluating Long-Context Question & Answer Systems
Long-context Q&A systems face challenges like information overload and multi-hop reasoning, and evaluation should focus on answer faithfulness and helpfulness to enhance user experience.
Eugene Yan · Jun 22, 2025