The last six months in LLMs in five minutes
Simon Willison uses his 'pelican riding a bicycle' test to recap how the 'best model' crown changed hands five times among three major providers in six months, a sign that the industry has entered a rapid-iteration arms race.
Simon Willison · May 19, 2026
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
Simon Willison's 'pelican riding a bicycle' benchmark shows a smaller Alibaba Qwen3.6 model, run locally, outperforming the much larger cloud-hosted Claude Opus 4.7 at creative SVG generation, which points to the potential of open-source models on specific tasks.
Simon Willison · Apr 17, 2026
Introducing Claude Opus 4.8
Anthropic releases Claude Opus 4.8, whose main advances are in the reliability, judgment, and long-running consistency of agent tasks, marking a practical shift for AI from 'usable' to 'trustworthy'.
Anthropic News ·