Tag: Model evaluation (3 articles)

The last six months in LLMs in five minutes

Simon Willison uses his 'pelican riding a bicycle' test to recap how the 'best model' crown changed hands five times among three major providers in six months, a sign that the industry has entered a rapid-iteration arms race.

Simon Willison · May 19, 2026

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Simon Willison's 'pelican riding a bicycle' benchmark shows a smaller Alibaba Qwen3.6 model, run locally, outperforming the much larger cloud-hosted Claude Opus 4.7 at creative SVG generation, a sign of how capable open-source models can be on specific tasks.

Simon Willison · Apr 17, 2026

Introducing Claude Opus 4.8

Anthropic releases Claude Opus 4.8, with its main gains in the reliability, judgment, and long-running consistency of agent tasks, marking a practical shift for AI from 'usable' to 'trustworthy'.

Anthropic News ·