The last six months in LLMs in five minutes

Why This Five-Minute Talk Matters

In the AI field, information overload is the norm. When Simon Willison, a developer and blogger known for his pragmatism and insight, summarized the last six months of LLM development in five minutes at PyCon US 2026, he did something valuable: he drew a clear snapshot of the ongoing arms race. This isn't just a news recap; it's a seasoned observer's reading of the industry's pulse. The time window he chose (from November 2025 to now) happened to capture a critical "inflection point," making this talk an excellent entry point for understanding the current AI landscape.

The Pelican, the Bicycle, and the Shifting Throne

Willison didn't list dry parameters or benchmark scores. Instead, he continued with his famous "pelican riding a bicycle" test. The brilliance of this test lies in its absurdity: pelicans are hard to draw, bicycles are hard to draw, and pelicans can't ride bicycles. The task is ridiculous enough that no AI lab is likely to train a model for it specifically, so it probes a model's generalization and creative generation rather than its preparation for a known benchmark. By showcasing the SVG pelicans generated by different models (Claude Sonnet 4.5, GPT-5.1, Gemini 3, etc.), he vividly illustrates the core phenomenon: the title of "best model" changed hands five times among the three major players (Anthropic, OpenAI, and Google) in just six months. This is no longer about one company maintaining a sustained lead; it's a fierce competition in which the players take turns on the throne. Each shift signifies tangible progress in key capabilities like code generation, logical reasoning, or instruction following.

Trend Insight: From "Release Cycles" to an "Arms Race"

This reveals a deeper trend: the competition for frontier LLMs has shifted from "R&D breakthroughs" to an engineering-driven, rapid-iteration arms race. In the past, major model releases were big events spaced years apart. Now, the competitive rhythm has compressed to a monthly or even weekly basis.
November 2025 became an "inflection point" precisely because high-frequency, targeted model updates (like OpenAI's Codex Max, optimized for coding) became the norm. The essence of this race is that capability benchmarks are becoming "fleeting." The model capabilities that amaze you today might be surpassed by a competitor next month. For developers and businesses, the takeaway is clear: don't become overly attached to the "currently strongest" label of any single model, because its shelf life is extremely short. The industry's focus is shifting from "who's number one" to "who can most quickly turn the latest capabilities into stable, usable productivity."

Practical Value: How Should Developers Respond?

First, building a model-agnostic abstraction layer is more important than ever. Your application architecture should allow easy switching of underlying models so you can quickly benefit from the next wave of capability upgrades. Second, focus on domain-specific advances. Willison specifically notes that coding was a key battlefield in this inflection point. If you're a developer, you should spend real hands-on time with the latest code models from each provider (like Claude Opus, GPT-5.1 Codex Max); they might already solve tasks you thought impossible six months ago. Finally, adopting fun tests like "pelican riding a bicycle" can help you and your team understand the "style" and capability boundaries of different models in an intuitive, non-technical way, which is more tangible than just looking at benchmark scores.

Counterintuitive Angles

One angle that might be overlooked is that this high-speed competition could cause the very definition of "best" to break down. When model capabilities are neck-and-neck and the lead trades hands rapidly, "best" increasingly depends on the specific task, the prompt, or even the user's subjective preference (what Willison calls "vibes"). This means that for most applications, chasing the absolute "strongest model" might be a false premise. Stability, cost, speed, and fit with your own workflow are rapidly gaining importance.
This race has no permanent winners, only a constantly moving finish line.
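
As a rough illustration of the model-agnostic abstraction layer suggested under Practical Value, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `ModelRouter`, `make_stub`, the `provider-a` and `provider-b` names, and the prompt wording are hypothetical, and the stub backends return placeholder SVG instead of calling any real provider SDK. The idea it shows is only that application code asks a router for a completion by name, so swapping the underlying model is a configuration change, and the same prompt (such as the pelican test) can be run against every registered model side by side.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A backend is any callable mapping a prompt string to a response string.
# In a real application each one would wrap a provider's SDK call; that
# wiring is deliberately left out of this sketch.
Backend = Callable[[str], str]


@dataclass
class ModelRouter:
    """Hypothetical model-agnostic layer: callers name a model, not a vendor SDK."""

    backends: Dict[str, Backend]
    default: str

    def complete(self, prompt: str, model: Optional[str] = None) -> str:
        """Send the prompt to the named backend, or to the default one."""
        name = model or self.default
        if name not in self.backends:
            raise KeyError(f"unknown model: {name}")
        return self.backends[name](prompt)

    def compare(self, prompt: str) -> Dict[str, str]:
        """Run one prompt against every registered backend, side by side."""
        return {name: backend(prompt) for name, backend in self.backends.items()}


def make_stub(label: str) -> Backend:
    """Build a placeholder backend that returns a tiny labeled SVG (no network)."""
    def backend(prompt: str) -> str:
        return f"<svg><title>{label}: {prompt}</title></svg>"
    return backend


# Placeholder names, not real model identifiers.
router = ModelRouter(
    backends={
        "provider-a": make_stub("provider-a"),
        "provider-b": make_stub("provider-b"),
    },
    default="provider-a",
)

# Illustrative wording for the pelican test prompt.
PELICAN_PROMPT = "Generate an SVG of a pelican riding a bicycle"

if __name__ == "__main__":
    # Print each backend's answer to the same prompt for a quick visual check.
    for name, svg in router.compare(PELICAN_PROMPT).items():
        print(name, svg)
```

With this shape, moving to whichever model takes the lead next month means registering one more backend and changing `default`; the calling code stays untouched.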