Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
In Simon Willison's well-known 'pelican riding a bicycle' benchmark, a smaller Alibaba Qwen3.6 model running locally outperformed the much larger, cloud-based Claude Opus 4.7 at creative SVG generation, a result that points to the potential of open-source models on specific tasks.
- Simon Willison's 'pelican riding a bicycle' is a popular, informal test for AI models' visual understanding and generation capabilities.
- A locally-run, 20.9GB quantized Qwen3.6-35B-A3B model on a MacBook outperformed Anthropic's latest cloud-based giant, Claude Opus 4.7, in generating an SVG of a pelican on a bicycle.
- In a follow-up 'flamingo riding a unicycle' test, the Qwen model again showed superior creativity and detail (e.g., adding sunglasses, a bowtie), while Opus's output was comparatively bland.
- This result challenges the assumption that 'bigger models and the cloud are always stronger,' highlighting the competitiveness of open-source, locally-deployable models on specific creative tasks.
The Origin: Why Does a "Silly" Test Spark Discussion Again?
Simon Willison's "pelican riding a bicycle" test has become a meme in the AI community. It is a seemingly simple SVG generation task that probes a model's "common sense" about spatial layout, along with its creativity and attention to detail. This time, he compared outputs from Alibaba's latest open-source model, Qwen3.6-35B-A3B, and Anthropic's flagship, Claude Opus 4.7.
The result was surprising. The quantized Qwen model, a 20.9GB file running locally on his MacBook Pro M5 via LM Studio, produced an image with a correctly shaped bicycle frame, clouds in the sky, a prominent pelican pouch, and even a charmingly goofy caption. Opus 4.7's image, in contrast, had a completely misshapen bicycle frame, no clouds, and a less pronounced pouch. A second attempt with Opus's "max thinking" mode enabled still fell short.
This immediately raises a pointed question: how can an open-source model that fits on your laptop outperform one of the top-tier cloud-based models on a creative task?
Breakdown: It's Not About Model Size, But Task Fitness
First, we need to dispel a myth: AI model capability is not a simple "bigger is better" equation. Claude Opus 4.7 is undoubtedly an extremely powerful general-purpose model that excels at complex reasoning, long-context understanding, and instruction following. However, the "pelican riding a bicycle" test is, at its core, a highly specialized task: turning a visual concept into structured graphics (SVG). It doesn't require encyclopedic world knowledge or multi-step logical reasoning. Instead, it demands that the model 1) accurately understand the typical visual features of a "pelican" and a "bicycle," and 2) translate those features into paths, shapes, and attributes in SVG code.
Qwen3.6-35B-A3B has far fewer total parameters (35B) than Opus (estimated at hundreds of billions), and because it uses a Mixture-of-Experts (MoE) architecture, only about 3B of those parameters are active at a time.
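To put those size figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes "20.9GB" means decimal gigabytes, treats 35B and 3B as exact counts (they are rounded), and ignores file metadata and any tensors kept at higher precision. The function names are illustrative and not taken from any tool.

```python
# Figures quoted in the article; both parameter counts are rounded.
TOTAL_PARAMS = 35e9        # "35B" total parameters
ACTIVE_PARAMS = 3e9        # "A3B": parameters active at a time
FILE_SIZE_BYTES = 20.9e9   # "20.9GB", assumed here to mean decimal gigabytes


def bits_per_weight(file_bytes: float, params: float) -> float:
    """Average storage per parameter, ignoring metadata and mixed precision."""
    return file_bytes * 8 / params


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that take part in computing one token."""
    return active / total


def size_at_16_bit_gb(params: float) -> float:
    """File size in decimal GB if every weight were stored in 16 bits."""
    return params * 2 / 1e9


if __name__ == "__main__":
    print(f"{bits_per_weight(FILE_SIZE_BYTES, TOTAL_PARAMS):.2f} bits per weight")  # 4.78
    print(f"{active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%} of weights active")  # 8.6%
    print(f"{size_at_16_bit_gb(TOTAL_PARAMS):.0f} GB at 16 bits per weight")        # 70
```

Under these assumptions the file averages a little under 5 bits per weight, against roughly 70 GB for an unquantized 16-bit copy. The file size tracks the total parameter count because every expert has to be available in memory. The active count mainly affects how much computation each token needs.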
With so few parameters active during inference, the model behaves more like an efficient "specialized task force" than a "giant committee" that has to coordinate a massive internal system. For concrete tasks with clear visual paradigms, such as a "pelican" or a "flamingo on a unicycle," this "task force" might produce more stable and creative results, possibly because its training data is more focused and its architecture more efficient. Opus's failure exposes the "bluntness" that even colossal general-purpose models can exhibit when tackling certain "small" tasks.
Trend Insight: Open-Source Models Are Defining Their Own Battlegrounds
This incident reveals a deeper trend: AI competition is shifting from "parameter hegemony" to "scenario efficacy." In the past, we assumed the most powerful AI had to live in the cloud, monopolized by giants. But models like Qwen3.6 demonstrate that, through architectural innovations (like MoE) and sophisticated quantization techniques, the open-source community can deliver a model that performs well enough, protects data privacy, and runs smoothly on consumer hardware. When such a local model performs better or more reliably than the cloud giants on a specific task (like generating SVGs in a particular style, writing code in a specific format, or processing local documents), the calculus for users and developers changes. Cloud models offer "all-powerful but general" capabilities, while local, open-source models can pursue "specialized and controllable" experiences. In the future, we may see more "small but beautiful" models defeating "large and all-encompassing" giants in niche domains, much like a smartphone camera surpassing a professional DSLR in a specific function.
Practical Value and a Counter-Intuitive Angle
For developers and product builders, the practical value of this case is clear: don't blindly trust the largest model.
If a core function in your product (such as content moderation, generating a specific format, or simple Q&A) is highly repetitive and well-defined, replacing expensive cloud API calls with a smaller, faster, locally deployable open-source model could yield large benefits in cost, latency, and privacy. As Simon did, design a "pelican test" for the critical tasks in your product and use it to evaluate how different models actually perform in your scenario.
A counter-intuitive angle that most people might overlook: sometimes a "dumber" model is an advantage. Super-large models like Opus, having undergone extremely complex alignment and safety training, may have developed certain "cognitive ruts." When generating content that seems absurd or wildly imaginative (like giving a pelican a goofy pouch), they might "overthink" and become overly conservative and "correct," losing that raw creative spark. Smaller models, with fewer constraints, can sometimes produce more interesting and unexpected results. This suggests that when evaluating models, a metric like "creative entropy" or "surprise factor" could become an important new dimension alongside accuracy.
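A "pelican test" for your own product can start as a very small harness that sends one prompt to several models and collects the replies for comparison. The sketch below is a minimal illustration, not Simon's method (he judges the images by eye). The names `inspect_svg`, `run_task_test`, and `SHAPE_TAGS` are hypothetical, the "models" are stand-in functions returning canned text, and the structural checks only filter out unusable output before a human looks at the rest.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Hypothetical choice of which SVG elements count as "drawing something".
SHAPE_TAGS = {"path", "circle", "ellipse", "rect", "line",
              "polygon", "polyline", "text"}


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://www.w3.org/2000/svg}'."""
    return tag.rsplit("}", 1)[-1]


def inspect_svg(reply: str) -> dict:
    """Run cheap structural checks on a model reply.

    These checks only catch unusable output. They say nothing about
    whether the drawing is any good.
    """
    # Replies often wrap the SVG in prose, so pull out the first <svg> element.
    match = re.search(r"<svg\b.*?</svg>", reply, re.DOTALL)
    if match is None:
        return {"found_svg": False, "well_formed": False, "shape_count": 0}
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        return {"found_svg": True, "well_formed": False, "shape_count": 0}
    shapes = sum(1 for el in root.iter() if local_name(el.tag) in SHAPE_TAGS)
    return {"found_svg": True, "well_formed": True, "shape_count": shapes}


def run_task_test(prompt: str,
                  models: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Send the same prompt to every model and collect a structural report."""
    return {name: inspect_svg(generate(prompt))
            for name, generate in models.items()}


if __name__ == "__main__":
    # Stand-in models; real ones would call a local server or a cloud API.
    canned_svg = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
                  '<circle cx="30" cy="70" r="15"/><circle cx="70" cy="70" r="15"/>'
                  '<path d="M30 70 L50 40 L70 70"/></svg>')
    models = {
        "local-model": lambda prompt: "Here is your drawing:\n" + canned_svg,
        "cloud-model": lambda prompt: "<svg><circle></svg>",  # malformed on purpose
    }
    prompt = "Generate an SVG of a pelican riding a bicycle"
    for name, report in run_task_test(prompt, models).items():
        print(name, report)
```

Swapping the stand-in lambdas for real calls keeps the rest unchanged. The useful part is that every candidate sees the same task-specific prompt and the outputs land side by side for review.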
Analysis by BitByAI