Better Models: Worse Tools

Newer Claude models make more mistakes than older ones when calling third-party edit tools. The likely cause is that Anthropic over-trained them on Claude Code's own tool syntax, which degrades general tool-use ability and points to platform lock-in risks in AI training.

ai-agent · Large Language Models · Tool Calling · Developer Tools · Platform Lock-in · Coding Agents

KEY POINTS

Newer Claude models (Opus 4.8, Sonnet 5) invent extra fields when calling Pi's custom edit tool, while older models do not
Likely cause: Anthropic fine-tuned models via RL specifically for Claude Code's own edit tool, causing overfitting to that schema
Third-party coding harnesses like Pi suffer, potentially forcing developers to adopt model-specific tool definitions
This reveals a deeper tension where model improvement can sacrifice generality, creating hidden platform lock-in

ANALYSIS

The Trigger: A Counterintuitive Discovery

Simon Willison shared a finding from Armin Ronacher. While hacking on his AI coding harness Pi, Ronacher noticed a baffling issue: newer Claude models (Opus 4.8 and Sonnet 5) were inventing extra fields when calling Pi’s custom edit tool, which made the tool call fail. Older models, even the smaller Haiku, didn’t make this mistake.

It’s completely counterintuitive: aren’t newer models supposed to be universally better? How can they regress on a specific tool?

The Breakdown: Why Are ‘Better’ Models Worse at Following Tool Schemas?

Let’s rewind. AI models interact with external systems through tool calls. Developers define a schema (e.g., an edit tool expects path, old_text, new_text) and the model generates JSON matching that schema. Historically, tool-use ability was generic: you described the tool, and the model faithfully produced the required parameters.

But Armin found that the new Claude models were adding fields like explanation or reasoning, neither of which appears in Pi’s tool definition. It’s as if the models were operating on autopilot, outputting JSON in a style they had learned elsewhere and ignoring the specification actually in front of them.
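
To make the failure mode concrete, here is a minimal sketch of strict argument checking in Python. It assumes a hypothetical edit tool with the three example fields named above (path, old_text, new_text); the names EDIT_TOOL_FIELDS and validate_edit_call are illustrative and are not taken from Pi’s actual code.

```python
import json

# Hypothetical schema: the three example fields mentioned in the text.
EDIT_TOOL_FIELDS = {"path", "old_text", "new_text"}


def validate_edit_call(raw_arguments: str) -> dict:
    """Parse a tool call's JSON arguments; reject unknown or missing fields."""
    args = json.loads(raw_arguments)
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    extra = sorted(set(args) - EDIT_TOOL_FIELDS)
    missing = sorted(EDIT_TOOL_FIELDS - set(args))
    if extra:
        raise ValueError(f"unexpected fields: {extra}")
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return args


if __name__ == "__main__":
    good = '{"path": "a.py", "old_text": "x", "new_text": "y"}'
    bad = '{"path": "a.py", "old_text": "x", "new_text": "y", "explanation": "rename"}'
    print(validate_edit_call(good))
    try:
        validate_edit_call(bad)
    except ValueError as err:
        # The invented "explanation" field is what makes this call fail.
        print("rejected:", err)
```

A harness that validates this strictly will fail on any invented field, which matches the behavior described here. A more lenient harness could drop unknown fields instead, at the cost of hiding the problem.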

Armin’s hypothesis makes sense: Anthropic likely fine-tuned these models (via reinforcement learning) to excel at Claude Code’s built-in edit tool, which uses its own proprietary fields like search and replace. So when the model encounters a different tool that is also named “edit,” it falls back on the patterns it learned on its home platform. In essence, the model has become dialect-specific: it has drilled “Claude Code-ese” so heavily that its command of the general-purpose “standard language” of tool use has slipped.

Trend Insight: This Isn’t Just a Bug—It’s a Sign of Platform Lock-in

This incident may point to a larger shift: AI models are moving from general-purpose agents to platform-optimized helpers. History repeats itself. Remember when Internet Explorer’s dominance, helped by its bundling with Windows, led many websites to optimize only for IE and break in other browsers? Similarly, AI providers now have strong incentives to make their models perform best on their own platforms, using bespoke training and tool formats to create a moat.

For developers, this means you can’t simply swap one model for another. If you tune your tool descriptions for Claude, Gemini might perform poorly, and vice versa. Tools like Pi might even need to maintain multiple tool definitions and switch between them based on the model in use. This increases maintenance overhead and erodes the dream of model interchangeability.
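
As a rough sketch of what maintaining multiple tool definitions could involve, the Python below picks a field-name “dialect” by model id and translates the model’s arguments back to the harness’s canonical names. Everything here is an assumption for illustration: the "family-a-*" pattern, the alternate field names, and the function names are placeholders, not a claim about what any real model prefers or how Pi is built.

```python
from fnmatch import fnmatchcase

# Canonical field names the harness uses internally (the example schema).
CANONICAL_FIELDS = ("path", "old_text", "new_text")

# Hypothetical dialects: glob pattern -> {canonical name: name shown to model}.
# Patterns and alternate names are placeholders for illustration only.
DIALECTS = [
    ("family-a-*", {"path": "file_path",
                    "old_text": "old_string",
                    "new_text": "new_string"}),
]


def dialect_for(model_id: str) -> dict[str, str]:
    """Return the canonical -> exposed field mapping for a model id."""
    for pattern, mapping in DIALECTS:
        if fnmatchcase(model_id, pattern):
            return mapping
    # Default: expose the canonical names unchanged.
    return {name: name for name in CANONICAL_FIELDS}


def tool_definition_for(model_id: str) -> dict:
    """Build the edit tool definition to send to this model."""
    mapping = dialect_for(model_id)
    exposed = [mapping[name] for name in CANONICAL_FIELDS]
    return {
        "name": "edit",
        "parameters": {
            "type": "object",
            "properties": {name: {"type": "string"} for name in exposed},
            "required": exposed,
            "additionalProperties": False,
        },
    }


def to_canonical(model_id: str, args: dict) -> dict:
    """Translate a model's arguments back to canonical field names."""
    mapping = dialect_for(model_id)
    # A missing field raises KeyError, so malformed calls are not hidden.
    return {canonical: args[exposed] for canonical, exposed in mapping.items()}
```

The rest of the harness only ever sees canonical names, so the per-model cost is confined to one mapping table. That table still has to be maintained and re-tested with every model release, which is the overhead described above.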

At a deeper level, the degradation exposes a flaw in current AI training. When we use reinforcement learning to push performance on a specific task, we risk hurting generalization. Reward-driven optimization teaches the model patterns that yield high scores, but if the reward focuses narrowly on in-house formats, the model can lose respect for arbitrary schemas. It’s like a student drilled to solve quadratic equations one way; given a linear equation, they still try to apply the quadratic formula and mess up.

Practical Takeaways: What Should Developers Do Now?

Question the assumption that ‘newer is always better.’ Don’t blindly upgrade. Run regression tests on your actual workflows, especially tool calling and schema-dependent code generation. A model might ace public benchmarks but fail on your specific use case.
Consider model-specific tool adapters. A pragmatic path is to write slightly different tool schemas for each model family, or to strengthen prompts with an instruction such as “strictly follow the tool definition, do not add extra fields,” though RL-trained models may still ignore it. Long term, we might see middleware that translates unified tool calls into each model’s preferred dialect.
Push for tool-call standardization. OpenAI’s function-call format has become a de facto standard, but vendor fine-tuning threatens to fragment it. The community could demand stricter schema-adherence benchmarks, or ask providers to disclose their tool training data so developers know when to trust a model.
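
The regression testing suggested in the first takeaway could be sketched as follows. This is a minimal Python example under stated assumptions: call_model stands for whatever function sends a prompt to a model and returns the arguments of its edit tool call, and fake_call_model is a stub that imitates the reported behavior. The model ids, the stub, and the function names are all hypothetical.

```python
from typing import Callable

# Example schema from the text; a real harness would load its own.
ALLOWED_FIELDS = {"path", "old_text", "new_text"}


def matches_schema(args: dict) -> bool:
    """True only if the arguments contain exactly the allowed fields."""
    return set(args) == ALLOWED_FIELDS


def adherence_rate(call_model: Callable[[str, str], dict],
                   model_id: str, prompts: list[str]) -> float:
    """Fraction of prompts whose edit call matched the schema exactly."""
    if not prompts:
        raise ValueError("need at least one prompt")
    ok = sum(matches_schema(call_model(model_id, p)) for p in prompts)
    return ok / len(prompts)


def is_regression(call_model: Callable[[str, str], dict], baseline: str,
                  candidate: str, prompts: list[str],
                  tolerance: float = 0.0) -> bool:
    """True if the candidate adheres to the schema less often than baseline."""
    base = adherence_rate(call_model, baseline, prompts)
    cand = adherence_rate(call_model, candidate, prompts)
    return cand < base - tolerance


def fake_call_model(model_id: str, prompt: str) -> dict:
    """Stub standing in for a real API call (hypothetical model ids)."""
    args = {"path": "a.py", "old_text": "x", "new_text": "y"}
    if model_id == "new-model":
        args["explanation"] = "invented field"  # imitates the reported bug
    return args


if __name__ == "__main__":
    prompts = ["rename x to y in a.py", "fix the typo in a.py"]
    print(is_regression(fake_call_model, "old-model", "new-model", prompts))
```

In practice the prompts would come from recorded sessions on your own workflows, and the check would run before switching the default model rather than after users report failures.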

The Surprise: Why Older Models Are More Reliable

Many assume model upgrades bring uniform improvements, but this story shows that regressions can hide in the blind spots of benchmarks. Older models, which presumably lack specialized tool training, simply stick to the provided schema because they have no learned “dialect” to fall back on. Newer models, with their extensive platform-specific training, are prone to overstep. It’s like human experts who sometimes stumble on simple tasks because their ingrained habits override careful reading.

Once again, it’s a reminder that there is no silver bullet in AI. Progress always involves trade-offs. The best we can do is stay clear-eyed, skeptical of any single metric, and verify on our own turf.

Analysis by BitByAI · Originally from Simon Willison