Using DSPy to evaluate and improve Datasette Agent's SQL system prompts

Simon Willison used DSPy to automatically evaluate and improve Datasette Agent's SQL prompts, uncovering hidden flaws like column-name guessing and highlighting the shift from manual prompt tuning to scientific iteration.

Prompt engineering · DSPy · Agent evaluation · SQL · Datasette

KEY POINTS

An automated evaluation pipeline (harness + gold dataset + custom metrics) systematically exposes error loops caused by ambiguous instructions in the original prompt
A seemingly sensible piece of advice ('don’t call describe_table if you already have the info') backfired: the model guessed column names and triggered retries. Less is not always more
Giving the model the full schema (including column names) upfront proved more effective than token-saving workarounds, confirming that complete information beats clever instructions
Tools like DSPy are turning prompt engineering from manual trial-and-error into data-driven iteration — future AI application prompts should be CI-tested like code

ANALYSIS

Over the weekend, well-known developer Simon Willison conducted an interesting experiment: he gave the SQL prompt for his own Datasette Agent a “check-up” using the DSPy framework. The result uncovered a head-scratching bug — a well-intentioned piece of advice in the system prompt led the model straight into a ditch.

Origin: the urge to move from manual tuning to automation

Datasette Agent is a read-only SQL question-answering tool built by Simon: users ask questions in natural language, and the AI generates and executes SQL. At its core is a carefully crafted system prompt that tells the model which tables exist in the database and how to use tools. But Simon always had a nagging thought: Is this prompt truly optimal? Are there hidden pitfalls? After seeing a talk on DSPy at the AIE conference, he decided to let AI tune AI’s own prompts.

Dissection: building a “CI for prompts” with DSPy

Simon’s approach is worth emulating: he set up an evaluation harness where DSPy directly invokes the actual tool implementations and prompts of Datasette Agent, connects to an in-memory database, and auto-generates a set of “golden Q&A” pairs as a test suite. Crucially, he defined custom metrics, such as SQL execution success rate and answer accuracy, rather than relying on human judgment.

After one round, the problem surfaced immediately. The original prompt included the line: “If you already got the information from the schema, don’t call the describe_table function.” The intent was good: save tokens and reduce calls. But the schema given in the prompt listed only table names, not column names! So the model, trying to follow the advice, started guessing: this table probably has a column called page_count? That table might have o.order_id? The resulting SQL failed, the model retried, and it got stuck in a loop.

This is like telling an assistant, “The map already shows the roads, so don’t keep asking me for directions,” when the map has no roads drawn on it: the assistant can only wander blindly.
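
The harness idea can be illustrated with a minimal sketch in plain Python. This is not Simon's code and does not use DSPy: the `books` table, the gold examples, and the two stand-in "models" are all hypothetical, chosen only to show how an in-memory database plus two metrics (execution rate and accuracy) makes a guessed column name visible as a number.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a known (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER);
        INSERT INTO books (title, pages) VALUES ('A', 100), ('B', 250), ('C', 300);
        """
    )
    return conn

# Hypothetical gold set: each question is paired with the rows a correct query returns.
GOLD = [
    {"question": "How many books are there?", "expected": [(3,)]},
    {"question": "Which book is the longest?", "expected": [("C",)]},
]

def score_sql(conn, sql, expected):
    """Two metrics per example: did the SQL execute, and did it return the gold rows?"""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return {"executed": False, "correct": False}
    return {"executed": True, "correct": rows == expected}

def evaluate(generate_sql):
    """Run every gold question through generate_sql and aggregate the metrics."""
    conn = make_db()
    results = [
        score_sql(conn, generate_sql(ex["question"]), ex["expected"]) for ex in GOLD
    ]
    n = len(results)
    return {
        "execution_rate": sum(r["executed"] for r in results) / n,
        "accuracy": sum(r["correct"] for r in results) / n,
    }

def guessing_model(question):
    """Stand-in for a model that never saw column names and guesses 'page_count'."""
    if question.startswith("How many"):
        return "SELECT COUNT(*) FROM books"
    return "SELECT title FROM books ORDER BY page_count DESC LIMIT 1"

def informed_model(question):
    """Stand-in for a model that was shown the real column name 'pages'."""
    if question.startswith("How many"):
        return "SELECT COUNT(*) FROM books"
    return "SELECT title FROM books ORDER BY pages DESC LIMIT 1"
```

Calling `evaluate(guessing_model)` and `evaluate(informed_model)` gives two score dictionaries that can be compared directly; in a real harness, `generate_sql` would wrap the actual agent and prompt under test.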

Trend: prompt engineering is becoming a hard engineering discipline

This case reveals a bigger shift: system prompts for AI applications are no longer write-once-and-forget. They can degrade as models upgrade, data changes, or usage scenarios drift. In the past, we relied on experience and trial-and-error; now, we can use tools like DSPy or LangSmith to build an evaluation pipeline, running a regression test suite with every modification. Essentially, it’s applying Continuous Integration (CI) from software engineering to prompts. You’ll also find that many so-called “prompt tricks” don’t stand up to testing; only data-driven iteration can polish a truly reliable product.
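
As a rough illustration of what a "regression test for prompts" could look like, here is a small sketch of a threshold check that a CI job might run after every prompt change. The baseline values, the tolerance, and the `run_eval` placeholder are assumptions made up for this example; they are not numbers or functions from Simon's experiment.

```python
# Hypothetical baseline scores recorded from a previous, accepted prompt version.
BASELINE = {"execution_rate": 0.90, "accuracy": 0.80}

def regressions(scores, baseline=BASELINE, tolerance=0.02):
    """Return the names of metrics that fell more than `tolerance` below baseline."""
    return sorted(
        name
        for name, floor in baseline.items()
        if scores.get(name, 0.0) < floor - tolerance
    )

def run_eval():
    """Placeholder: a real version would run the evaluation harness against the
    current prompt. The fixed numbers here only make the file runnable."""
    return {"execution_rate": 0.95, "accuracy": 0.85}

def test_prompt_has_not_regressed():
    """A test function in the style a runner such as pytest would collect."""
    assert regressions(run_eval()) == []
```

The small tolerance is a design choice: model output is noisy, so a hard equality check against the baseline would fail on harmless fluctuations.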

Practical value: how can we use this?

If your AI application also involves complex system prompts, you can borrow Simon’s process:

Build a test set: even a handful of handcrafted examples is better than pure eyeballing.
Define metrics: don’t just look at how “human-like” the answer sounds; check whether the task succeeded, whether retries happened, and whether side effects occurred.
Perform ablation studies: delete or rephrase a sentence in the prompt and observe how the metrics change. For instance, softening the “don’t call describe_table” advice, or changing it to “first fetch the complete schema,” should alleviate the issue.
Don’t fetishize “less is more”: for structured information, providing sufficient context often beats clever instructions. Give the model column names, types, and descriptions when it needs them.
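
The ablation step in the list above can be sketched as follows. This is an illustrative helper, not part of DSPy or Datasette Agent: `ablate` and `compare` are hypothetical names, and `score` is assumed to be any function that takes a prompt string and returns a single number from an evaluation run.

```python
from typing import Callable

def ablate(sentences: list[str]) -> dict[str, str]:
    """Build the baseline prompt plus one variant per sentence with that sentence removed."""
    variants = {"baseline": " ".join(sentences)}
    for i in range(len(sentences)):
        kept = sentences[:i] + sentences[i + 1:]
        variants[f"without[{i}]"] = " ".join(kept)
    return variants

def compare(
    sentences: list[str], score: Callable[[str], float]
) -> list[tuple[str, float]]:
    """Score every variant and return (label, delta vs. baseline), best first."""
    variants = ablate(sentences)
    base = score(variants["baseline"])
    deltas = [
        (label, score(prompt) - base)
        for label, prompt in variants.items()
        if label != "baseline"
    ]
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

A variant with a large positive delta points at a sentence that is hurting the prompt; a large negative delta points at one that is doing real work. Because each call to `score` is a full evaluation run, the cost grows with the number of sentences, so it may be worth ablating only the instructions under suspicion.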

Counter-intuition: a well-meaning tip can be a trap

Many of us (including me) like to add a bunch of “don’t do this” or “don’t do that” constraints when writing prompts, thinking it will reduce errors. But this case teaches us that incomplete context + a strong negative command = the model is forced to make wild guesses. It’s better to provide full information and let the model decide. Whenever you are about to say “don’t do X,” first check whether the prerequisites for doing X are clear.
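
To make "provide full information" concrete, here is one possible way to render a schema listing that includes column names and types for a SQLite database, using the standard `sqlite_master` table and `PRAGMA table_info`. The function name `schema_listing` and the output format are assumptions for illustration; this is not how Datasette Agent builds its prompt.

```python
import sqlite3

def schema_listing(conn: sqlite3.Connection, include_columns: bool = True) -> str:
    """Render one line per table, optionally with column names and declared types."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    lines = []
    for table in tables:
        if not include_columns:
            # The table-names-only form: the model has to guess the columns.
            lines.append(table)
            continue
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        # Simple quoting only; table names containing double quotes are not handled.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        described = ", ".join(f"{col[1]} {col[2]}".strip() for col in columns)
        lines.append(f"{table}({described})")
    return "\n".join(lines)
```

With `include_columns=False` the output mirrors the situation described in this article, where an instruction such as "don't call describe_table" leaves the model nothing to work from. The fuller listing costs more tokens, which is the trade-off the article argues is usually worth making.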

Simon ended by noting he particularly liked one of the findings: “Either include the column names in the prompt’s schema listing, or soften that advice.” It seems trivial, but it’s exactly this shift from “gut feel” to “data-informed” that brings engineering rigor to AI product quality. The next time your model acts silly, try giving its prompt a “CI check.”

Originally from Simon Willison · Analyzed by BitByAI