
How we build evals for Deep Agents

LangChain Blog · Agent Frameworks · Advanced · Impact: 8/10

LangChain shares its core philosophy for building AI agent evaluation systems: more evals are not automatically better; instead, precisely define and measure the agent behaviors you care about to guide the agent's evolution.

Key Points

  • Evals are 'vectors' that shape agent behavior; blindly adding more creates a false sense of security
  • The core method is to first define key behaviors in production, then design verifiable eval tasks
  • Continuously discover and supplement evals through daily dogfooding, trace analysis, and adapting external benchmarks
  • Evals should be categorized (e.g., file_operations, tool_use) for intermediate performance insights, not just a single score

Analysis

Have you ever run hundreds of tests on your AI agent, achieved a high score, only to watch it fail spectacularly in production? A recent LangChain blog post tackles this pain point directly: the core of their evaluation system for Deep Agents isn't about chasing test quantity, but about precisely aligning test quality with desired agent behavior. This matters because as AI agents evolve from simple Q&A bots into systems that execute complex, multi-step tasks (like automatically fixing code bugs), traditional evaluation methods fall short. You can't measure a task requiring multiple tool calls and file reads with a simple 'accuracy' metric.

LangChain's philosophy is that every evaluation case acts as a 'vector' guiding agent behavior. When you tweak a system prompt because an 'efficient file reading' eval failed, you are applying behavioral pressure. The design of your evals therefore directly shapes what your agent becomes.

Their methodology breaks down into three steps. The first is to define the behaviors you want. This sounds simple, but many teams skip it and jump straight to collecting test cases. LangChain first maps out the capabilities that matter most in production, like 'retrieving content across multiple files' or 'accurately composing 5+ tool calls in sequence.' The second is to document each eval. A test shouldn't just run; it should carry a docstring clearly explaining how it measures a specific capability. This keeps evals self-explanatory, so team members understand their intent instead of facing a black box of tests. The third is to categorize and run. Evals are tagged by the capability they test (e.g., file_operations, tool_use), not by their source (e.g., 'from the BFCL benchmark'). This provides an intermediate-granularity view of performance, showing how the agent does overall on tasks like 'file operations' rather than collapsing everything into a single aggregate score.

For data sourcing, they employ a pragmatic three-part strategy. The most crucial is 'dogfooding': the team uses its own agents (like the open-source coding assistant Open SWE) daily, and every error encountered becomes a new eval case so the same mistake doesn't recur. Next is 'adapt and adopt,' selecting relevant tasks from external benchmarks like Terminal Bench 2.0 and BFCL and tailoring them to their agent's specific context. Finally there is 'handcraft': writing dedicated unit tests for isolated behaviors they consider important but uncovered by existing benchmarks (like testing the read_file tool).

A key counter-intuitive insight: more evals do not equal a better agent. Blindly piling on tests may simply optimize a metric disconnected from your real production needs, creating an illusion of score inflation. Precise, targeted evals, though fewer in number, drive agent improvements in real-world scenarios more effectively while saving significant model-invocation costs.

This reveals a deeper trend: AI engineering is shifting from a 'model capability race' to 'system behavior engineering.' Evaluation is no longer a post-development acceptance step but a core engineering practice that runs through the entire development lifecycle, continuously shaping system behavior. For developers, this means a mindset shift: instead of chasing the latest benchmark leaderboards, dig deep into what your users actually need the agent to do, then build your evaluation checklist around those specific behaviors. Your evaluation system is the blueprint for your agent's capabilities.
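The document-and-categorize pattern can be sketched in a few lines of code. The following is a minimal illustration, not LangChain's actual harness: names like `eval_case` and `run_by_category` are hypothetical, and `simulate_agent_trace` stands in for a real agent run, but the shape (a docstring per eval plus a capability tag, with pass rates reported per category instead of one aggregate score) mirrors the approach described above.

```python
# Hypothetical sketch of capability-tagged evals; names are illustrative,
# not LangChain internals.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

EVALS: list["Eval"] = []

@dataclass
class Eval:
    name: str
    category: str              # capability tested, e.g. "file_operations"
    doc: str                   # docstring: how this eval measures the capability
    check: Callable[[], bool]  # returns True if the behavior holds

def eval_case(category: str):
    """Register a function as an eval, tagged by the capability it tests."""
    def decorator(fn):
        EVALS.append(Eval(fn.__name__, category, fn.__doc__ or "", fn))
        return fn
    return decorator

def simulate_agent_trace() -> int:
    """Placeholder: a real harness would count tool calls from an agent trace."""
    return 1

@eval_case("file_operations")
def reads_file_without_redundant_calls():
    """Measures whether the agent retrieves a file's content in a single
    read instead of re-reading it on every step."""
    return simulate_agent_trace() <= 1

def run_by_category() -> dict[str, float]:
    """Run all registered evals and report a pass rate per capability."""
    totals, passes = defaultdict(int), defaultdict(int)
    for e in EVALS:
        totals[e.category] += 1
        passes[e.category] += int(e.check())
    return {c: passes[c] / totals[c] for c in totals}

print(run_by_category())  # → {'file_operations': 1.0}
```

Grouping results by `category` rather than by benchmark source is what gives the intermediate-granularity view: a drop in the `file_operations` pass rate points at a capability, not at a test suite.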

Analysis generated by BitByAI

Originally from LangChain Blog


BitByAI — AI-powered, AI-evolved AI News