
How we build evals for Deep Agents

LangChain Blog · Agent Frameworks · Advanced · Impact: 8/10

LangChain shares its core philosophy for building AI agent evaluation systems: more evals are not automatically better; instead, precisely define and measure the agent behaviors you care about to guide the agent's evolution.

Key Points

  • Evals are 'vectors' that shape agent behavior; blindly adding more creates a false sense of security
  • The core method is to first define key behaviors in production, then design verifiable eval tasks
  • Continuously discover and supplement evals through daily dogfooding, trace analysis, and adapting external benchmarks
  • Evals should be categorized (e.g., file_operations, tool_use) for intermediate performance insights, not just a single score

Analysis

Have you ever run hundreds of tests on your AI agent, achieved a high score, only to watch it fail spectacularly in production? A recent LangChain blog post tackles this pain point directly: the core of their evaluation system for Deep Agents isn't about chasing test quantity, but about precisely aligning test quality with desired agent behavior. This matters because as AI agents evolve from simple Q&A bots into systems that execute complex, multi-step tasks (like automatically fixing code bugs), traditional evaluation methods fall short. You can't measure a task requiring multiple tool calls and file reads with a simple 'accuracy' metric.

LangChain's philosophy is that every evaluation case acts as a 'vector' guiding agent behavior. When you tweak a system prompt because an 'efficient file reading' eval failed, you are applying behavioral pressure. The design of your evals therefore directly shapes what your agent becomes.

Their methodology breaks down into three steps. The first is to define the behaviors you want. This sounds simple, but many teams skip it and jump straight to collecting test cases. LangChain first maps out the capabilities that matter most in production, like 'retrieving content across multiple files' or 'accurately composing 5+ tool calls in sequence.' The second is to document each eval. A test shouldn't just run; it should carry a docstring clearly explaining how it measures a specific capability. This keeps evals self-explanatory, so team members understand their intent instead of facing a black box of tests. The third is to categorize and run. Evals are tagged by the capability they test (e.g., file_operations, tool_use), not by their source (e.g., 'from the BFCL benchmark'). This provides an intermediate-granularity view of performance, showing how the agent does overall on tasks like 'file operations' rather than collapsing everything into a single aggregate score.

For data sourcing, they employ a pragmatic three-part strategy. The most crucial is 'dogfooding': the team uses its own agents (like the open-source coding assistant Open SWE) daily, and every error encountered becomes a new eval case so the same mistake doesn't recur. Next is 'adapt and adopt,' selecting relevant tasks from external benchmarks like Terminal Bench 2.0 and BFCL and tailoring them to their agent's specific context. Finally there is 'handcraft': writing dedicated unit tests for isolated behaviors they consider important but uncovered by existing benchmarks (like testing the read_file tool).

A key counter-intuitive insight: more evals do not equal a better agent. Blindly piling on tests may simply optimize a metric disconnected from your real production needs, creating an illusion of score inflation. Precise, targeted evals, though fewer in number, drive agent improvements in real-world scenarios more effectively while saving significant model-invocation costs.

This reveals a deeper trend: AI engineering is shifting from a 'model capability race' to 'system behavior engineering.' Evaluation is no longer a post-development acceptance step but a core engineering practice that runs through the entire development lifecycle, continuously shaping system behavior. For developers, this means a mindset shift: instead of chasing the latest benchmark leaderboards, dig deep into what your users actually need the agent to do, then build your evaluation checklist around those specific behaviors. Your evaluation system is the blueprint for your agent's capabilities.
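The document-and-categorize pattern can be sketched in a few lines of code. The following is a minimal illustration, not LangChain's actual harness: names like `eval_case` and `run_by_category` are hypothetical, and `simulate_agent_trace` stands in for a real agent run, but the shape (a docstring per eval plus a capability tag, with pass rates reported per category instead of one aggregate score) mirrors the approach described above.

```python
# Hypothetical sketch of capability-tagged evals; names are illustrative,
# not LangChain internals.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

EVALS: list["Eval"] = []

@dataclass
class Eval:
    name: str
    category: str              # capability tested, e.g. "file_operations"
    doc: str                   # docstring: how this eval measures the capability
    check: Callable[[], bool]  # returns True if the behavior holds

def eval_case(category: str):
    """Register a function as an eval, tagged by the capability it tests."""
    def decorator(fn):
        EVALS.append(Eval(fn.__name__, category, fn.__doc__ or "", fn))
        return fn
    return decorator

def simulate_agent_trace() -> int:
    """Placeholder: a real harness would count tool calls from an agent trace."""
    return 1

@eval_case("file_operations")
def reads_file_without_redundant_calls():
    """Measures whether the agent retrieves a file's content in a single
    read instead of re-reading it on every step."""
    return simulate_agent_trace() <= 1

def run_by_category() -> dict[str, float]:
    """Run all registered evals and report a pass rate per capability."""
    totals, passes = defaultdict(int), defaultdict(int)
    for e in EVALS:
        totals[e.category] += 1
        passes[e.category] += int(e.check())
    return {c: passes[c] / totals[c] for c in totals}

print(run_by_category())  # → {'file_operations': 1.0}
```

Grouping results by `category` rather than by benchmark source is what gives the intermediate-granularity view: a drop in the `file_operations` pass rate points at a capability, not at a test suite.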

Analysis generated by BitByAI

Originally from LangChain Blog


BitByAI — AI-powered, AI-evolved AI News