Learn Agent Eval
The agent eval vocabulary. These are the concepts that define the category — coined by Iris, grounded in practice, and backed by research.
Agent Eval: The Definitive Guide
The complete reference for evaluating AI agent outputs — methodologies, implementation patterns, vocabulary, and code examples. Start here.
The Vocabulary
Eval Drift
Quality degradation over time as models and prompts change.
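One minimal way to make drift visible (a sketch, not from the source; the window sizes and tolerance are assumptions) is to compare mean eval scores from a baseline window against the most recent window:

```python
import statistics

# Hypothetical sketch: flag eval drift when the recent mean score falls
# more than `tolerance` below the baseline mean. Windows and tolerance
# are illustrative choices.
def drift_detected(baseline: list[float], recent: list[float],
                   tolerance: float = 0.05) -> bool:
    """True when recent scores have degraded beyond tolerance."""
    return statistics.mean(baseline) - statistics.mean(recent) > tolerance

print(drift_detected([0.9, 0.88, 0.91], [0.8, 0.78, 0.82]))  # True
```

In practice you would compute the windows from timestamped scores rather than pass them in by hand.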
The Eval Tax
The compounding cost of not evaluating agent outputs.
The Eval Gap
The distance between demo performance and production reality.
Eval Coverage
The percentage of agent executions being evaluated.
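As a metric this is straightforward to compute; a minimal sketch, assuming executions are records with an optional `eval_score` field (a name invented here for illustration):

```python
# Hypothetical sketch: eval coverage as the share of agent executions
# that received at least one evaluation score.
def eval_coverage(executions: list[dict]) -> float:
    """Percentage of executions carrying an eval score (0-100)."""
    if not executions:
        return 0.0
    evaluated = sum(1 for e in executions if e.get("eval_score") is not None)
    return 100.0 * evaluated / len(executions)

runs = [
    {"id": 1, "eval_score": 0.91},
    {"id": 2, "eval_score": None},
    {"id": 3, "eval_score": 0.64},
    {"id": 4},
]
print(eval_coverage(runs))  # 50.0
```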
Eval-Driven Development
Write your eval rules before you write your prompt. TDD for agents.
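The TDD analogy can be made concrete. A minimal sketch, where the rules and the draft output are invented for illustration: the eval rules exist before any prompt does, and a prompt only ships once its outputs pass them.

```python
# Hypothetical sketch of eval-driven development: define the checks first,
# then iterate on the prompt until its outputs satisfy every rule.
RULES = [
    ("mentions refund policy", lambda out: "refund" in out.lower()),
    ("stays under 50 words", lambda out: len(out.split()) <= 50),
]

def passes_evals(output: str) -> bool:
    """True only when the output satisfies every pre-written rule."""
    return all(check(output) for _, check in RULES)

draft = "We offer a full refund within 30 days."
print(passes_evals(draft))  # True
```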
The Eval Loop
The continuous cycle: score, diagnose, calibrate, re-score.
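One pass through that cycle can be sketched in a few lines. This is illustrative only: the calibration policy here (drop the threshold to the median score when most outputs fail) is an invented example, not a prescribed rule.

```python
# Hypothetical sketch of one eval-loop pass:
# score -> diagnose -> calibrate -> re-score.
def eval_loop_pass(outputs, score_fn, threshold=0.8):
    scores = {o: score_fn(o) for o in outputs}                  # score
    failing = [o for o, s in scores.items() if s < threshold]   # diagnose
    if len(failing) > len(outputs) / 2:                         # calibrate:
        # if most outputs fail, assume the rule is too strict and
        # relax the threshold to the median observed score
        threshold = sorted(scores.values())[len(outputs) // 2]
    passing = [o for o, s in scores.items() if s >= threshold]  # re-score
    return threshold, passing

judged = {"a": 0.2, "b": 0.4, "c": 0.6, "d": 0.9}
print(eval_loop_pass(list(judged), judged.get))  # (0.6, ['c', 'd'])
```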
Self-Calibrating Eval
Eval systems that monitor their own scoring distribution and recommend adjustments.
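A minimal sketch of the monitoring half, assuming scores fall in [0, 1]; the saturation and variance cutoffs are illustrative assumptions, not calibrated values:

```python
import statistics

# Hypothetical sketch: a self-calibration check that inspects its own
# score distribution and recommends a rubric adjustment when scores
# saturate at either extreme or stop discriminating.
def calibration_report(scores: list[float]) -> str:
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    if mean > 0.95:
        return "scores saturated high: tighten the rubric"
    if mean < 0.05:
        return "scores saturated low: loosen the rubric"
    if stdev < 0.05:
        return "low variance: rubric may not discriminate"
    return "distribution healthy"

print(calibration_report([0.97, 0.99, 0.96, 0.98]))
```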
Output Quality Score
A composite metric that rolls completeness, relevance, safety, and cost into one number.
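The simplest composite is a weighted average; a minimal sketch, where the weights are invented for illustration and each sub-score is assumed to lie in [0, 1] (cost scored so that higher means cheaper):

```python
# Hypothetical sketch: roll per-dimension sub-scores into one Output
# Quality Score via a weighted average. Weights are illustrative.
WEIGHTS = {"completeness": 0.35, "relevance": 0.35, "safety": 0.2, "cost": 0.1}

def output_quality_score(subscores: dict[str, float]) -> float:
    """Weighted average of sub-scores in [0, 1], rounded to 3 places."""
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 3)

print(output_quality_score(
    {"completeness": 0.9, "relevance": 0.8, "safety": 1.0, "cost": 0.6}
))
```

In a real system the weights would be tuned to your product's priorities, and a hard safety floor (fail outright below a minimum safety score) is a common addition.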