Learn Agent Eval
The agent eval vocabulary. These are the concepts that define the category — coined by Iris, grounded in practice, and backed by research.
Agent Eval: The Definitive Guide
The complete reference for evaluating AI agent outputs — methodologies, implementation patterns, vocabulary, and code examples. Start here.
The Vocabulary
Eval Drift
Quality degradation over time as models and prompts change.
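One minimal way to make drift visible (a sketch, not from the source; the window sizes and tolerance are assumptions) is to compare mean eval scores from a baseline window against the most recent window:

```python
import statistics

# Hypothetical sketch: flag eval drift when the recent mean score falls
# more than `tolerance` below the baseline mean. Windows and tolerance
# are illustrative choices.
def drift_detected(baseline: list[float], recent: list[float],
                   tolerance: float = 0.05) -> bool:
    """True when recent scores have degraded beyond tolerance."""
    return statistics.mean(baseline) - statistics.mean(recent) > tolerance

print(drift_detected([0.9, 0.88, 0.91], [0.8, 0.78, 0.82]))  # True
```

In practice you would compute the windows from timestamped scores rather than pass them in by hand.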
The Eval Tax
The compounding cost of not evaluating agent outputs.
The Eval Gap
The distance between demo performance and production reality.
Eval Coverage
The percentage of agent executions being evaluated.
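As a metric this is straightforward to compute; a minimal sketch, assuming executions are records with an optional `eval_score` field (a name invented here for illustration):

```python
# Hypothetical sketch: eval coverage as the share of agent executions
# that received at least one evaluation score.
def eval_coverage(executions: list[dict]) -> float:
    """Percentage of executions carrying an eval score (0-100)."""
    if not executions:
        return 0.0
    evaluated = sum(1 for e in executions if e.get("eval_score") is not None)
    return 100.0 * evaluated / len(executions)

runs = [
    {"id": 1, "eval_score": 0.91},
    {"id": 2, "eval_score": None},
    {"id": 3, "eval_score": 0.64},
    {"id": 4},
]
print(eval_coverage(runs))  # 50.0
```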
Eval-Driven Development
Write your eval rules before you write your prompt. TDD for agents.
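The TDD analogy can be made concrete. A minimal sketch, where the rules and the draft output are invented for illustration: the eval rules exist before any prompt does, and a prompt only ships once its outputs pass them.

```python
# Hypothetical sketch of eval-driven development: define the checks first,
# then iterate on the prompt until its outputs satisfy every rule.
RULES = [
    ("mentions refund policy", lambda out: "refund" in out.lower()),
    ("stays under 50 words", lambda out: len(out.split()) <= 50),
]

def passes_evals(output: str) -> bool:
    """True only when the output satisfies every pre-written rule."""
    return all(check(output) for _, check in RULES)

draft = "We offer a full refund within 30 days."
print(passes_evals(draft))  # True
```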
The Eval Loop
The continuous cycle: score, diagnose, calibrate, re-score.
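One pass through that cycle can be sketched in a few lines. This is illustrative only: the calibration policy here (drop the threshold to the median score when most outputs fail) is an invented example, not a prescribed rule.

```python
# Hypothetical sketch of one eval-loop pass:
# score -> diagnose -> calibrate -> re-score.
def eval_loop_pass(outputs, score_fn, threshold=0.8):
    scores = {o: score_fn(o) for o in outputs}                  # score
    failing = [o for o, s in scores.items() if s < threshold]   # diagnose
    if len(failing) > len(outputs) / 2:                         # calibrate:
        # if most outputs fail, assume the rule is too strict and
        # relax the threshold to the median observed score
        threshold = sorted(scores.values())[len(outputs) // 2]
    passing = [o for o, s in scores.items() if s >= threshold]  # re-score
    return threshold, passing

judged = {"a": 0.2, "b": 0.4, "c": 0.6, "d": 0.9}
print(eval_loop_pass(list(judged), judged.get))  # (0.6, ['c', 'd'])
```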
Self-Calibrating Eval
Eval systems that monitor their own scoring distribution and recommend adjustments.
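A minimal sketch of the monitoring half, assuming scores fall in [0, 1]; the saturation and variance cutoffs are illustrative assumptions, not calibrated values:

```python
import statistics

# Hypothetical sketch: a self-calibration check that inspects its own
# score distribution and recommends a rubric adjustment when scores
# saturate at either extreme or stop discriminating.
def calibration_report(scores: list[float]) -> str:
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    if mean > 0.95:
        return "scores saturated high: tighten the rubric"
    if mean < 0.05:
        return "scores saturated low: loosen the rubric"
    if stdev < 0.05:
        return "low variance: rubric may not discriminate"
    return "distribution healthy"

print(calibration_report([0.97, 0.99, 0.96, 0.98]))
```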
Output Quality Score
A composite metric that rolls completeness, relevance, safety, and cost into one number.
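The simplest composite is a weighted average; a minimal sketch, where the weights are invented for illustration and each sub-score is assumed to lie in [0, 1] (cost scored so that higher means cheaper):

```python
# Hypothetical sketch: roll per-dimension sub-scores into one Output
# Quality Score via a weighted average. Weights are illustrative.
WEIGHTS = {"completeness": 0.35, "relevance": 0.35, "safety": 0.2, "cost": 0.1}

def output_quality_score(subscores: dict[str, float]) -> float:
    """Weighted average of sub-scores in [0, 1], rounded to 3 places."""
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 3)

print(output_quality_score(
    {"completeness": 0.9, "relevance": 0.8, "safety": 1.0, "cost": 0.6}
))
```

In a real system the weights would be tuned to your product's priorities, and a hard safety floor (fail outright below a minimum safety score) is a common addition.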