Iris v0.1 — The agent eval standard for MCP. 12 eval rules, open source

Learn Agent Eval

The agent eval vocabulary. These are the concepts that define the category — coined by Iris, grounded in practice, and backed by research.

Cornerstone Guide

Agent Eval: The Definitive Guide

The complete reference for evaluating AI agent outputs — methodologies, implementation patterns, vocabulary, and code examples. Start here.

The Vocabulary

Eval Drift

Quality degradation over time as models and prompts change.
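
One common way to make drift visible is to compare a recent window of eval scores against a known-good baseline window. A minimal sketch, assuming simple 0–1 scores; the function name and tolerance are illustrative, not part of Iris:

```ts
// Illustrative sketch: flag eval drift when the mean score of a recent
// window falls meaningfully below a known-good baseline window.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function detectEvalDrift(
  baseline: number[], // scores from a known-good period
  recent: number[],   // scores from the current period
  tolerance = 0.1,    // maximum acceptable drop in mean score
): boolean {
  return mean(baseline) - mean(recent) > tolerance;
}

// A model upgrade quietly dropped quality from ~0.90 to ~0.70:
console.log(detectEvalDrift([0.90, 0.92, 0.88], [0.70, 0.72, 0.69])); // true
```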

The Eval Tax

The compounding cost of not evaluating agent outputs.

The Eval Gap

The distance between demo performance and production reality.

Eval Coverage

The percentage of agent executions being evaluated.
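
As a metric this is a straightforward ratio. A minimal sketch, assuming a hypothetical execution record with an `evaluated` flag:

```ts
// Hypothetical execution record; only the `evaluated` flag matters here.
interface AgentExecution {
  id: string;
  evaluated: boolean;
}

// Coverage = evaluated executions / total executions, as a percentage.
function evalCoverage(executions: AgentExecution[]): number {
  if (executions.length === 0) return 0;
  const evaluated = executions.filter((e) => e.evaluated).length;
  return (evaluated / executions.length) * 100;
}

console.log(evalCoverage([
  { id: "a", evaluated: true },
  { id: "b", evaluated: false },
  { id: "c", evaluated: true },
])); // ≈ 66.7
```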

Eval-Driven Development

Write your eval rules before you write your prompt. TDD for agents.
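
The workflow in miniature: the rules exist before the prompt does, and the prompt is revised until every rule passes. The `EvalRule` shape below is an illustrative assumption, not the Iris rule format:

```ts
// Illustrative rule shape; not the actual Iris rule format.
interface EvalRule {
  name: string;
  check: (output: string) => boolean;
}

// Step 1: write the eval rules first, before any prompt exists.
const rules: EvalRule[] = [
  { name: "non-empty", check: (o) => o.trim().length > 0 },
  { name: "cites-a-source", check: (o) => o.includes("http") },
  { name: "under-500-chars", check: (o) => o.length < 500 },
];

// Step 2: write (and keep revising) the prompt until every rule passes.
function passesAll(output: string): boolean {
  return rules.every((r) => r.check(output));
}

console.log(passesAll("See https://example.com for details.")); // true
```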

The Eval Loop

The continuous cycle: score, diagnose, calibrate, re-score.
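
A minimal, deterministic sketch of the loop's shape. The rule shapes, the fixed scores, and the threshold-nudge calibration step are all assumptions for illustration, not the Iris implementation:

```ts
// Illustrative only: score, diagnose, calibrate, re-score until nothing fails.
interface Rule { name: string; threshold: number; }

const rules: Rule[] = [
  { name: "completeness", threshold: 0.8 },
  { name: "relevance", threshold: 0.7 },
];

// Score: stand-in for judging a batch of agent outputs per rule.
const batchScores: Record<string, number> = { completeness: 0.75, relevance: 0.72 };
const score = (): Record<string, number> => ({ ...batchScores });

// Diagnose: which rules fall below their thresholds?
const diagnose = (scores: Record<string, number>): Rule[] =>
  rules.filter((r) => scores[r.name] < r.threshold);

// Calibrate: adjust the failing rules (here, a simple threshold nudge).
const calibrate = (failing: Rule[]): void => {
  for (const r of failing) r.threshold *= 0.95;
};

// The loop itself.
let scores = score();
while (diagnose(scores).length > 0) {
  calibrate(diagnose(scores));
  scores = score(); // re-score with the calibrated rules
}
console.log(rules); // completeness threshold settles at ~0.72
```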

Self-Calibrating Eval

Eval systems that monitor their own scoring distribution and recommend adjustments.
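
A minimal sketch of the monitoring half: watch the score distribution and flag when it collapses to the point that the eval no longer discriminates between outputs. The variance cutoff and messages are illustrative assumptions:

```ts
// Illustrative only: recommend an adjustment when scores saturate
// at either end of the scale.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function recommendAdjustment(scores: number[]): string {
  const m = mean(scores);
  const variance = mean(scores.map((x) => (x - m) ** 2));
  if (variance < 0.01) {
    // Everything scores the same: the eval no longer discriminates.
    return m > 0.8
      ? "Scores saturated high: consider tightening rule thresholds."
      : "Scores saturated low: check for an overly strict rule.";
  }
  return "Distribution looks healthy; no adjustment recommended.";
}

console.log(recommendAdjustment([0.95, 0.96, 0.94, 0.95])); // tighten
```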

Output Quality Score

A composite metric that rolls completeness, relevance, safety, and cost into one number.
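
One common way to compose such a metric is a weighted average over normalized 0–1 dimensions. A minimal sketch; the weights are illustrative assumptions, not Iris defaults:

```ts
// All dimensions normalized to 0..1; for cost, 1 means cheapest.
interface Dimensions {
  completeness: number;
  relevance: number;
  safety: number;
  cost: number;
}

// Illustrative weights; the real composition is up to your eval config.
const weights: Dimensions = {
  completeness: 0.35,
  relevance: 0.3,
  safety: 0.25,
  cost: 0.1,
};

// Weighted average of the four dimensions, rolled into one 0..1 score.
function outputQualityScore(d: Dimensions): number {
  return (
    d.completeness * weights.completeness +
    d.relevance * weights.relevance +
    d.safety * weights.safety +
    d.cost * weights.cost
  );
}

console.log(outputQualityScore({
  completeness: 0.9, relevance: 0.8, safety: 1.0, cost: 0.6,
})); // 0.865
```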