agent-evaluation

Installation

Summary

Comprehensive evaluation framework for designing, building, and monitoring AI agent performance across coding, conversational, research, and computer-use agents.

Covers three grader types (code-based, model-based, human) with trade-offs and best practices for each agent category
Provides an 8-step roadmap from initial task creation through production monitoring, including environment isolation, outcome-focused grading, and saturation detection
Includes benchmarks for major agent types: SWE-bench for coding, WebArena for computer use, τ2-Bench for conversational agents
Offers CI/CD integration patterns, A/B testing templates, and production sampling strategies for real-time quality monitoring

SKILL.md