google-agents-cli-eval

Summary

Evaluate ADK agents with metrics, evalsets, and the iterative eval-fix loop.

  • Run evaluations with agents-cli eval run using configurable criteria (tool trajectory, response matching, rubric-based scoring, hallucination detection, safety checks) and match types (EXACT, IN_ORDER, ANY_ORDER)
  • Build evalsets with multi-turn conversation cases, expected tool trajectories, intermediate responses, and session state overrides
  • Iterate through 5-10+ eval-fix cycles: diagnose failures, fix agent instructions or tool logic, rerun, and track progress with task lists
  • Avoid common pitfalls: don't lower thresholds to hide failures, handle extra tool calls with IN_ORDER matching, ensure app name matches directory, and initialize state with callbacks to prevent KeyError crashes
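
The three match types named above can be illustrated with a small sketch. This is not the agents-cli or ADK implementation: the function name `trajectory_matches` and the tool names are hypothetical, and it compares tool names only (a real trajectory metric may also compare arguments). It assumes EXACT means identical call sequences, IN_ORDER means the expected calls appear in order with extra calls allowed in between, and ANY_ORDER means every expected call appears somewhere with extras allowed.

```python
from typing import Sequence


def trajectory_matches(
    expected: Sequence[str], actual: Sequence[str], match_type: str
) -> bool:
    """Illustrative comparison of tool-call names (hypothetical helper)."""
    if match_type == "EXACT":
        # Same calls, same order, no extras.
        return list(expected) == list(actual)
    if match_type == "IN_ORDER":
        # Expected calls must appear as a subsequence of the actual calls.
        # `name in remaining` consumes the iterator up to the first match.
        remaining = iter(actual)
        return all(name in remaining for name in expected)
    if match_type == "ANY_ORDER":
        # Every expected call must appear; order and extras are ignored.
        unmatched = list(actual)
        for name in expected:
            if name not in unmatched:
                return False
            unmatched.remove(name)
        return True
    raise ValueError(f"unknown match type: {match_type}")


# Hypothetical tool names: the agent made one extra call (check_policy).
expected = ["lookup_order", "issue_refund"]
actual = ["lookup_order", "check_policy", "issue_refund"]

exact_ok = trajectory_matches(expected, actual, "EXACT")
in_order_ok = trajectory_matches(expected, actual, "IN_ORDER")
any_order_ok = trajectory_matches(expected, actual, "ANY_ORDER")
```

Under these assumptions the extra `check_policy` call fails EXACT but passes IN_ORDER and ANY_ORDER, which is why the pitfall list suggests IN_ORDER when an agent legitimately makes additional tool calls.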
SKILL.md

Agent Evaluation Guide

Requires: agents-cli (uv tool install google-agents-cli) — install uv first if needed.

Scaffolded project? If you used /google-agents-cli-scaffold, you already have agents-cli eval run (which chains generate and grade), tests/eval/datasets/, and tests/eval/eval_config.yaml. Start by running agents-cli eval run and iterate from there.

Reference Files

| File | Contents |
| --- | --- |
| references/dataset_schema.md | Canonical EvaluationDataset schema: all field types; JSON examples for single-turn, multi-turn, and multi-agent cases; common mistakes |
| references/metrics-guide.md | Complete metrics reference: all built-in metrics, match types, custom metrics, judge model config |
| references/user-simulation.md | Dynamic conversation testing: eval dataset synthesize flags, what scenarios are, compatible metrics |
| references/builtin-tools-eval.md | google_search and model-internal tools: trajectory behavior, metric compatibility |
| references/multimodal-eval.md | Multimodal inputs: eval dataset schema, built-in metric limitations, custom evaluator pattern |

The Quality Flywheel

First Seen
Apr 21, 2026
google-agents-cli-eval — google/agents-cli