google-agents-cli-eval

Summary

Evaluate ADK agents with metrics, evalsets, and the iterative eval-fix loop.

- Run evaluations with `agents-cli eval run` using configurable criteria (tool trajectory, response matching, rubric-based scoring, hallucination detection, safety checks) and match types (EXACT, IN_ORDER, ANY_ORDER).
- Build evalsets with multi-turn conversation cases, expected tool trajectories, intermediate responses, and session state overrides.
- Iterate through 5-10+ eval-fix cycles: diagnose failures, fix agent instructions or tool logic, rerun, and track progress with task lists.
- Avoid common pitfalls: don't lower thresholds to hide failures, handle extra tool calls with IN_ORDER matching, ensure the app name matches the directory, and initialize state with callbacks to prevent KeyError crashes.
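
The three match types named above can be illustrated with a minimal sketch. It assumes that EXACT requires the same calls in the same order with no extras, that IN_ORDER requires the expected calls to appear in order while tolerating extra calls, and that ANY_ORDER requires every expected call to appear regardless of order. The function `trajectory_matches` and the tool names are hypothetical and compare tool names only; the real metric may also compare arguments, so check `references/metrics-guide.md` for the authoritative behavior.

```python
from collections import Counter
from enum import Enum
from typing import Sequence


class MatchType(Enum):
    EXACT = "EXACT"
    IN_ORDER = "IN_ORDER"
    ANY_ORDER = "ANY_ORDER"


def trajectory_matches(
    expected: Sequence[str], actual: Sequence[str], match_type: MatchType
) -> bool:
    """Hypothetical helper: compare tool-call names under an assumed match type."""
    if match_type is MatchType.EXACT:
        # Same calls, same order, nothing extra.
        return list(expected) == list(actual)
    if match_type is MatchType.IN_ORDER:
        # Subsequence check: each expected call must be found after the
        # previous one; extra calls in between are tolerated.
        remaining = iter(actual)
        return all(name in remaining for name in expected)
    # ANY_ORDER: every expected call must be present (counting repeats),
    # in any position; extra calls are tolerated.
    return not (Counter(expected) - Counter(actual))


expected_calls = ["lookup_order", "issue_refund"]
actual_calls = ["lookup_order", "get_policy", "issue_refund"]  # one extra call

for match_type in MatchType:
    print(match_type.value, trajectory_matches(expected_calls, actual_calls, match_type))
```

Under these assumptions, an agent that makes one extra but harmless tool call fails EXACT and passes IN_ORDER, which is why the pitfall above recommends IN_ORDER rather than lowering a threshold.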

SKILL.md

Agent Evaluation Guide

Requires: agents-cli (uv tool install google-agents-cli) — install uv first if needed.

Scaffolded project? If you used `/google-agents-cli-scaffold`, you already have `agents-cli eval run` (which chains generate + grade), `tests/eval/datasets/`, and `tests/eval/eval_config.yaml`. Start by running `agents-cli eval run` and iterate from there.
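
For scripting the rerun step, a small wrapper like the following may help. It is only a sketch: `run_eval` is a hypothetical helper, the command is invoked with no flags because none are documented here, and how the exit code relates to passing or failing evals is an assumption that should be checked against the CLI's actual behavior.

```python
import shutil
import subprocess
import sys
from typing import Sequence

# The only command taken from this guide; no flags are assumed.
DEFAULT_COMMAND = ("agents-cli", "eval", "run")


def run_eval(command: Sequence[str] = DEFAULT_COMMAND) -> int:
    """Run the eval command and return its exit code.

    Returns 127 (the conventional "command not found" code) when the
    executable is not on PATH, instead of raising FileNotFoundError.
    """
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found on PATH; install it first", file=sys.stderr)
        return 127
    # check=False: return the exit code rather than raising on failure.
    completed = subprocess.run(list(command), check=False)
    return completed.returncode
```

Calling `run_eval()` from the project root after each fix keeps the loop to a single step; the caller decides what to do with a nonzero return value.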

Reference Files

| File | Contents |
| --- | --- |
| `references/dataset_schema.md` | Canonical EvaluationDataset schema: all field types, JSON examples for single-turn / multi-turn / multi-agent, common mistakes |
| `references/metrics-guide.md` | Complete metrics reference: all built-in metrics, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing: `eval dataset synthesize` flags, what scenarios are, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools: trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs: eval dataset schema, built-in metric limitations, custom evaluator pattern |

The Quality Flywheel

Installs: 14.2K
Repository: google/agents-cli
GitHub Stars: 2.9K
First Seen: Apr 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)