adk-evals
ADK Evals Skill
What are Evals?
Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.
Evals run against a live dev bot (adk dev), so they test the full stack — not mocks.
When to Use This Skill
Use this skill when the developer asks about:
- Writing evals: file format, assertions, turn types, setup
- Running evals: CLI commands, filtering, output interpretation
- Testing specific primitives: how to test actions, tools, workflows, conversations, tables, state
- The testing loop: write → run → inspect traces → iterate
- CI integration: exit codes, the --format json flag, tagging strategies
- Eval configuration: idleTimeout, judgePassThreshold, judgeModel
Also use it when you are developing an ADK bot yourself and need the equivalent of unit or end-to-end tests.
Trigger questions:
- "How do I write an eval?"
- "How do I test my workflow?"
- "How do I assert that a tool was called with specific params?"
- "My eval is failing, how do I debug it?"
- "How do I test that the bot stays silent?"
- "How do I run evals in CI?"
- "How do I seed state before an eval?"
- "How do I trigger a workflow in an eval?"
Available Documentation
| File | Contents |
|---|---|
| references/eval-format.md | Complete file format: all fields, turn types, assertion categories, match operators, setup, outcome, options |
| references/testing-workflow.md | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| references/test-patterns.md | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |
How to Answer
- Writing an eval → Read eval-format.md for structure and assertions
- Running evals → Read testing-workflow.md for CLI commands and output
- Testing a specific primitive → Read test-patterns.md for the relevant section
- Debugging a failure → Combine testing-workflow.md (inspect traces) + eval-format.md (check assertion syntax)
Quick Reference
Eval file structure
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],
  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },
  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],
  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },
  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
Turn types
| Turn | When to use |
|---|---|
| user: 'message' | Standard user message |
| event: { type, payload } | Non-message trigger (webhook, integration event) |
| expectSilence: true | Assert bot does NOT respond |
Assertion categories
| Category | What it checks |
|---|---|
| response | Bot reply text (contains, matches, llm_judge, similar_to) |
| tools | Tool calls (called, not_called, call_order, params) |
| state | Bot/user/conversation state (equals, changed) |
| tables | Table rows (row_exists, row_count) |
| workflow | Workflow execution (entered, completed) |
| timing | Response time in ms (lte, gte) |
CLI commands
adk evals # run all evals
adk evals <name> # run one eval
adk evals --tag <tag> # filter by tag
adk evals --type regression # filter by type
adk evals --verbose # show all assertions
adk evals --format json # JSON output for CI
adk evals runs # list recent runs
adk evals runs --latest # most recent run
adk evals runs --latest -v # with full details
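For CI, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the ADK: buildEvalArgs and EvalRunOptions are hypothetical names, and combining several filters in a single invocation is an assumption that should be checked against testing-workflow.md. Only flags listed in the command reference above are used.

```typescript
// Hypothetical helper (not an ADK API): builds the argument list for `adk evals`.
interface EvalRunOptions {
  name?: string // run a single eval by name
  tag?: string // filter by tag
  type?: string // filter by type, e.g. 'regression'
  verbose?: boolean // show all assertions
  json?: boolean // JSON output for CI
}

function buildEvalArgs(opts: EvalRunOptions = {}): string[] {
  const args: string[] = ['evals']
  if (opts.name) args.push(opts.name)
  if (opts.tag) args.push('--tag', opts.tag)
  if (opts.type) args.push('--type', opts.type)
  if (opts.verbose) args.push('--verbose')
  if (opts.json) args.push('--format', 'json')
  return args
}

// Example: the command line a CI job might run for regression evals.
const ciArgs = buildEvalArgs({ type: 'regression', json: true })
console.log(['adk', ...ciArgs].join(' '))
```

The resulting array could be handed to a process runner such as Node's child_process.spawnSync, with the job failing on a non-zero exit code.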
Critical Patterns
✅ Every turn needs user or event
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
❌ expectSilence alone is not a valid turn
// WRONG — missing user or event
{ expectSilence: true }
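The turn rules can be expressed as a small check. This is only an illustrative sketch: checkTurn and the local Turn type are hypothetical stand-ins, not ADK exports, and the real validation performed by the ADK may differ. It encodes the rule above plus the two mutual exclusivity rules named in this skill (user vs event, expectSilence vs assert.response).

```typescript
// Local stand-in for a conversation turn; the real type comes from '@botpress/adk'.
interface Turn {
  user?: string
  event?: { type: string; payload?: unknown }
  expectSilence?: boolean
  assert?: { response?: unknown[] }
}

// Hypothetical lint helper: returns a list of problems, empty when the turn looks valid.
function checkTurn(turn: Turn): string[] {
  const problems: string[] = []
  const hasUser = turn.user !== undefined
  const hasEvent = turn.event !== undefined
  if (!hasUser && !hasEvent) problems.push('turn needs user or event')
  if (hasUser && hasEvent) problems.push('user and event are mutually exclusive')
  if (turn.expectSilence && turn.assert?.response !== undefined) {
    problems.push('expectSilence cannot be combined with assert.response')
  }
  return problems
}

console.log(checkTurn({ expectSilence: true }))
console.log(checkTurn({ user: 'hello', expectSilence: true }))
console.log(checkTurn({ event: { type: 'payment.failed' }, expectSilence: true }))
```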
✅ Assert tool params to verify correct extraction
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
❌ Only asserting the tool was called
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
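To show where the parameter assertion sits inside a complete definition, here is a sketch of an eval body for the createTicket case. So that it compiles on its own, it declares local stand-in types instead of importing Eval; in a real eval file the same object would be passed to new Eval(...). The eval name, tag, and user message are made up for illustration.

```typescript
// Local stand-in types; the real shapes come from '@botpress/adk'.
type Match = { equals: unknown }
interface ToolAssertion {
  called?: string
  not_called?: string
  params?: Record<string, Match>
}
interface TicketEval {
  name: string
  type: string
  tags: string[]
  conversation: Array<{ user: string; assert: { tools: ToolAssertion[] } }>
}

const ticketPriorityEval: TicketEval = {
  name: 'ticket-priority', // hypothetical eval name
  type: 'regression',
  tags: ['tools'], // hypothetical tag
  conversation: [
    {
      user: 'Checkout is down for all customers, this is urgent. Please open a ticket.',
      assert: {
        tools: [
          // The tool must be called AND the extracted priority must be 'high'.
          { called: 'createTicket', params: { priority: { equals: 'high' } } },
        ],
      },
    },
  ],
}

console.log(JSON.stringify(ticketPriorityEval.conversation[0]?.assert, null, 2))
```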
✅ Use outcome for post-conversation state and table assertions
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
❌ Checking tables in per-turn assertions when the write happens at the end
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
✅ Seed state to test conditional behavior without running setup turns
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
❌ Using conversation turns to set up state (slow and fragile)
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' }, // hoping bot sets user.plan
  { user: 'I need help with billing' }, // actual test turn
]
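Put together, the seeded version needs only the turn under test. The sketch below uses a local stand-in type rather than the real Eval class so that it is self-contained; in an eval file the same object would be passed to new Eval(...). The eval name and the llm_judge wording are illustrative assumptions.

```typescript
// Local stand-in type; the real shape comes from '@botpress/adk'.
interface SeededEval {
  name: string
  type: string
  setup: { state: Record<string, Record<string, unknown>> }
  conversation: Array<{
    user: string
    assert: {
      response?: Array<Record<string, string>>
      state?: Array<{ path: string; equals: unknown }>
    }
  }>
}

const billingHelpForPro: SeededEval = {
  name: 'billing-help-pro-plan', // hypothetical eval name
  type: 'regression',
  setup: {
    // Start in a known state instead of relying on setup turns.
    state: { user: { plan: 'pro' }, conversation: { phase: 'support' } },
  },
  conversation: [
    {
      user: 'I need help with billing', // the only turn is the one being tested
      assert: {
        response: [{ llm_judge: 'Response addresses billing for a pro-plan customer' }],
        state: [{ path: 'user.plan', equals: 'pro' }],
      },
    },
  ],
}

console.log(billingHelpForPro.conversation.length)
```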
Example Questions
Writing evals:
- "Write an eval that tests my createTicket tool is called with the right priority"
- "How do I assert that the bot stays silent after an internal event?"
- "How do I test a multi-turn conversation where context is retained?"
Running evals:
- "How do I run only regression evals?"
- "How do I see which assertions failed and why?"
- "How do I integrate evals into GitHub Actions?"
Debugging:
- "My eval says the tool wasn't called but I think it was — how do I check?"
- "How do I inspect what the bot actually did during an eval?"
Per-primitive:
- "How do I test a workflow that uses step.sleep()?"
- "How do I verify a row was written to a table after a conversation?"
- "How do I test that state changed from the seeded value?"
Response Format
When helping a developer write an eval:
- Show the complete new Eval({}) call with realistic field values
- Include imports (import { Eval } from '@botpress/adk')
- Explain each assertion and why it's the right choice for that scenario
- Point out any mutual exclusivity rules if relevant (expectSilence vs assert.response, user vs event)
- Suggest the CLI command to run it: adk evals <name>
When helping debug a failing eval:
- Ask for or show the failing assertion (expected/actual diff)
- Suggest opening traces in the Control Panel to see what the bot did
- Identify whether the issue is in the eval assertion or the bot's behavior