adk-evals

Installation
SKILL.md

ADK Evals Skill

What are Evals?

Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.

Evals run against a live dev bot (adk dev), so they test the full stack — not mocks.

When to Use This Skill

Use this skill when the developer asks about:

  • Writing evals — file format, assertions, turn types, setup
  • Running evals — CLI commands, filtering, output interpretation
  • Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
  • The testing loop — write → run → inspect traces → iterate
  • CI integration — exit codes, --format json flag, tagging strategies
  • Eval configuration — idleTimeout, judgePassThreshold, judgeModel

Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.

Trigger questions:

  • "How do I write an eval?"
  • "How do I test my workflow?"
  • "How do I assert that a tool was called with specific params?"
  • "My eval is failing, how do I debug it?"
  • "How do I test that the bot stays silent?"
  • "How do I run evals in CI?"
  • "How do I seed state before an eval?"
  • "How do I trigger a workflow in an eval?"

Available Documentation

File Contents
references/eval-format.md Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options
references/testing-workflow.md Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration
references/test-patterns.md Per-primitive patterns for actions, tools, workflows, conversations, tables, and state

How to Answer

  1. Writing an eval → Read eval-format.md for structure and assertions
  2. Running evals → Read testing-workflow.md for CLI commands and output
  3. Testing a specific primitive → Read test-patterns.md for the relevant section
  4. Debugging a failure → Combine testing-workflow.md (inspect traces) + eval-format.md (check assertion syntax)

Quick Reference

Eval file structure

import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],

  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },

  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],

  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },

  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})

Turn types

Turn When to use
user: 'message' Standard user message
event: { type, payload } Non-message trigger (webhook, integration event)
expectSilence: true Assert bot does NOT respond

Assertion categories

Category What it checks
response Bot reply text (contains, matches, llm_judge, similar_to)
tools Tool calls (called, not_called, call_order, params)
state Bot/user/conversation state (equals, changed)
tables Table rows (row_exists, row_count)
workflow Workflow execution (entered, completed)
timing Response time in ms (lte, gte)

CLI commands

adk evals                        # run all evals
adk evals <name>                 # run one eval
adk evals --tag <tag>            # filter by tag
adk evals --type regression      # filter by type
adk evals --verbose              # show all assertions
adk evals --format json          # JSON output for CI

adk evals runs                   # list recent runs
adk evals runs --latest          # most recent run
adk evals runs --latest -v       # with full details

Critical Patterns

Every turn needs user or event

// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }

expectSilence alone is not a valid turn

// WRONG — missing user or event
{ expectSilence: true }

Assert tool params to verify correct extraction

// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }

Only asserting the tool was called

// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }

Use outcome for post-conversation state and table assertions

// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}

Checking tables in per-turn assertions when the write happens at the end

// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]

Seed state to test conditional behavior without running setup turns

// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}

Using conversation turns to set up state (slow and fragile)

// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },      // hoping bot sets user.plan
  { user: 'I need help with billing' },   // actual test turn
]

Example Questions

Writing evals:

  • "Write an eval that tests my createTicket tool is called with the right priority"
  • "How do I assert that the bot stays silent after an internal event?"
  • "How do I test a multi-turn conversation where context is retained?"

Running evals:

  • "How do I run only regression evals?"
  • "How do I see which assertions failed and why?"
  • "How do I integrate evals into GitHub Actions?"

Debugging:

  • "My eval says the tool wasn't called but I think it was — how do I check?"
  • "How do I inspect what the bot actually did during an eval?"

Per-primitive:

  • "How do I test a workflow that uses step.sleep()?"
  • "How do I verify a row was written to a table after a conversation?"
  • "How do I test that state changed from the seeded value?"

Response Format

When helping a developer write an eval:

  1. Show the complete new Eval({}) call with realistic field values
  2. Include imports (import { Eval } from '@botpress/adk')
  3. Explain each assertion and why it's the right choice for that scenario
  4. Point out any mutual exclusivity rules if relevant (expectSilence vs assert.response, user vs event)
  5. Suggest the CLI command to run it: adk evals <name>

When helping debug a failing eval:

  1. Ask for or show the failing assertion (expected / actual diff)
  2. Suggest opening traces in the Control Panel to see what the bot did
  3. Identify whether the issue is in the eval assertion or the bot's behavior
Weekly Installs
3
Repository
botpress/skills
GitHub Stars
8
First Seen
7 days ago
Installed on
amp3
cline3
opencode3
cursor3
kimi-cli3
warp3