# Build Test Suite
Guide the user through building a complete test suite — test set + test cases with expected behaviors — for evaluating an AI agent using the coval CLI. Follow the phases below in order, asking questions at each step.
If `$ARGUMENTS` contains an agent name or use case, use it to skip or pre-fill questions in Phases 1-2.
## Phase 0: Setup + Preflight
### Step 1: Check authentication
```bash
coval whoami
```
If not authenticated, guide the user:
```bash
coval login
```
This prompts for an API key. Get one at https://app.coval.dev/settings (Organization > Manage > API Keys).
If the user doesn't have a Coval account, direct them to https://coval.dev to sign up.
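A minimal preflight sketch, assuming `coval whoami` exits non-zero when no credentials are configured (verify against your CLI version):

```bash
# Skip the login prompt when credentials already work.
# Assumes `coval whoami` exits non-zero when unauthenticated.
if ! coval whoami >/dev/null 2>&1; then
  echo "Not authenticated. Starting login..."
  coval login  # prompts for an API key from https://app.coval.dev/settings
fi
```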
### Step 2: Inventory existing resources
Run these in parallel:
```bash
coval agents list --format json
coval test-sets list --format json
```
Note existing agents and test sets for reference throughout the flow.
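A sketch of the parallel inventory in a POSIX shell; the output file names and the jq summary are illustrative, and the `length` query assumes each command returns a JSON array:

```bash
# Run both listings concurrently and wait for them to finish.
coval agents list --format json > agents.json &
coval test-sets list --format json > test-sets.json &
wait

# Optional summary for reference during the flow (assumes jq is available).
echo "Agents: $(jq 'length' agents.json)  Test sets: $(jq 'length' test-sets.json)"
```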
## Phase 1: Agent Context
Ask: "Which agent are these tests for?"
- If agents exist, present a numbered list and let the user pick or say "new"
- If `$ARGUMENTS` matches an agent name, select it automatically
Fetch the selected agent's details:
```bash
coval agents get <agent_id> --format json
```
Capture from the response:
- `agent_id`
- `model_type` (voice, chat, etc.)
- `prompt` (system prompt, if available)
- `display_name`
If the agent has a system prompt, use it later to generate more specific, domain-relevant test scenarios instead of generic templates.
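To pull these fields programmatically, a jq sketch like the following may help; the exact JSON paths are assumptions based on the field names above:

```bash
# Fetch the agent and extract the fields used later in the flow.
# Key names are assumed to appear at the top level of the response.
agent_json=$(coval agents get "$AGENT_ID" --format json)
agent_id=$(jq -r '.agent_id' <<<"$agent_json")
model_type=$(jq -r '.model_type' <<<"$agent_json")
display_name=$(jq -r '.display_name' <<<"$agent_json")
system_prompt=$(jq -r '.prompt // empty' <<<"$agent_json")  # may be absent
```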
## Phase 2: Test Set Type Selection
Load `references/test-set-types.md` and present the available types.
Ask: "What type of test set do you want to create?"
- SCENARIO is the default and best for most use cases
- Explain when each type is appropriate based on the reference
- If the user is unsure, recommend SCENARIO
Note: Test set type is not configurable via the CLI — all test sets default to SCENARIO type. To create other types, use the API: `POST /v1/test-sets` with a `test_set_type` field.
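A hedged sketch of that API call, modeled on the test-case curl in Phase 5; the `name` and `description` field names are assumptions:

```bash
curl -s -X POST https://api.coval.dev/v1/test-sets \
  -H "X-API-Key: $COVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "<name>",
    "description": "<desc>",
    "test_set_type": "<TYPE>"
  }'
```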
Then ask:
- "What would you like to name this test set?" — suggest:
"<Agent Name> Evaluation" - "Brief description?" — suggest based on agent type and use case
Create the test set:
```bash
coval test-sets create --name "<name>" --description "<desc>" --format json
```
Capture `test_set_id` from the JSON response.
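For example, assuming the response exposes the ID as a top-level `test_set_id` key:

```bash
# Create the test set and keep the returned ID for Phase 5.
response=$(coval test-sets create --name "<name>" --description "<desc>" --format json)
test_set_id=$(jq -r '.test_set_id' <<<"$response")
```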
## Phase 3: Scenario Design
Load `references/test-case-templates.md` and select the templates matching the agent's vertical/use case.
Present the 3-category pattern:
- `happy_path` — The standard, successful interaction
- `edge_case` — Unusual or challenging situations
- `compliance` — Regulatory, policy, or safety requirements
If the agent has a system prompt, customize the scenarios to be specific to the agent's domain rather than using generic templates. For example, if the agent handles dental appointments, tailor scenarios to dental-specific situations.
Present a summary table before creating:

```
Test Set: "<name>"

[happy_path] <test case name>
    <scenario description>
[edge_case] <test case name>
    <scenario description>
[compliance] <test case name>
    <scenario description>
```
Ask: "Create these test cases? (yes / customize / add more)"
- yes → proceed to Phase 4
- customize → let the user edit scenarios, then re-present
- add more → generate additional scenarios, then re-present
## Phase 4: Expected Behaviors
For each test case, help craft an `expected_behaviors` array. These are what the Composite Evaluation metric scores against.
Good expected behaviors are:
- Specific — describes a concrete action or output
- Observable — can be verified from the conversation transcript
- Binary — it either happened or it didn't
Examples of GOOD expected behaviors:
- "Agent verifies caller identity before sharing account details"
- "Agent provides a confirmation number"
- "Agent offers at least two alternative time slots"
- "Agent does NOT share information from a different policy"
Examples of BAD expected behaviors (avoid these):
- "Agent is helpful" — too vague
- "Agent sounds nice" — subjective
- "Agent handles the situation well" — not observable
Present each test case with its expected behaviors for confirmation. Let the user add, remove, or edit behaviors.
## Phase 5: Bulk Creation
Create each test case:
```bash
coval test-cases create \
  --test-set-id <test_set_id> \
  --input '<scenario text>' \
  --expected "Agent greets the customer professionally" \
  --expected "Agent verifies caller identity" \
  --expected "Agent resolves the issue or escalates" \
  --description "<test case name>" \
  --format json
```
Pass each expected behavior as a separate `--expected` flag. This ensures they are stored as individual items in the `expected_behaviors` array, which the Composite Evaluation metric scores individually.
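One way to assemble the repeated flags in bash (a sketch; the behaviors are the examples from above):

```bash
# Build the repeated --expected flags from an array, then pass them through.
behaviors=(
  "Agent greets the customer professionally"
  "Agent verifies caller identity"
  "Agent resolves the issue or escalates"
)
expected_flags=()
for b in "${behaviors[@]}"; do
  expected_flags+=(--expected "$b")
done

coval test-cases create \
  --test-set-id "$test_set_id" \
  --input '<scenario text>' \
  "${expected_flags[@]}" \
  --description "<test case name>" \
  --format json
```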
Shell tip: Use single quotes for `--input` values to avoid shell interpolation issues (e.g., `$45.99` becoming `.99`).
If the CLI does not support multiple `--expected` flags, use the Coval API directly for structured expected behaviors:
```bash
curl -s -X POST https://api.coval.dev/v1/test-cases \
  -H "X-API-Key: $COVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_set_id": "<test_set_id>",
    "input_str": "<scenario text>",
    "expected_behaviors": [
      "Agent greets the customer professionally",
      "Agent verifies caller identity",
      "Agent resolves the issue or escalates"
    ],
    "description": "<test case name>"
  }'
```
Present progress as each test case is created. Capture `test_case_id` from each response.
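Continuing the sketch above, and assuming each create response exposes the new ID as a top-level `test_case_id` key:

```bash
# Create one case, capture the returned ID, and report progress.
resp=$(coval test-cases create \
  --test-set-id "$test_set_id" \
  --input '<scenario text>' \
  --expected "<behavior>" \
  --description "<test case name>" \
  --format json)
test_case_id=$(jq -r '.test_case_id' <<<"$resp")
echo "Created test case: $test_case_id"
```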
## Phase 6: Coverage Summary + Next Steps
Present what was created:

```
Test Suite Complete!

Test Set: <name> (<test_set_id>)
Test Cases: <N> total
  [happy_path] <count>
  [edge_case] <count>
  [compliance] <count>
```
### Coverage Analysis
Review the test cases and suggest areas that might need more coverage:
- Are there common failure modes not covered?
- Are there regulatory requirements specific to the vertical?
- Would the agent benefit from multi-turn conversation tests?
- Are there language/accent scenarios worth testing (for voice agents)?
### Suggested Next Steps
- Design a test persona: `/design-persona`
- Configure evaluation metrics: `/configure-metrics`
- Launch a quick evaluation: `/quick-eval`
- Add more test cases later:

```bash
coval test-cases create --test-set-id <test_set_id> --input "..." --expected "..." --description "..."
```