# EvalView Agent Testing
Automated regression testing for AI agents. EvalView snapshots your agent's behavior (tool calls, parameters, sequence, output), then diffs against the baseline after every change. When something breaks, you know immediately — before it ships.
## When to Activate
- After modifying agent code, prompts, or tool definitions
- After a model update or provider change
- Before deploying an agent to production
- When setting up CI/CD for an agent project
- When an autonomous loop (OpenClaw, coding agents) needs a fitness function
- When agent output changes unexpectedly and you need to identify what shifted
## Core Workflow

```bash
# 1. Set up
pip install "evalview>=0.5,<1"
evalview init       # Detect agent, create starter test suite

# 2. Baseline
evalview snapshot   # Save current behavior as golden baseline

# 3. Gate every change
evalview check      # Diff against baseline — catches regressions

# 4. Monitor in production
evalview monitor --slack-webhook https://hooks.slack.com/services/...
```
## Understanding Check Results

| Status | Meaning | Action |
|---|---|---|
| `PASSED` | Behavior matches baseline | Ship with confidence |
| `TOOLS_CHANGED` | Different tools called | Review the diff |
| `OUTPUT_CHANGED` | Same tools, output shifted | Review the diff |
| `REGRESSION` | Score dropped significantly | Fix before shipping |
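In scripts, these four statuses can drive branching directly. A minimal sketch of the mapping above, using a stand-in `Enum` so it runs without the package (evalview ships its own `DiffStatus`; the string values here are assumptions):

```python
from enum import Enum

class DiffStatus(Enum):
    """Stand-in mirroring evalview's DiffStatus (values assumed)."""
    PASSED = "passed"
    TOOLS_CHANGED = "tools_changed"
    OUTPUT_CHANGED = "output_changed"
    REGRESSION = "regression"

def action_for(status: DiffStatus) -> str:
    """Map a check status to the recommended action from the table."""
    if status is DiffStatus.PASSED:
        return "ship"
    if status in (DiffStatus.TOOLS_CHANGED, DiffStatus.OUTPUT_CHANGED):
        return "review-diff"
    return "fix-before-shipping"  # REGRESSION

action_for(DiffStatus.REGRESSION)  # → "fix-before-shipping"
```

The two `*_CHANGED` statuses share a branch because both call for human review rather than an automatic fail.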
## Python API for Autonomous Loops

Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:

```python
from evalview import gate, DiffStatus

# Full evaluation
result = gate(test_dir="tests/")
if not result.passed:
    for d in result.diffs:
        if not d.passed:
            delta = f" ({d.score_delta:+.1f})" if d.score_delta is not None else ""
            print(f"  {d.test_name}: {d.status.value}{delta}")

# Quick mode — no LLM judge, $0, sub-second
result = gate(test_dir="tests/", quick=True)
```
### Auto-Revert on Regression

```python
from evalview.openclaw import gate_or_revert

# In an autonomous coding loop:
make_code_change()
if not gate_or_revert("tests/", quick=True):
    # Change was automatically reverted
    try_alternative_approach()
```

**Warning:** `gate_or_revert` runs `git checkout -- .` when a regression is detected, discarding uncommitted changes. Commit or stash work-in-progress before entering the loop. You can also pass a custom revert command: `gate_or_revert("tests/", revert_cmd="git stash")`.
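Because the revert is destructive, one defensive pattern is to refuse to run the gate at all while the tree is dirty. A sketch, with the gate call injected as a plain callable so the guard stays library-agnostic (`safe_gate` and its wiring are illustrative, not part of evalview):

```python
import subprocess

def worktree_is_dirty() -> bool:
    """True if `git status --porcelain` reports uncommitted changes."""
    out = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True
    )
    return bool(out.stdout.strip())

def safe_gate(run_gate, is_dirty=worktree_is_dirty):
    """Run a destructive gate only when there is nothing uncommitted to lose.

    `run_gate` is any zero-arg callable, e.g. a lambda wrapping
    evalview's gate_or_revert.
    """
    if is_dirty():
        raise RuntimeError("commit or stash before running the gate")
    return run_gate()
```

In the loop above you would then call `safe_gate(lambda: gate_or_revert("tests/", quick=True))` instead of calling `gate_or_revert` directly.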
## MCP Integration

EvalView exposes 8 tools via MCP — works with Claude Code, Cursor, and any MCP client:

```bash
claude mcp add --transport stdio evalview -- evalview mcp serve
```

Tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`

After connecting, Claude Code can proactively check for regressions after code changes:

- "Did my refactor break anything?" triggers `run_check`
- "Save this as the new baseline" triggers `run_snapshot`
- "Add a test for the weather tool" triggers `create_test`
## CI/CD Integration

```yaml
# .github/workflows/evalview.yml
name: Agent Regression Check
on: [pull_request, push]
jobs:
  check:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "evalview>=0.5,<1"
      - run: evalview check --fail-on REGRESSION
```

`--fail-on REGRESSION` gates on score drops only. For stricter gating that also catches tool-sequence changes, use `--fail-on REGRESSION,TOOLS_CHANGED` or `--strict` (fails on any change).
## Test Case Format

```yaml
name: refund-flow
input:
  query: "I need a refund for order #4812"
expected:
  tools: ["lookup_order", "check_refund_policy", "issue_refund"]
  forbidden_tools: ["delete_order"]
  output:
    contains: ["refund", "processed"]
    not_contains: ["error"]
thresholds:
  min_score: 70
```
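Since test cases are plain YAML, a pre-flight sanity check is easy to script before handing files to the runner. A sketch of a validator for the single-turn shape (field names come from the example above; the checks themselves are assumptions, not evalview behavior):

```python
def validate_case(case: dict) -> list[str]:
    """Return human-readable problems found in a single-turn test case."""
    errors = []
    if not case.get("name"):
        errors.append("missing name")
    if "query" not in case.get("input", {}):
        errors.append("missing input.query")
    expected = case.get("expected", {})
    # A tool listed as both expected and forbidden can never pass.
    overlap = set(expected.get("forbidden_tools", [])) & set(expected.get("tools", []))
    if overlap:
        errors.append(f"tools both expected and forbidden: {sorted(overlap)}")
    score = case.get("thresholds", {}).get("min_score", 0)
    if not 0 <= score <= 100:
        errors.append("min_score must be in 0-100")
    return errors

validate_case({"name": "refund-flow", "input": {"query": "refund #4812"}})  # → []
```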
Multi-turn tests are also supported:

```yaml
name: clarification-flow
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "issue_refund"]
```
## Best Practices

- **Snapshot after every intentional change.** Baselines should reflect intended behavior.
- **Use `--preview` before snapshotting.** `evalview snapshot --preview` shows what would change without saving.
- **Quick mode for tight loops.** `gate(quick=True)` skips the LLM judge — free and fast for iterative development.
- **Full evaluation for final validation.** Run without `quick=True` before deploying to get LLM-as-judge scoring.
- **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
- **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
- **Monitor in production.** `evalview monitor` catches gradual drift that individual checks miss.
## Installation

```bash
pip install "evalview>=0.5,<1"
```

Package: `evalview` on PyPI