# EvalView Agent Testing
Automated regression testing for AI agents. EvalView snapshots your agent's behavior (tool calls, parameters, sequence, output), then diffs against the baseline after every change. When something breaks, you know immediately — before it ships.
## When to Activate
- After modifying agent code, prompts, or tool definitions
- After a model update or provider change
- Before deploying an agent to production
- When setting up CI/CD for an agent project
- When an autonomous loop (OpenClaw, coding agents) needs a fitness function
- When agent output changes unexpectedly and you need to identify what shifted
## Core Workflow

```bash
# 1. Set up
pip install "evalview>=0.5,<1"
evalview init       # Detect agent, create starter test suite

# 2. Baseline
evalview snapshot   # Save current behavior as golden baseline

# 3. Gate every change
evalview check      # Diff against baseline — catches regressions

# 4. Monitor in production
evalview monitor --slack-webhook https://hooks.slack.com/services/...
```
## Understanding Check Results

| Status | Meaning | Action |
|---|---|---|
| `PASSED` | Behavior matches baseline | Ship with confidence |
| `TOOLS_CHANGED` | Different tools called | Review the diff |
| `OUTPUT_CHANGED` | Same tools, output shifted | Review the diff |
| `REGRESSION` | Score dropped significantly | Fix before shipping |
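In scripts, these four statuses can drive branching directly. A minimal sketch of the mapping above, using a stand-in `Enum` so it runs without the package (evalview ships its own `DiffStatus`; the string values here are assumptions):

```python
from enum import Enum

class DiffStatus(Enum):
    """Stand-in mirroring evalview's DiffStatus (values assumed)."""
    PASSED = "passed"
    TOOLS_CHANGED = "tools_changed"
    OUTPUT_CHANGED = "output_changed"
    REGRESSION = "regression"

def action_for(status: DiffStatus) -> str:
    """Map a check status to the recommended action from the table."""
    if status is DiffStatus.PASSED:
        return "ship"
    if status in (DiffStatus.TOOLS_CHANGED, DiffStatus.OUTPUT_CHANGED):
        return "review-diff"
    return "fix-before-shipping"  # REGRESSION

action_for(DiffStatus.REGRESSION)  # → "fix-before-shipping"
```

The two `*_CHANGED` statuses share a branch because both call for human review rather than an automatic fail.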
## Python API for Autonomous Loops

Use `gate()` as a programmatic regression gate inside agent frameworks, autonomous coding loops, or CI scripts:

```python
from evalview import gate, DiffStatus

# Full evaluation
result = gate(test_dir="tests/")
if not result.passed:
    for d in result.diffs:
        if not d.passed:
            delta = f" ({d.score_delta:+.1f})" if d.score_delta is not None else ""
            print(f"  {d.test_name}: {d.status.value}{delta}")

# Quick mode — no LLM judge, $0, sub-second
result = gate(test_dir="tests/", quick=True)
```
### Auto-Revert on Regression

```python
from evalview.openclaw import gate_or_revert

# In an autonomous coding loop:
make_code_change()
if not gate_or_revert("tests/", quick=True):
    # Change was automatically reverted
    try_alternative_approach()
```

**Warning:** `gate_or_revert` runs `git checkout -- .` when a regression is detected, discarding uncommitted changes. Commit or stash work-in-progress before entering the loop. You can also pass a custom revert command: `gate_or_revert("tests/", revert_cmd="git stash")`.
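Because the revert is destructive, one defensive pattern is to refuse to run the gate at all while the tree is dirty. A sketch, with the gate call injected as a plain callable so the guard stays library-agnostic (`safe_gate` and its wiring are illustrative, not part of evalview):

```python
import subprocess

def worktree_is_dirty() -> bool:
    """True if `git status --porcelain` reports uncommitted changes."""
    out = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True
    )
    return bool(out.stdout.strip())

def safe_gate(run_gate, is_dirty=worktree_is_dirty):
    """Run a destructive gate only when there is nothing uncommitted to lose.

    `run_gate` is any zero-arg callable, e.g. a lambda wrapping
    evalview's gate_or_revert.
    """
    if is_dirty():
        raise RuntimeError("commit or stash before running the gate")
    return run_gate()
```

In the loop above you would then call `safe_gate(lambda: gate_or_revert("tests/", quick=True))` instead of calling `gate_or_revert` directly.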
## MCP Integration

EvalView exposes 8 tools via MCP — works with Claude Code, Cursor, and any MCP client:

```bash
claude mcp add --transport stdio evalview -- evalview mcp serve
```

Tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`

After connecting, Claude Code can proactively check for regressions after code changes:

- "Did my refactor break anything?" triggers `run_check`
- "Save this as the new baseline" triggers `run_snapshot`
- "Add a test for the weather tool" triggers `create_test`
## CI/CD Integration

```yaml
# .github/workflows/evalview.yml
name: Agent Regression Check
on: [pull_request, push]
jobs:
  check:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "evalview>=0.5,<1"
      - run: evalview check --fail-on REGRESSION
```

`--fail-on REGRESSION` gates on score drops only. For stricter gating that also catches tool-sequence changes, use `--fail-on REGRESSION,TOOLS_CHANGED` or `--strict` (fails on any change).
## Test Case Format

```yaml
name: refund-flow
input:
  query: "I need a refund for order #4812"
expected:
  tools: ["lookup_order", "check_refund_policy", "issue_refund"]
  forbidden_tools: ["delete_order"]
  output:
    contains: ["refund", "processed"]
    not_contains: ["error"]
thresholds:
  min_score: 70
```
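Since test cases are plain YAML, a pre-flight sanity check is easy to script before handing files to the runner. A sketch of a validator for the single-turn shape (field names come from the example above; the checks themselves are assumptions, not evalview behavior):

```python
def validate_case(case: dict) -> list[str]:
    """Return human-readable problems found in a single-turn test case."""
    errors = []
    if not case.get("name"):
        errors.append("missing name")
    if "query" not in case.get("input", {}):
        errors.append("missing input.query")
    expected = case.get("expected", {})
    # A tool listed as both expected and forbidden can never pass.
    overlap = set(expected.get("forbidden_tools", [])) & set(expected.get("tools", []))
    if overlap:
        errors.append(f"tools both expected and forbidden: {sorted(overlap)}")
    score = case.get("thresholds", {}).get("min_score", 0)
    if not 0 <= score <= 100:
        errors.append("min_score must be in 0-100")
    return errors

validate_case({"name": "refund-flow", "input": {"query": "refund #4812"}})  # → []
```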
Multi-turn tests are also supported:

```yaml
name: clarification-flow
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "issue_refund"]
```
## Best Practices

- **Snapshot after every intentional change.** Baselines should reflect intended behavior.
- **Use `--preview` before snapshotting.** `evalview snapshot --preview` shows what would change without saving.
- **Quick mode for tight loops.** `gate(quick=True)` skips the LLM judge — free and fast for iterative development.
- **Full evaluation for final validation.** Run without `quick=True` before deploying to get LLM-as-judge scoring.
- **Commit `.evalview/golden/` to git.** Baselines should be versioned. Don't commit `state.json`.
- **Use variants for non-deterministic agents.** `evalview snapshot --variant v2` stores alternate valid behaviors (up to 5).
- **Monitor in production.** `evalview monitor` catches gradual drift that individual checks miss.
## Installation

```bash
pip install "evalview>=0.5,<1"
```

Package: `evalview` on PyPI