---
name: eval-agents
description: Validates that AGENTS.md (and the skill system) gives agents enough context to work correctly on this project. Use after modifying AGENTS.md, adding/removing skills, or changing core workflows.
---

# Agent Documentation Evaluation
## Why This Exists
AGENTS.md is loaded into every agent's context. It must be:
- Complete enough that agents can make correct decisions without guessing
- Concise enough that it doesn't waste context window on info available via skills/help
- Accurate — wrong information is worse than missing information
This skill tests completeness by spawning subagents with real-world scenarios and checking if they know what to do.
## When to Use
- After trimming or restructuring AGENTS.md
- After adding/removing/renaming skills
- After changing core workflows (task lifecycle, PR flow, shadow branch)
- Periodically as a health check
## How It Works

### Phase 1: Read Current State

Read the current AGENTS.md and the scenario reference file:

```
Read: AGENTS.md
Read: docs/agents-eval-scenarios.md
```
### Phase 2: Spawn Evaluation Agents
For each scenario group, spawn a subagent that:
- Receives ONLY the current AGENTS.md content (plus skill blurbs, as it normally would)
- Gets asked the scenario question
- Must answer what it would do and why
Spawn 3-4 agents in parallel, each handling a cluster of related scenarios (see the sketch after this list):
Agent 1 — Setup & Architecture (Scenarios 1, 2, 11):
- First session setup
- Shadow branch understanding
- Where to find information
Agent 2 — Task Lifecycle & Loop Mode (Scenarios 3, 4, 5, 12):
- Inheriting work
- Blocking decisions
- Post-block behavior
- Batch operations
Agent 3 — Spec-First & PR Flow (Scenarios 6, 7, 10, 14, 15):
- Adding features (spec-first)
- PR workflow pairing
- Scope expansion
- Plan-to-spec
- Commit convention
Agent 4 — Testing & CI (Scenarios 8, 9, 13):
- ULID gotchas
- E2E test setup
- CI limitations
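As a concrete illustration, here is a minimal sketch that drives the four clusters in parallel via the Anthropic TypeScript SDK. Inside an agent harness you would spawn subagents through its own task mechanism instead; the model alias is an assumption, and scenario-text extraction is deliberately simplified.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const agentsMd = readFileSync("AGENTS.md", "utf8");
const allScenarios = readFileSync("docs/agents-eval-scenarios.md", "utf8");

// The four clusters above, by scenario number. Passing the full scenario
// file and telling the agent which numbers to answer keeps the sketch short.
const clusters: number[][] = [
  [1, 2, 11],
  [3, 4, 5, 12],
  [6, 7, 10, 14, 15],
  [8, 9, 13],
];

// One request per cluster, all in flight at once.
const responses = await Promise.all(
  clusters.map((ids) =>
    client.messages.create({
      model: "claude-sonnet-4-5", // assumed model alias; substitute your default
      max_tokens: 4096,
      messages: [
        {
          role: "user",
          content:
            `PROJECT DOCUMENTATION:\n${agentsMd}\n\n` +
            `SCENARIOS (answer ${ids.join(", ")} only):\n${allScenarios}`,
        },
      ],
    }),
  ),
);
```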
### Phase 3: Grade Responses
For each scenario, compare the agent's response against the expected answer:
| Grade | Criteria |
|---|---|
| PASS | Correct action AND correct reasoning |
| PARTIAL | Correct action but wrong/missing reasoning, or mostly right with minor gaps |
| FAIL | Wrong action, would lead to incorrect behavior |
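If you script the grading step, the rubric maps naturally onto a closed type. A minimal sketch (the helper name is illustrative, not an existing project API):

```typescript
// The three grades from the rubric above, plus a scorecard summary helper.
type Grade = "PASS" | "PARTIAL" | "FAIL";

interface ScenarioResult {
  scenario: number;
  name: string;
  grade: Grade;
  notes: string;
}

// Produces the summary line used in the Phase 4 scorecard,
// e.g. "13/15 PASS, 1 PARTIAL, 1 FAIL".
function summarize(results: ScenarioResult[]): string {
  const count = (g: Grade) => results.filter((r) => r.grade === g).length;
  return `${count("PASS")}/${results.length} PASS, ` +
    `${count("PARTIAL")} PARTIAL, ${count("FAIL")} FAIL`;
}
```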
### Phase 4: Report & Fix

Present results as a scorecard:

```markdown
## Evaluation Results

| # | Scenario | Grade | Notes |
|---|----------|-------|-------|
| 1 | First Session Setup | PASS | Correctly identified bootstrap |
| 2 | Shadow Branch | PARTIAL | Knew CLI-only but didn't mention auto-commit |
| ... | ... | ... | ... |

**Score: 13/15 PASS, 1 PARTIAL, 1 FAIL**

### Gaps Found
- Scenario 2: AGENTS.md doesn't emphasize auto-commit enough
- ...

### Recommended Fixes
- Add sentence about auto-commit to Shadow Branch section
- ...
```
If FAILs are found, propose specific AGENTS.md edits to fix them.
## Prompt Template for Eval Agents

Use this prompt template when spawning evaluation agents:

```
You are an AI agent that has been given the following project documentation.
Answer each scenario question by explaining EXACTLY what you would do and why.
Be specific about commands, order of operations, and decision rationale.
If you don't have enough information to answer confidently, say "INSUFFICIENT INFO"
and explain what's missing.

---

PROJECT DOCUMENTATION:
<paste AGENTS.md content here>

---

SCENARIOS:
<paste relevant scenarios here>

For each scenario, respond with:
1. **Action**: What you would do (specific steps)
2. **Reasoning**: Why you chose this approach
3. **Confidence**: High/Medium/Low
```
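A small helper can fill the template mechanically. The sketch below assumes the template is saved at a hypothetical `docs/eval-prompt-template.md` containing the two `<paste ...>` markers shown above; the function name is illustrative.

```typescript
import { readFileSync } from "node:fs";

// Substitutes the two placeholder markers from the template above.
function buildEvalPrompt(scenarioText: string): string {
  const template = readFileSync("docs/eval-prompt-template.md", "utf8"); // hypothetical path
  return template
    .replace("<paste AGENTS.md content here>", readFileSync("AGENTS.md", "utf8"))
    .replace("<paste relevant scenarios here>", scenarioText);
}
```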
## Scenario Reference

The full scenario set lives at `docs/agents-eval-scenarios.md`. Each scenario tests knowledge of a specific area:
| Scenario | Tests |
|---|---|
| 1. First Session Setup | Bootstrap, setup |
| 2. Shadow Branch Confusion | Architecture, CLI-not-YAML |
| 3. Inheriting Work | Task priority, state |
| 4. Task Blocking Decision | Blocking criteria |
| 5. After Blocking in Loop | Agent dispatch continuation |
| 6. Adding a New Feature | Spec-first flow |
| 7. PR Workflow | PR + PR-review pairing |
| 8. Test Fixture ULID | Silent failure gotcha |
| 9. E2E Test Setup | Fixture isolation |
| 10. Scope Expansion | Alignment during work |
| 11. Where to Find Info | Information hierarchy |
| 12. Batch Operations | Efficiency patterns |
| 13. CI Test Failure | CI limitations |
| 14. Plan to Implementation | Plan mode, spec-first |
| 15. Commit Convention | Trailers, linking |
## Adding New Scenarios

When you discover a gap (an agent made a wrong decision due to missing docs), add a scenario:

- Add to `docs/agents-eval-scenarios.md`, following the existing format
- Include: situation, expected answer, what knowledge it tests
- Run `$eval-agents` to verify the new scenario passes with current docs
- If it fails, fix AGENTS.md first, then re-run
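The reference file defines the real layout; purely as a hypothetical illustration, an entry might look like:

```markdown
## Scenario 16: <Short Title>

**Situation**: You are mid-task and discover X; the tempting-but-wrong move is Y.

**Expected answer**: Do Z (via the appropriate kspec command), because <project rule>.

**Tests**: <the AGENTS.md rule or section this exercises>
```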
## Key Principles
- Test real decisions, not trivia — scenarios should reflect actual moments where an agent could go wrong
- Expected answers are prescriptive — they represent the project's preferred way of working
- PARTIAL is a signal — if agents consistently get partial credit on a topic, the docs need strengthening
- Low confidence = doc gap — if an agent says "I'm not sure" about something important, that's a fix needed
## Long-Context Stress Testing
Tests whether agents retain AGENTS.md rules after accumulating significant context from real codebase exploration. This simulates the real failure mode: rules loaded at token 0 get diluted by token 80k+.
### Why Stress Test
In real sessions, agents:
- Load AGENTS.md at conversation start
- Spend 30-100k tokens reading code, running commands, editing files
- Make a critical decision that requires recalling an AGENTS.md rule
The standard eval tests recall with minimal context. Stress testing adds realistic cognitive load between learning and recall.
### Mode 1: Live Exploration (Realistic)

Spawn a single subagent that explores the real codebase, building genuine context, then answers recall questions. This is slower but the most realistic option.
#### Running Mode 1
Spawn ONE general-purpose agent with the full prompt below. The agent will:
- Read AGENTS.md (loaded naturally via project instructions)
- Execute exploration directives (building 30-50k tokens of real context)
- Answer recall questions at the end
Use `model: sonnet` for cost efficiency. Give the agent a generous `max_turns` budget (30+) to complete all exploration and recall.
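If you drive this from a script rather than an interactive session, something like the following works. The Claude Code headless flags (`-p`, `--model`, `--max-turns`) are assumptions to verify against your installed CLI version, and the prompt file path is hypothetical.

```typescript
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

// Exploration directives + recall questions in one prompt; AGENTS.md itself
// is loaded by the agent via project instructions, not pasted here.
const prompt = readFileSync("docs/eval-stress-prompt.md", "utf8"); // hypothetical path

const output = execFileSync(
  "claude",
  ["-p", prompt, "--model", "sonnet", "--max-turns", "40"], // verify flag names
  { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 },
);

writeFileSync("eval-stress-results.md", output); // grade against the rubric below
```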
#### Exploration Directives
These are ordered to build context progressively. Each has a hidden "temptation" that tests whether a specific AGENTS.md rule survives context dilution.
```
PHASE 1: CODEBASE EXPLORATION

You are an agent onboarding to the kynetic-spec project. Complete each
exploration task thoroughly. Read the files, analyze the code, and answer
the questions. Take your time — thoroughness matters.

TASK 1: Understand the task system
Read .kspec/project.tasks.yaml (first 200 lines) and src/schema/task.ts.
Question: What fields does a task have? What are the valid states?

TASK 2: Explore CLI command structure
Read src/cli/commands/task.ts (first 500 lines).
Question: How does the `task start` command work? What validations does it perform?

TASK 3: Understand the spec system
Read .kspec/modules/cli.yaml (first 300 lines) and src/schema/spec.ts.
Question: What types of spec items exist? How are acceptance criteria structured?

TASK 4: Explore the parser
Read src/parser/yaml.ts (first 400 lines).
Question: How does kspec load and save YAML files? What happens on save?

TASK 5: Study test patterns
Read tests/helpers/cli.ts and tests/integration.test.ts (first 300 lines).
Question: What test helpers are available? How are test fixtures set up?

TASK 6: Understand the meta system
Read src/cli/commands/meta.ts (first 400 lines).
Question: What are observations? How do they differ from inbox items?

TASK 7: Explore shadow branch mechanics
Read tests/shadow.test.ts (first 300 lines).
Question: How is the shadow branch worktree tested? What edge cases are covered?

TASK 8: Study E2E test architecture
Read packages/web-ui/tests/fixtures/test-base.ts (if it exists) and
look at packages/web-ui/tests/e2e/ for any test file.
Question: How are E2E tests structured? What does the fixture provide?

TASK 9: Review workflow system
Read tests/workflow-runs.test.ts (first 300 lines).
Question: How do workflows execute? What is the agent dispatch model?

TASK 10: Check CI configuration
Read .github/workflows/ — pick any CI workflow file.
Question: What checks run in CI? Are there any known limitations?
```
#### Recall Questions

After all exploration tasks, present these WITHOUT referencing AGENTS.md:

```
PHASE 2: DECISION SCENARIOS

Now that you've explored the codebase, answer these practical scenarios.
For each, explain what you would do and why.

R1: You found a typo in a task title while reading .kspec/project.tasks.yaml.
The file is right there. How do you fix it?

R2: You need to write a test fixture with a ULID for a trait.
Write out the ULID you would use.

R3: You're writing a new E2E test. Where does the file go, and what
do you import? Do you need to start a daemon?

R4: You need to capture 4 inbox items and 2 observations.
What's the most efficient way?

R5: You implemented a feature and are ready to get it merged.
Walk through the complete flow from "code done" to "task completed."

R6: You're running in agent dispatch mode and your current task requires an API
that doesn't exist yet. Tests are also failing on an unrelated function.
What do you do about each issue?

R7: A plan was just approved. What must happen before you write any
implementation code?

R8: You notice the JSON export has a bug while implementing CSV export
(a different task). What do you do?
```
#### Grading Rubric for Stress Test

| Question | Key Rule Being Tested | Pass Criteria |
|---|---|---|
| R1 | CLI-not-YAML | Uses `kspec task set`, does NOT edit the file |
| R2 | ULID Crockford base32 | Avoids I, L, O, U — uses `testUlid()` or valid chars |
| R3 | E2E fixture isolation | Correct path, imports test-base, no manual daemon |
| R4 | Batch operations | Uses `kspec batch`, not 6 sequential commands |
| R5 | PR workflow pairing | local-review → $pr → $pr-review, task complete after merge |
| R6 | Blocking criteria | Block for missing API (valid); fix failing tests (invalid blocker) |
| R7 | Spec-first / plan mode | Create specs with ACs → derive tasks → then implement |
| R8 | Scope expansion | Capture separately, note in task, don't derail |
Scoring: Same PASS/PARTIAL/FAIL scale. Compare against the standard eval to find degradation.
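For R2 specifically, the constraint is that Crockford base32 never contains I, L, O, or U, so a fixture ULID using those letters fails silently. The project's `testUlid()` helper is the preferred route; the standalone check below is only a sketch of the rule.

```typescript
// Crockford base32 alphabet: 0-9 plus A-Z excluding I, L, O, U; 26 chars total.
const CROCKFORD_ULID = /^[0-9A-HJKMNP-TV-Z]{26}$/;

function isValidUlid(id: string): boolean {
  return CROCKFORD_ULID.test(id);
}

console.log(isValidUlid("01ARZ3NDEKTSV4RRFFQ69G5FAV")); // true
console.log(isValidUlid("01ILOU3NDEKTSV4RRFFQ69G5FA")); // false: contains I, L, O, U
```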
#### Interpreting Results
| Standard Eval | Stress Test | Interpretation |
|---|---|---|
| PASS | PASS | Rule is well-documented and memorable |
| PASS | PARTIAL | Rule needs reinforcement (bolder text, repetition) |
| PASS | FAIL | Rule is too subtle — needs structural prominence |
| FAIL | FAIL | Rule is missing or unclear in AGENTS.md |
### Mode 2: Simulated Trace (Fast, Repeatable)
For quick regression testing, use a pre-built session trace instead of live exploration. This is deterministic and faster since it skips actual file reads.
#### Building a Trace
Generate a trace from a real session:
```bash
# After a real work session, export the conversation context
# (This is manual — copy tool results from a past session into a file)
```

Or build one synthetically by concatenating key file excerpts:

```
Read: src/cli/commands/task.ts (lines 1-500)
[paste actual file content]

Read: .kspec/project.tasks.yaml (lines 1-200)
[paste actual file content]

... etc for 30k+ tokens ...
```
Save it as `docs/eval-session-trace.md` and use it in the eval agent prompt:
```
PROJECT DOCUMENTATION:
<AGENTS.md content>

PREVIOUS SESSION CONTEXT:
<session trace content>

SCENARIOS:
<recall questions>
```
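A synthetic trace can also be generated with a short script. A minimal sketch, mirroring the excerpt list above:

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// [path, number of leading lines to keep]: add excerpts until the trace
// reaches the target size (roughly 30k+ tokens).
const excerpts: Array<[string, number]> = [
  ["src/cli/commands/task.ts", 500],
  [".kspec/project.tasks.yaml", 200],
  // ...more files as needed
];

const trace = excerpts
  .map(([path, lines]) => {
    const head = readFileSync(path, "utf8").split("\n").slice(0, lines).join("\n");
    return `Read: ${path} (lines 1-${lines})\n${head}`;
  })
  .join("\n\n");

writeFileSync("docs/eval-session-trace.md", trace);
```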
#### Keeping Traces Fresh
Session traces go stale as the codebase evolves. Regenerate periodically:
- After major refactoring
- When file structures change significantly
- When adding new modules or commands
### Progressive Degradation Testing
To find the breaking point, run the stress test at multiple context levels:
- Light (10k tokens): 3 exploration tasks → recall
- Medium (30k tokens): 6 exploration tasks → recall
- Heavy (50k+ tokens): All 10 exploration tasks → recall
Compare scores across levels to identify which rules degrade first. Those rules need the most prominent placement in AGENTS.md.
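If you script repeated runs, the three tiers reduce to a small config. A sketch; the token figures are rough targets taken from the list above:

```typescript
// The real knob is how many exploration tasks run before recall.
interface StressLevel {
  name: "light" | "medium" | "heavy";
  explorationTasks: number;
  approxTokens: number;
}

const levels: StressLevel[] = [
  { name: "light", explorationTasks: 3, approxTokens: 10_000 },
  { name: "medium", explorationTasks: 6, approxTokens: 30_000 },
  { name: "heavy", explorationTasks: 10, approxTokens: 50_000 },
];

// Diff per-rule grades across levels: the first tier where a rule drops
// from PASS shows how fragile its placement in AGENTS.md is.
```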
### Temptation Map

Each exploration task creates a specific temptation to violate a rule:
| Exploration Task | What Agent Sees | Temptation Created |
|---|---|---|
| 1. Read task YAML | Raw status fields | Edit YAML directly |
| 2. CLI commands | Complex code | Skip to manual approach |
| 3. Spec system | AC structures in YAML | Add ACs by editing YAML |
| 4. Parser code | Auto-commit logic | Think they understand git enough to skip CLI |
| 5. Test patterns | ULID strings in tests | Copy invalid ULID patterns |
| 6. Meta system | Observation code | Confuse observations with inbox |
| 7. Shadow branch | Git worktree mechanics | Manual git operations |
| 8. E2E tests | Daemon setup code | Start daemon manually |
| 9. Workflows | Agent dispatch internals | Call end-loop prematurely |
| 10. CI config | Skipped tests | Assume CI failure = skip |
The recall questions cover the most critical rules. Not every temptation has a matching recall question — add more as gaps are discovered.