agent-harness
Agent Harness Framework
Build custom agents that follow a structured orchestrator pattern: plan → execute → evaluate → loop → evolve.
Architecture
workdir/ # Agent's root working directory
├── {agent-name}.json # Agent config (goes to .kiro/agents/)
├── tasks/
│ ├── {task-id}/ # One dir per task execution
│ │ ├── goal.md # Task goal, acceptance criteria, and evaluation rubric
│ │ ├── plan.md # Latest planner output (symlink or copy)
│ │ ├── work/ # Latest worker output artifacts
│ │ ├── eval.md # Latest evaluator scores and feedback
│ │ ├── iterations.json # Loop history with scores and changelog
│ │ ├── history/ # Versioned snapshots per round
│ │ │ ├── round-1/
│ │ │ │ ├── plan.md
│ │ │ │ ├── work/
│ │ │ │ ├── eval.md
│ │ │ │ └── changelog.md # What changed from previous round
│ │ │ ├── round-2/
│ │ │ │ └── ...
│ │ │ └── ...
│ │ └── context/ # Shared context from other agents/runs
│ └── ...
├── evolution/
│ ├── experience-log.json # Raw execution history (append-only)
│ └── lessons.md # Curated lessons (<200 lines)
└── skills/
└── {skill-name}/SKILL.md # Agent-specific skills
Workflow
1. Scaffold Agent
Create the agent workdir and config. Use references/agent-schema.md for the full config schema.
mkdir -p {workdir}/tasks {workdir}/evolution {workdir}/skills
Generate {agent-name}.json with:
- `tools`: must include `subagent`, `fs_read`, `fs_write`, `fs_list`, `shell`
- `toolsSettings.subagent.trustedAgents`: list dedicated subagent names (not `"default"` — see Subagent Design below)
- `resources`: link to this skill + any domain skills
- `prompt`: include the orchestrator instructions (see Orchestrator Prompt Template below), referencing subagents by name
Generate a separate agent config for each subagent role (planner, worker, evaluator). See Subagent Design below.
2. Create Task
For each goal, create a task directory:
task_id="{slug}-$(head -c4 /dev/urandom | xxd -p)"
mkdir -p {workdir}/tasks/${task_id}/{work,context,history}
Task ID format: {slug}-{8char_hex} (e.g., blog-post-draft-a1b2c3d4). The slug is a short kebab-case name derived from the task goal (3-5 words max), and the hex suffix prevents collisions.
To resume an existing task, the user provides the task dir name (e.g., "resume blog-post-draft-a1b2c3d4"). The orchestrator should:
- Check that `{workdir}/tasks/{task_id}/` exists
- Read `iterations.json` to determine which step to resume from
- Read `eval.md` if it exists — if FAIL, resume from Plan step with eval feedback
- If no `eval.md`, check what files exist: no `plan.md` → start from Plan, no `work/` output → start from Execute, no `eval.md` → start from Evaluate (see the shell sketch below)
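A minimal shell sketch of this resume check, assuming the orchestrator runs it via its shell tool (paths and messages illustrative):

task_dir="{workdir}/tasks/${task_id}"
if [ ! -d "$task_dir" ]; then echo "unknown task: ${task_id}"
elif [ -f "$task_dir/eval.md" ] && grep -q "FAIL" "$task_dir/eval.md"; then echo "resume from Plan with eval feedback"
elif [ ! -f "$task_dir/plan.md" ]; then echo "resume from Plan"
elif [ -z "$(ls -A "$task_dir/work" 2>/dev/null)" ]; then echo "resume from Execute"
elif [ ! -f "$task_dir/eval.md" ]; then echo "resume from Evaluate"
else echo "task already passed; treat follow-up as refinement"; fi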
To refine a completed task (user sends follow-up feedback after PASS or max iterations):
- The orchestrator stays on the same task — does NOT create a new one
- Writes the user's feedback to `context/prev-eval.md`
- Re-enters the loop from Plan step, continuing the iteration count
- This is the default behavior for any follow-up message. Only start a new task if the user explicitly asks for a different topic.
Write goal.md with:
- Goal statement (what to achieve)
- Acceptance criteria (how to know it's done)
- Evaluation rubric (custom scoring dimensions for this task — see Evaluation Rubric Design below)
- Settings (optional — overrides agent defaults):
  - `pass_threshold`: minimum score each dimension must meet to PASS (omit to use agent default)
  - `max_iterations`: maximum plan→execute→evaluate loops (omit to use agent default)
Settings precedence: goal.md Settings section > orchestrator prompt defaults. If goal.md omits Settings, the orchestrator's defaults apply.
Example goal.md:
## Goal
Generate a quarterly summary report from project tracking data.
## Acceptance Criteria
- Report covers all 4 quadrants: Highlights, Lowlights, Risks, Observations
- Each entry uses What/So What/Now What format
- Data sourced from project tracker, not fabricated
## Evaluation Rubric
| Dimension | Weight | What to check |
|-----------|--------|---------------|
| Data Accuracy | 0.3 | Every claim traceable to a data source |
| Format Compliance | 0.2 | All entries follow What/So What/Now What |
| Coverage | 0.3 | All 4 quadrants populated, no major gaps |
| Clarity | 0.2 | Concise, no jargon, actionable language |
## Settings
- pass_threshold: 8
- max_iterations: 2
If the Settings section is omitted entirely, the orchestrator uses its own defaults (defined in its prompt — typically pass_threshold: 7, max_iterations: 3, but each agent sets its own). The orchestrator MUST always write the effective pass_threshold into goal.md so the evaluator knows the target — the evaluator has no access to the orchestrator's prompt.
3. Plan (Subagent)
Spawn a planner subagent that:
- Reads `goal.md`
- Reads `context/` for prior run data or cross-agent context
- Reads `evolution/lessons.md` for accumulated experience
- Produces `plan.md` with ordered steps, expected outputs, and risk flags
Planner prompt pattern:
Read {task_dir}/goal.md and {task_dir}/context/ for background.
Read {workdir}/evolution/lessons.md for past experience.
Write a plan to {task_dir}/plan.md with:
- Numbered steps with expected output per step
- Dependencies between steps
- Risk flags and mitigations
- Estimated complexity (low/medium/high)
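A hypothetical plan.md excerpt for the quarterly-report goal above (content illustrative; the structure is what matters):

## Plan
1. Export Q3 records from the project tracker. Expected output: context/tracker-export.md. Complexity: low.
2. Draft Highlights and Lowlights in What/So What/Now What format. Expected output: work/report-draft.md. Depends on step 1. Complexity: medium.
3. Draft Risks and Observations. Expected output: appended to work/report-draft.md. Depends on step 1. Complexity: medium.
## Risk Flags
- Tracker export may be incomplete → cross-check record counts before drafting.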
4. Execute (Subagent)
Spawn worker subagent(s) that:
- Read `plan.md`
- Execute each step, writing artifacts to `work/`
- Log progress inline in artifacts
Worker prompt pattern:
Read {task_dir}/plan.md. Execute each step.
Write all output artifacts to {task_dir}/work/.
If a step fails, document the failure and continue to next step.
Do not skip steps without documenting why.
For parallelizable work, spawn multiple workers — each writes to a named subdirectory under work/.
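For example, a hypothetical two-worker split (section names illustrative):

mkdir -p {task_dir}/work/highlights {task_dir}/work/risks
# spawn {worker-name} twice, each with a scoped prompt:
#   "Execute plan steps 2-3 only. Write output to work/highlights/."
#   "Execute plan steps 4-5 only. Write output to work/risks/."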
5. Evaluate (Subagent)
Spawn evaluator subagent that:
- Reads `goal.md` — specifically the Evaluation Rubric section for scoring dimensions and weights
- Reads `plan.md` (what was planned)
- Reads `work/` (what was produced)
- Reads previous `eval.md` if this is a retry (to avoid repeating feedback)
- Scores each rubric dimension 1-10, computes weighted overall score
- Writes `eval.md` with pass/fail verdict
Evaluator prompt pattern:
Read {task_dir}/goal.md — extract the Evaluation Rubric (dimensions, weights, criteria).
Read {task_dir}/plan.md for intended approach.
Read {task_dir}/work/ for actual output.
If {task_dir}/eval.md exists, read it — do not repeat prior feedback.
If {task_dir}/history/ exists, read all prior round eval.md and changelog.md files
to build a cumulative fixes checklist (see Plan Adherence below).
PLAN ADHERENCE CHECK (mandatory for every round):
1. Read plan.md and extract every numbered step and its expected output.
2. For each step, verify the corresponding output exists in work/.
If a step has no output and no documented skip reason, flag it.
3. If plan.md contains a "Verification Checklist" section, evaluate every
checklist item and report pass/fail per item in eval.md.
4. For rounds > 1: build a CUMULATIVE FIXES CHECKLIST from all prior rounds.
Scan history/round-{1..N-1}/eval.md for every fix requested, and
history/round-{1..N-1}/changelog.md for every fix applied.
Verify each prior fix is STILL PRESENT in the current work/ output.
If any prior fix has regressed, flag it as a regression and deduct from
the relevant dimension. Regressions are scored more harshly than
first-time misses (a fix that was applied and then lost is worse than
one never applied).
For EACH dimension in the rubric:
- Score 1-10 based on the "What to check" criteria
- Provide 1-2 sentence justification
Compute overall score as weighted average (sum of score × weight).
If no weights specified, use equal weights.
Verdict rule: PASS only if EVERY dimension scores >= threshold.
If ANY single dimension scores below threshold, verdict is FAIL —
even if the weighted average exceeds the threshold.
Write {task_dir}/eval.md with:
- Plan Adherence Report:
- Steps completed vs planned (e.g., "8/9 steps completed")
- Verification Checklist results (if plan.md has one): pass/fail per item
- Cumulative Fixes Status (rounds > 1): list every prior-round fix,
whether it is still present (✓) or regressed (✗), with file/line ref
- Any regressions flagged with "[REGRESSION]" tag
- Per-dimension scores, weights, whether each meets threshold, and justification
- Overall weighted score (informational only — verdict is per-dimension)
- List of dimensions below threshold (or "None")
- PASS or FAIL based on per-dimension rule above
- If FAIL: specific, actionable feedback for retry tied to below-threshold dimensions
6. Loop
After each evaluate step, the orchestrator:
- Archives the current round: copy `plan.md`, `work/`, and `eval.md` into `history/round-{N}/` (see the shell sketch below)
- Writes `history/round-{N}/changelog.md` summarizing what changed from the previous round (skip for round 1)
- Updates `iterations.json` with the round's scores and changelog summary
- Checks `eval.md`:
  - If PASS or max iterations reached → proceed to evolve step
  - If FAIL → copy eval feedback to `context/prev-eval.md`, re-run from Plan step
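A minimal sketch of the archive step, assuming the orchestrator uses its shell tool (round number illustrative):

round=2
mkdir -p {task_dir}/history/round-${round}
cp {task_dir}/plan.md {task_dir}/eval.md {task_dir}/history/round-${round}/
cp -r {task_dir}/work {task_dir}/history/round-${round}/work
# changelog.md is written next, summarizing the delta from round-$((round - 1))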
Changelog format (history/round-{N}/changelog.md):
## Round {N} Changes (from Round {N-1})
### Eval Feedback Addressed
- {dimension}: {what the evaluator said} → {what changed}
### Plan Changes
- {what the planner changed and why}
### Output Changes
- {specific sections rewritten, data added, structure changed}
### Score Delta
| Dimension | Round {N-1} | Round {N} | Delta |
|-----------|-------------|-----------|-------|
| ... | ... | ... | ... |
iterations.json structure:
{
"task_id": "...",
"goal": "...",
"threshold": 9,
"max_iterations": 10,
"rubric_dimensions": ["Data Accuracy", "Format Compliance", "Coverage", "Clarity"],
"iterations": [
{
"round": 1,
"plan_summary": "...",
"scores": {
"Data Accuracy": {"score": 6, "weight": 0.3},
"Format Compliance": {"score": 8, "weight": 0.2},
"Coverage": {"score": 5, "weight": 0.3},
"Clarity": {"score": 7, "weight": 0.2}
},
"overall": 6.3,
"verdict": "FAIL",
"dimensions_below_threshold": ["Data Accuracy", "Coverage", "Clarity"],
"feedback_summary": "Coverage gaps in Risks quadrant; Data Accuracy needs source links",
"changelog_summary": null
},
{
"round": 2,
"plan_summary": "...",
"scores": {
"Data Accuracy": {"score": 8, "weight": 0.3},
"Format Compliance": {"score": 9, "weight": 0.2},
"Coverage": {"score": 8, "weight": 0.3},
"Clarity": {"score": 8, "weight": 0.2}
},
"overall": 8.2,
"verdict": "FAIL",
"dimensions_below_threshold": ["Data Accuracy"],
"feedback_summary": "Data Accuracy still missing 2 source links in Background",
"changelog_summary": "Added Risks quadrant content, linked 4 data sources to claims"
}
]
}
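Worked through on the sample above (threshold 8): round 2's weighted overall is 0.3×7 + 0.2×9 + 0.3×9 + 0.2×9 = 8.4, which exceeds the threshold, yet the verdict is FAIL because Data Accuracy (7) is below 8. The per-dimension rule overrides the weighted average.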
7. Evolve
After task completion (pass or max iterations), run evolution:
- Append to `evolution/experience-log.json`:
  {
    "task_id": "...",
    "timestamp": "ISO-8601",
    "goal_summary": "...",
    "iterations_count": 2,
    "final_score": 8.5,
    "verdict": "PASS",
    "key_learnings": ["...", "..."],
    "skill_gaps": ["...", "..."]
  }
- Curate `evolution/lessons.md` (example excerpt below):
  - Check if new learnings duplicate existing lessons → merge
  - Add new actionable lessons (1-2 lines each, "Do X when Y" format)
  - If approaching 200 lines, retire least impactful entries
  - Group by category: Planning, Execution, Evaluation, Domain
- If skill improvements identified, propose updates (don't auto-apply without user confirmation).
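A hypothetical lessons.md excerpt, showing the "Do X when Y" format grouped by category (lesson content illustrative):

## Planning
- Do pull the tracker export before drafting when the rubric includes Data Accuracy; fabricated entries fail round 1.
## Execution
- Do write one artifact per report section so regressions are easy to spot in changelogs.
## Evaluation
- Avoid repeating prior-round feedback verbatim; cite the round number and what regressed instead.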
Orchestrator Prompt Template
Use this as the base prompt for the major agent. Reference subagents by their dedicated agent names:
You are an ORCHESTRATOR ONLY. You do NOT research, write, or evaluate.
You ONLY coordinate subagents and manage the task loop.
Subagents:
- {planner-name}: Researches and plans the work
- {worker-name}: Executes the plan and produces artifacts
- {evaluator-name}: Scores output against the rubric in goal.md
CRITICAL RULES:
- You MUST delegate ALL research/data gathering to {planner-name}.
- You MUST delegate ALL content creation to {worker-name}.
- You MUST delegate ALL evaluation to {evaluator-name}.
- NEVER write to plan.md, work/, context/research.md, or eval.md yourself.
Those are subagent outputs.
- Your ONLY writes are: goal.md, iterations.json, context/prev-eval.md (copying
eval feedback), and evolution files.
Agent defaults (used when goal.md omits Settings section):
- pass_threshold: {set per agent, recommend 9 for quality-critical work}
- max_iterations: {set per agent, recommend 10 to allow thorough refinement}
When writing goal.md, include a Settings section ONLY if the user specifies
custom values. Otherwise omit it and these defaults apply.
Always write the effective pass_threshold into goal.md so the evaluator knows the target.
For each task:
1. If user provides an existing task_id (e.g., "resume blog-post-draft-a1b2c3d4"),
resume it — see resume logic below. Otherwise generate new task_id:
{slug}-{8char_hex_random} where slug is a short kebab-case name from the goal
2. Create task dir: {workdir}/tasks/{task_id}/ with work/, context/, and history/ subdirs
3. Write goal.md with the goal, acceptance criteria, and evaluation rubric
4. Spawn {planner-name}: 'Task dir: {workdir}/tasks/{task_id}. Read goal.md. Write plan.md.'
5. Verify plan.md exists. Spawn {worker-name}: 'Task dir: {workdir}/tasks/{task_id}. Read plan.md. Write to work/.'
6. Verify work output exists. Spawn {evaluator-name}: 'Task dir: {workdir}/tasks/{task_id}. Read goal.md for rubric. Evaluate work/. Write eval.md.'
7. Archive round: copy plan.md, work/, eval.md to history/round-{N}/. Write changelog.md.
8. Read eval.md. If FAIL and iterations < max: copy eval to context/prev-eval.md, loop from step 4
9. If PASS or max iterations reached: read final output and present to user, run evolution
Resume logic:
- Verify {workdir}/tasks/{task_id}/ exists
- Check files to determine resume point:
eval.md with FAIL verdict → re-plan (step 4) with eval feedback in context/
work/ has output but no eval.md → evaluate (step 6)
plan.md exists but work/ empty → execute (step 5)
Only goal.md → plan (step 4)
Refine logic (user gives feedback on a completed task):
- When the user sends a follow-up message after a task has PASSED (or reached
max iterations), treat it as a refinement request on the SAME task.
- Write the user's feedback to context/prev-eval.md (same as eval feedback)
- Re-enter the loop from step 4 (plan) with the user feedback as context
- The planner and worker will read prev-eval.md and adjust accordingly
- Continue the iteration count from where it left off
- This is the DEFAULT behavior for any user message that follows a completed task.
Do NOT start a new task unless the user explicitly asks for a different topic.
Always include the full task_dir path when spawning subagents.
Always read evolution/lessons.md before starting (if it exists).
After completion, append to evolution/experience-log.json and curate lessons.md.
Agent Config Template
See references/agent-schema.md for the full Kiro agent JSON schema.
Orchestrator config (note: NO domain-specific MCP tools — only coordination tools):
{
"name": "{agent-name}",
"description": "Orchestrator agent with plan→execute→evaluate loop",
"prompt": "... (orchestrator prompt above with hard prohibitions) ...",
"mcpServers": {},
"tools": ["fs_read", "fs_write", "fs_list", "shell", "subagent"],
"allowedTools": ["fs_read", "fs_write", "fs_list", "shell", "subagent"],
"resources": ["skill://.kiro/skills/agent-harness/SKILL.md"],
"toolsSettings": {
"subagent": { "trustedAgents": ["{planner-name}", "{worker-name}", "{evaluator-name}"] },
"shell": { "autoAllowReadonly": true, "deniedCommands": ["rm .*"] },
"write": { "allowedPaths": ["{workdir}/**", ".kiro/data/**"] }
}
}
Subagent Design
Create dedicated custom agents for each role instead of using default. Each subagent gets its own config file with a focused prompt, minimum-privilege tools, and only the skills it needs.
Why dedicated subagents
- The `default` agent has no domain-specific prompt — it guesses what to do from the orchestrator's inline instructions
- Dedicated agents have their own system prompt, so they know their role without the orchestrator repeating instructions
- Tool access is scoped per role — the planner gets API access, the writer gets file-only, the evaluator gets read/write only
- Skills in `resources` are loaded into the subagent's context automatically — but the prompt should still explicitly tell the agent to read them
Naming convention
Use {domain}-{role} pattern: report-planner, report-worker, report-evaluator.
Subagent config patterns
Planner (needs API/data access):
{
"name": "{domain}-planner",
"prompt": "You are a research and planning specialist for {domain}. Read goal.md, query data sources, write research to context/research.md and plan to plan.md.",
"mcpServers": { "...": "..." },
"tools": ["@server/tool-a", "fs_read", "fs_write", "fs_list"],
"allowedTools": ["@server/tool-a", "fs_read", "fs_write", "fs_list"],
"resources": ["skill://.kiro/skills/{data-skill}/SKILL.md"],
"toolsSettings": { "write": { "allowedPaths": ["{workdir}/**"] } }
}
Worker (needs domain skills, usually no API access):
{
"name": "{domain}-worker",
"prompt": "You are a {domain} specialist. Read plan.md and context/. Produce artifacts in work/. Read {skill-path} and apply its rules.",
"mcpServers": {},
"tools": ["fs_read", "fs_write", "fs_list"],
"allowedTools": ["fs_read", "fs_write", "fs_list"],
"resources": ["skill://.kiro/skills/{domain-skill}/SKILL.md", "skill://.kiro/skills/{writing-skill}/SKILL.md"],
"toolsSettings": { "write": { "allowedPaths": ["{workdir}/**"] } }
}
Evaluator (read/write only, no API, no shell):
{
"name": "{domain}-evaluator",
"prompt": "You are a quality evaluator for {domain}. Read goal.md for the rubric. Read work/ for output. Score each dimension 1-10. Write eval.md.",
"mcpServers": {},
"tools": ["fs_read", "fs_write", "fs_list"],
"allowedTools": ["fs_read", "fs_write", "fs_list"],
"resources": ["skill://.kiro/skills/{domain-skill}/SKILL.md"],
"toolsSettings": { "write": { "allowedPaths": ["{workdir}/**"] } }
}
Key principles
- Least privilege: Only give each subagent the tools it actually needs. Planner needs API access. Worker usually doesn't. Evaluator never does.
- Orchestrator has NO domain tools: The orchestrator must NOT have the same API/MCP tools as its subagents. If the orchestrator can query the data source, it will skip the planner and do it itself. Remove all domain-specific tools from the orchestrator — it only needs `subagent`, `fs_read`, `fs_write`, `fs_list`, and optionally `shell` for mkdir.
- Hard prohibitions in orchestrator prompt: Soft language like "spawn subagent to do X" gets ignored. Use explicit prohibitions: "You MUST delegate ALL research to {planner}. You do NOT have database access. NEVER write to plan.md yourself."
- Verify before chaining: After each subagent completes, the orchestrator should verify the expected output file exists before spawning the next subagent. E.g., check plan.md exists before spawning the writer.
- Explicit skill references: Listing a skill in `resources` loads it into context, but the subagent's prompt should still say "Read {skill-path} and apply its rules." This ensures the agent actually uses the skill rather than ignoring loaded context.
- No shell for evaluator: The evaluator should never execute code or commands — it only reads and scores.
- Write paths scoped to workdir: All subagents write only to `{workdir}/**`. No access to `.kiro/agents/`, `.kiro/skills/`, or other system paths.
- Config staging: Write agent configs to `.kiro/data/` first (write-path-guard may block `.kiro/agents/`), then copy to `.kiro/agents/` to activate (see the sketch below).
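A minimal staging sketch, assuming the write-path guard blocks direct writes to `.kiro/agents/` (paths illustrative):

mkdir -p .kiro/data/staging
# write {agent-name}.json and the three subagent configs to .kiro/data/staging/ first
cp .kiro/data/staging/*.json .kiro/agents/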
Cross-Agent Context Sharing
Agents share context through the task directory's context/ folder:
- Before planning, copy relevant outputs from other agents' task dirs into `context/` (see the copy sketch below)
- Use filenames that indicate source: `context/{source-agent}_{task-id}_summary.md`
- Previous iteration eval feedback goes to `context/prev-eval.md`
- The orchestrator decides what context to surface — subagents should not reach outside their task dir
Evaluation Rubric Design
The rubric is defined per-task in goal.md. When scaffolding a new agent, define a default rubric in the orchestrator prompt that fits the agent's domain. Individual tasks can override it.
Rubric format
## Evaluation Rubric
| Dimension | Weight | What to check |
|-----------|--------|---------------|
| {name} | {0.0-1.0} | {specific criteria the evaluator checks} |
Weights must sum to 1.0. If omitted, evaluator uses equal weights across all dimensions.
Rubric design guidelines
- 3-6 dimensions per rubric — fewer is better, each must be independently scorable
- Weights reflect what matters most for this task — put weight on what's hardest to get right
- "What to check" should be concrete and observable, not vague ("good quality")
- Include at least one dimension for correctness/accuracy — the evaluator can't improve what it can't measure
- For code-producing agents: consider dimensions like test coverage, type safety, no lint errors
- For data-producing agents: consider dimensions like source traceability, completeness, freshness
- For document-producing agents: consider dimensions like format compliance, clarity, audience fit
- Plan Adherence is enforced by the evaluator as a mandatory check (not a rubric dimension) — the evaluator verifies every plan step has output, checks verification checklists, and tracks cumulative fix regressions across rounds. This is separate from the rubric to avoid diluting domain-specific dimensions, but regressions and plan deviations deduct from the most relevant rubric dimension (e.g., Completeness, Structure, or Format).
Example rubrics by domain
Code generation:
| Dimension | Weight | What to check |
|-----------|--------|---------------|
| Correctness | 0.35 | Code compiles, passes tests, handles edge cases |
| Completeness | 0.25 | All requirements addressed, no TODOs left |
| Code Quality | 0.20 | Idiomatic, readable, no duplication |
| Security | 0.20 | No hardcoded secrets, input validated, deps pinned |
Data pipeline:
| Dimension | Weight | What to check |
|-----------|--------|---------------|
| Data Accuracy | 0.35 | Output matches source data, no fabrication |
| Schema Compliance | 0.25 | Output matches expected schema exactly |
| Completeness | 0.25 | All records processed, no silent drops |
| Performance | 0.15 | Runs within expected time/resource bounds |
Report generation:
| Dimension | Weight | What to check |
|-----------|--------|---------------|
| Source Fidelity | 0.30 | Every claim traceable to a data source |
| Coverage | 0.25 | All required sections populated |
| Format | 0.20 | Follows template structure and style |
| Actionability | 0.25 | Recommendations are specific and feasible |
Conventions
- Task IDs: `{slug}-{8char_hex}` format — slug is kebab-case from goal (e.g., `blog-post-draft-a1b2c3d4`)
- To resume a task, user provides the task dir name — orchestrator detects progress and resumes from the right step
- All agent output goes to files, not inline in prompts (keeps context clean)
- Evaluator scores are integers 1-10 per rubric dimension; verdict requires every dimension to meet the threshold (not just the weighted average)
- Rubric is defined in goal.md per task; agent-level default rubric goes in orchestrator prompt
- pass_threshold and max_iterations: set agent defaults in orchestrator prompt, override per-task in goal.md Settings section. Verdict is per-dimension: every dimension must score >= threshold to PASS.
- Evolution lessons use imperative form: "Do X when Y", "Avoid X because Y"
Lessons from Experience
Practical lessons learned from scaffolding and running harness agents (report-generator, code-reviewer).
Scaffolding
- Do place workdirs under `.kiro/data/{agent-name}/` when write-path guards restrict access to other directories. The `.kiro/data/` path is typically always writable.
- Do write agent configs to `.kiro/agents/` directly when temp-approved, or stage in `.kiro/data/` and copy if blocked.
- Do seed `evolution/lessons.md` with domain-relevant starter lessons during scaffolding — the planner reads these from round 1.
- Do seed `evolution/experience-log.json` with an empty array `[]` so the orchestrator can append without checking existence (see the sketch below).
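A minimal seeding sketch reflecting these lessons (agent name illustrative):

mkdir -p .kiro/data/my-agent/tasks .kiro/data/my-agent/evolution .kiro/data/my-agent/skills
echo '[]' > .kiro/data/my-agent/evolution/experience-log.json
printf '## Planning\n\n## Execution\n\n## Evaluation\n\n## Domain\n' > .kiro/data/my-agent/evolution/lessons.md
# add a few domain starter lessons under each category before the first run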
Orchestrator Prompt
- Do explicitly list each subagent's capabilities in the orchestrator prompt (e.g., "report-planner has database read access and API access"). The orchestrator can't inspect subagent configs at runtime — it only knows what its own prompt tells it.
- Do include hard prohibitions: "You do NOT have database access. You do NOT have API access." Soft language gets ignored — the orchestrator will try to do research itself if it has any ambiguity.
- Do update the spawn prompt to tell the planner to use its specific tools: "Search the database for context. Query the API for live data." Generic prompts like "analyze the data" lead to the planner ignoring its MCP tools.
- Do include the workdir path as a constant in the orchestrator prompt (e.g., `Workdir: .kiro/data/my-agent`) so it doesn't have to guess.
Subagent Design
- Do give the planner all data-access MCP tools (database, API, etc.) and the worker/evaluator none. This enforces the research→produce→evaluate separation.
- Do use the `@server/*` wildcard for MCP tool access in `tools` and `allowedTools` when you want all tools from a server (e.g., `@your-org.your-mcp-server/*`).
- Do add the data-source skill to the planner's `resources` when it has database access — the skill contains connection details and query patterns the planner needs.
- Do include data source identifiers directly in the planner's prompt for the sources it should search. The planner won't discover them on its own.
- Avoid giving the evaluator `shell` access — it should only read and score, never execute.
- Avoid giving the orchestrator any MCP servers. If it can query the database or API, it will bypass the planner.
Rubric Design
- Do design rubrics with 3-6 dimensions. More than 6 dilutes evaluator focus.
- Do include a math/accuracy dimension for any agent that produces numerical output (cost estimates, metrics, calculations). The evaluator should spot-check arithmetic.
- Do weight the hardest-to-get-right dimension highest — that's where iteration loops add the most value.
Plan Adherence & Regression Prevention
- Do instruct the evaluator to check every numbered plan step against work/ output. Silent drops are worse than documented skips.
- Do instruct the planner to include a "Verification Checklist" in plan.md for rounds > 1. The evaluator uses this as a machine-readable scoring input.
- Do instruct the evaluator to build a cumulative fixes checklist from all prior history/round-*/eval.md and changelog.md files. A fix that was applied in round N and regressed in round N+1 should be scored more harshly than a first-time miss.
- Avoid relying solely on prev-eval.md for continuity — it only carries the most recent round's feedback. The evaluator must scan the full history/ directory to catch regressions from any prior round.
Write Path Guards
- When a preToolUse hook restricts write paths, check the allowed list before writing. Adjust the workdir to an allowed path rather than fighting the guard.
- The pattern `.kiro/data/**` is a safe default for agent workdirs when other paths are restricted.