# Eval Harness Updater

Refresh eval harnesses to keep live and fallback modes actionable in unstable environments.
## Focus Areas
- Prompt and parser drift
- Timeout/partial-stream handling
- SLO and regression gates
- Dual-run fallback consistency
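The timeout and partial-stream concerns above can be sketched as a small parser guard. This is a minimal illustration, not code from this skill: `parse_model_output` and its status values are hypothetical names, shown only to make the edge cases concrete.

```python
import json

def parse_model_output(raw: str, *, timed_out: bool = False) -> dict:
    """Parse a model response, degrading gracefully on partial streams.

    Returns a dict with an explicit status so the harness can score
    timeouts and truncated JSON differently from hard failures.
    """
    if timed_out and not raw:
        return {"status": "timeout", "answer": None}
    try:
        return {"status": "ok", "answer": json.loads(raw)}
    except json.JSONDecodeError:
        # Partial stream: salvage the longest valid JSON prefix, if any.
        for end in range(len(raw), 0, -1):
            try:
                return {"status": "partial", "answer": json.loads(raw[:end])}
            except json.JSONDecodeError:
                continue
        return {"status": "unparseable", "answer": None}
```

Scoring each status separately keeps a dual-run fallback comparable to the live run: a timeout is not the same failure as garbled output.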
## Workflow

1. Resolve the harness path.
2. Research test/eval best practices (Exa + arXiv; see Research Gate below).
3. Add RED regressions for parsing and timeout edge cases.
4. Patch minimal harness logic.
5. Validate eval outputs and CI gates.
6. Resolve companion artifact gaps (see the Cross-Reference table below).
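The "Add RED regressions" step can be sketched as follows. Assumptions: `parse_model_output` is a hypothetical harness helper (the stub deliberately fails until the patch step implements it), and the edge cases are illustrative, not prescribed by this skill.

```python
# RED regressions for parsing and timeout edge cases.
# The stub below stands in until the minimal harness patch lands;
# while it raises, every edge case reports as failing (RED).

def parse_model_output(raw: str, *, timed_out: bool = False) -> dict:
    raise NotImplementedError("RED: patch the harness to implement this")

EDGE_CASES = [
    ("", True, "timeout"),              # empty response after a timeout
    ('{"score": 1', False, "partial"),  # stream cut mid-object
    ("\x00\x00", False, "unparseable"), # garbage from a broken transport
]

def run_red_suite() -> list:
    """Return the expected statuses that are still failing (all, while RED)."""
    failing = []
    for raw, timed_out, expected in EDGE_CASES:
        try:
            result = parse_model_output(raw, timed_out=timed_out)
            if result.get("status") != expected:
                failing.append(expected)
        except NotImplementedError:
            failing.append(expected)
    return failing
```

Watching the suite go from all-failing to all-passing is what ties the RED step to the "patch minimal harness logic" step.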
## Research Gate (Exa + arXiv, Both Mandatory)

Before proposing harness changes, gather current best practices:

- Use Exa for implementation and ecosystem patterns:
  - `mcp__Exa__web_search_exa({ query: 'LLM eval harness 2025 best practices' })`
  - `mcp__Exa__get_code_context_exa({ query: 'eval harness parser reliability timeout handling' })`
- Search arXiv for academic research on evaluation methodology (mandatory):
  - Via Exa: `mcp__Exa__web_search_exa({ query: 'site:arxiv.org LLM evaluation harness 2024 2025' })`
  - Direct API: `WebFetch({ url: 'https://arxiv.org/search/?query=LLM+evaluation+harness&searchtype=all&start=0' })`
- Record decisions, constraints, and non-goals in memory learnings.
arXiv is mandatory (not a fallback) when the topic involves LLM evaluation, agent evaluation, SLO gates, regression-testing methodology, or parser reliability.
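The direct arXiv call above can be parameterized for other topics. A small sketch, assuming only that the harness wants to build the same public arXiv search URL programmatically (the helper name is illustrative):

```python
from urllib.parse import urlencode

def arxiv_search_url(topic: str, start: int = 0) -> str:
    """Build the arXiv search URL used by the WebFetch step."""
    params = {"query": topic, "searchtype": "all", "start": start}
    return "https://arxiv.org/search/?" + urlencode(params)
```

`urlencode` handles the space-to-`+` escaping, so topics can be passed as plain strings.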
## Cross-Reference: Creator Ecosystem

This skill is part of the Creator Ecosystem. When research uncovers gaps, trigger the appropriate companion creator:
| Gap Discovered | Required Artifact | Creator to Invoke | When |
|---|---|---|---|
| Domain knowledge needs a reusable skill | skill | `Skill({ skill: 'skill-creator' })` | Gap is a full skill domain |
| Existing skill has incomplete coverage | skill update | `Skill({ skill: 'skill-updater' })` | Close skill exists but is incomplete |
| Capability needs a dedicated agent | agent | `Skill({ skill: 'agent-creator' })` | Agent should own the capability |
| Existing agent needs a capability update | agent update | `Skill({ skill: 'agent-updater' })` | Close agent exists but is incomplete |
| Domain needs code/project scaffolding | template | `Skill({ skill: 'template-creator' })` | Reusable code patterns needed |
| Behavior needs pre/post execution guards | hook | `Skill({ skill: 'hook-creator' })` | Enforcement behavior required |
| Process needs multi-phase orchestration | workflow | `Skill({ skill: 'workflow-creator' })` | Multi-step coordination needed |
| Artifact needs structured I/O validation | schema | `Skill({ skill: 'schema-creator' })` | JSON schema for artifact I/O |
| User interaction needs a slash command | command | `Skill({ skill: 'command-creator' })` | User-facing shortcut needed |
| Repeated logic needs a reusable CLI tool | tool | `Skill({ skill: 'tool-creator' })` | CLI utility needed |
| Narrow/single-artifact capability only | inline | Document within this artifact only | Too specific to generalize |
## Iron Laws
- ALWAYS run the Exa + arXiv research gate before updating any eval harness — updating without current external knowledge produces stale evaluation criteria.
- NEVER remove existing evaluation criteria without replacing them with equivalent or better ones — reducing test coverage in an eval harness is a regression.
- ALWAYS cross-reference the creator ecosystem for gaps before declaring the harness complete — missing companion artifacts (skills, agents, schemas) leave the harness unable to test new capabilities.
- NEVER update eval harness in isolation from the skill/agent it evaluates — harness and artifact must stay synchronized or the harness tests the wrong behavior.
- ALWAYS preserve backward compatibility in eval scoring — changing scoring semantics without migrating historical baselines makes trend analysis impossible.
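The last law, preserving backward compatibility in scoring, usually means shipping a baseline migration alongside any scale change. A hedged sketch, assuming a move from a 0-to-5 integer scale to 0-to-1 floats; the function name and case names are illustrative:

```python
def migrate_baseline(old_scores: dict, old_max: int = 5) -> dict:
    """Rescale historical 0..old_max baselines to a 0..1 scale
    so trend analysis stays comparable across the scoring change."""
    if old_max <= 0:
        raise ValueError("old_max must be positive")
    return {case: round(score / old_max, 4)
            for case, score in old_scores.items()}
```

Checking the migrated baselines into the same commit as the scoring change keeps every historical trend line interpretable on the new scale.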
## Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Updating eval harness without research gate | Criteria based on outdated knowledge; misses recent evaluation methodology advances | Always run Exa + arXiv research before updating any eval criteria |
| Removing test cases to simplify the harness | Silently reduces coverage; regressions pass undetected | Only remove test cases when the behavior they tested has been deliberately removed |
| Harness and artifact in separate PRs | Harness tests wrong behavior the moment artifact changes; immediate test drift | Always update harness and artifact in the same commit |
| Changing scoring scale mid-project | Historical baselines become incomparable; trend analysis breaks | Define scoring scale once; create a migration if it must change |
| Declaring harness complete without companion check | Missing skills or schemas leave evaluation gaps | Always run companion artifact check before marking harness update complete |
## Memory Protocol (MANDATORY)

Before starting:

- Read `.claude/context/memory/learnings.md`

After completing:

- New evaluation pattern → `.claude/context/memory/learnings.md`
- Evaluation gap found → `.claude/context/memory/issues.md`
- Scoring decision made → `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: if it's not in memory, it didn't happen.
Repository: oimiragieo/agent-studio