Harness Engineering for GitHub Copilot

Strategies and patterns for building high-leverage harnesses through Copilot's customization system. A harness is the scaffolding — context layering, constraint enforcement, entropy management, and feedback loops — that makes agents reliably productive.

Related skills:

  • harness-engineering — generic methodology (tool-agnostic).
  • agent-customization — Copilot file formats, syntax, YAML frontmatter, tool aliases, and configuration rules. Defer all "how do I write this file" questions there.

Sources: OpenAI — Harness Engineering | Martin Fowler — Harness Engineering

Use this skill in two modes:

  1. Design mode — decide how to layer context, split roles, and add enforcement.
  2. Audit mode — review an existing Copilot harness with fixed passes, explicit budgets, and consistency checks.

Strategy 1: Context Layering

Context is a scarce resource. The harness must deliver the right context at the right time — not dump everything into every interaction.

The Three-Tier Context Model

| Tier | Copilot primitive | When it loads | Budget | What belongs here |
|------|-------------------|---------------|--------|-------------------|
| Always-on | copilot-instructions.md, path-scoped .instructions.md | Every interaction (auto) | ~100 lines repo-wide; ~50 lines per path scope | Architecture map, boundary rules, pointer references |
| On-demand | Agent skills (SKILL.md) | When Copilot detects relevance | <500 lines body + unlimited references | Deep domain knowledge, schemas, runbooks |
| Task-scoped | Prompt files (.prompt.md), agent prompts | When user explicitly invokes | No hard limit (30K chars for agents) | Step-by-step workflows, checklists, batch operations |

Cross-Check with agent-customization skill

Use the agent-customization decision flow to decide which primitive owns which layer instead of choosing files ad hoc.

| Context layer | Preferred primitive | Why this fits | agent-customization check |
|---------------|---------------------|---------------|---------------------------|
| Always-on | Workspace instructions | Auto-loaded context should only hold stable routing rules and boundary constraints | Keep applyTo narrow; avoid `applyTo: "**"` unless the rule truly applies everywhere |
| Always-on, path-specific | File instructions | Architectural rules should follow folder or module boundaries | One instruction file per boundary, with explicit applyTo globs matching repo structure |
| On-demand | Skills | Deep reference material should load only when the task is relevant | Put discovery phrases in description; keep body lean and offload depth to references/assets |
| Task-scoped | Prompts | Repeatable batch workflows should be explicit, not always resident in context | Use prompts for single focused tasks with parameters |
| Task-scoped with isolation | Custom agents | Multi-stage work or restricted tool use needs isolated context and an enforcement contract | Use an agent when you need delegation, context isolation, or different tool boundaries per stage |

Before adding a new file, ask the agent-customization questions in this order:

  1. Scope: Is this workspace-shared behavior or a personal preference?
  2. Primitive: Is this always-on guidance, on-demand knowledge, a single focused prompt, or a delegated subagent workflow?
  3. Discovery: Will Copilot find it from the description, or are you relying on file names and hope?
  4. Load cost: Does this belong in auto-loaded context, or should it stay dormant until invoked?

If you cannot answer those four questions clearly, the layer boundary is still underspecified.

Context Layering Best Practices

  1. Map, not manual. Always-on instructions should be a table of contents pointing to deeper sources elsewhere in the repo. If your copilot-instructions.md exceeds ~100 lines, you're overloading Tier 1.
  2. One instruction file per architectural boundary. Use applyTo glob patterns that mirror your module/layer structure. When an agent edits src/server/**, it should receive server-layer rules — not frontend rules.
  3. Skills as progressive disclosure. Move detailed reference material (DB schemas, API specs, deployment procedures) into skills with references/ folders. The SKILL.md body stays lean; the agent loads references/schema.md only when actually needed.
  4. Honor the primitive boundary. If the content is multi-step workflow logic, it probably belongs in a prompt or custom agent, not in always-on instructions. If it is durable domain knowledge, it probably belongs in a skill, not a prompt.
  5. Redundancy kills. Never duplicate content across tiers. If a convention is in copilot-instructions.md, don't repeat it in a skill. Use cross-references instead.
  6. Flat pointers, not deep nesting. Every reference should be at most one hop from an entry point. If an agent needs to follow AGENTS.md → docs/index.md → docs/design-docs/index.md → docs/design-docs/auth.md, that's too deep. Flatten to AGENTS.md → docs/design-docs/auth.md.
  7. Treat description as part of the harness. A skill or agent that cannot be discovered by its description is effectively missing from the harness, even if the file exists.
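
The one-hop rule in practice 6 can be checked mechanically. The sketch below is one possible approach, assuming a link graph has already been extracted from the docs into a dictionary. The `links` structure and the helpers `hop_depths` and `too_deep` are hypothetical names, not part of any Copilot tooling.

```python
from collections import deque

def hop_depths(links, entry_point):
    """Breadth-first search: minimum number of hops from entry_point to each doc."""
    depths = {entry_point: 0}
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in depths:
                depths[target] = depths[current] + 1
                queue.append(target)
    return depths

def too_deep(links, entry_point, max_hops=1):
    """Return docs that sit more than max_hops away from the entry point, sorted."""
    depths = hop_depths(links, entry_point)
    return sorted(doc for doc, depth in depths.items() if depth > max_hops)

# The nested chain from the example above (hypothetical link graph).
NESTED = {
    "AGENTS.md": ["docs/index.md"],
    "docs/index.md": ["docs/design-docs/index.md"],
    "docs/design-docs/index.md": ["docs/design-docs/auth.md"],
}
```

With the nested chain, `too_deep(NESTED, "AGENTS.md")` reports both design-docs files. Linking `docs/design-docs/auth.md` directly from `AGENTS.md` removes it from the result.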

Context Freshness Strategy

Stale context is worse than no context — agents confidently follow outdated rules.

  • CI validation: Add a job that checks cross-references between instruction files and actual repo paths. If an instruction references docs/RELIABILITY.md and the file doesn't exist, fail the build.
  • Git-blame-based staleness: Flag instruction files that have gone more than 90 days without an update while the code they govern kept changing. If backend.instructions.md hasn't changed but src/server/ has had 50 commits, something is probably stale.
  • Changelogs as triggers: When a significant architectural change merges, add "update harness" to the PR checklist. Treat instructions as code that must be kept in sync.
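
The CI validation idea can be sketched roughly as follows. This assumes references appear as plain `.md` paths in the instruction text. The regular expression and the `find_missing_references` helper are illustrative only, and a real job would build `existing_paths` from the checked-out tree.

```python
import re

# Assumption: a reference is any token ending in ".md", optionally with folders.
MD_PATH = re.compile(r"[\w.-]+(?:/[\w.-]+)*\.md\b")

def find_missing_references(instruction_text, existing_paths):
    """Return referenced .md paths absent from existing_paths, in first-seen order."""
    missing = []
    for match in MD_PATH.finditer(instruction_text):
        path = match.group(0)
        if path not in existing_paths and path not in missing:
            missing.append(path)
    return missing

# Hypothetical instruction text and repo contents.
TEXT = "Reliability rules: docs/RELIABILITY.md. Architecture: ARCHITECTURE.md."
REPO = {"ARCHITECTURE.md"}
```

Here `find_missing_references(TEXT, REPO)` returns `["docs/RELIABILITY.md"]`. A CI job could exit non-zero whenever the list is non-empty.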

Strategy 2: Multi-Agent Constraint Enforcement

A single omniscient agent is an anti-pattern. Design an agent fleet (agent squad/agent swarm) where each agent has a narrow responsibility and minimal tool access.

The Observer / Actor / Maintainer Triad

| Role | Purpose | Tool profile | Invocation |
|------|---------|--------------|------------|
| Observer | Detect violations, audit quality, report findings | read + search only | On-demand or scheduled |
| Actor | Implement changes within enforced boundaries | read + edit + search + execute | Task-driven |
| Maintainer | Fix drift, update docs, refactor toward golden principles | read + edit + search | Scheduled/periodic |

Designing an Agent Fleet

Step 1: Identify enforcement surfaces. List every invariant you want to enforce (dependency direction, naming, logging, test coverage, doc freshness). Each surface maps to an Observer agent.

Step 2: Define tool boundaries. For each agent, ask: What's the minimum tool set needed? An architecture reviewer needs read + search, never edit. A test generator needs execute to run tests. Over-provisioning tools undermines the harness.

Step 3: Write agent prompts as enforcement contracts. The agent's markdown body is its enforcement contract — a precise specification of what to check, what to flag, and how to remediate. Structure every enforcement agent prompt as:

1. SCOPE: What files/modules to examine
2. RULES: Numbered invariants to check (reference docs/ for details)
3. PROCESS: Step-by-step verification procedure
4. OUTPUT: Exact format for findings (file, line, violation, remediation)
5. BOUNDARIES: What the agent must NOT do

Step 4: Chain agents for complex workflows. Use subagent invocation (agent tool alias) to compose:

  • Planner (read-only) → produces a plan
  • Implementor (full tools) → executes the plan
  • Reviewer (read-only) → validates the result

This mirrors the "Ralph Wiggum Loop" pattern: agents review each other's work in a feedback loop until all reviewers are satisfied.
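
The five-part contract from Step 3 lends itself to a structural check. This is a minimal sketch, assuming each section starts a line as `NAME:` (optionally numbered, as in the template). `missing_contract_sections` is a hypothetical helper, not an existing tool.

```python
REQUIRED_SECTIONS = ("SCOPE", "RULES", "PROCESS", "OUTPUT", "BOUNDARIES")

def missing_contract_sections(prompt_body):
    """Return required section names with no 'NAME:' line in the body, in contract order."""
    found = set()
    for line in prompt_body.splitlines():
        # Drop list numbering ("1. ") and heading markers ("## ") before matching.
        stripped = line.lstrip("#0123456789. \t")
        for name in REQUIRED_SECTIONS:
            if stripped.startswith(name + ":"):
                found.add(name)
    return [name for name in REQUIRED_SECTIONS if name not in found]

# Hypothetical agent body that forgot two sections.
DRAFT = """1. SCOPE: src/server/
2. RULES: see docs/conventions/
4. OUTPUT: file, line, violation, remediation
"""
```

For the draft above, `missing_contract_sections(DRAFT)` returns `["PROCESS", "BOUNDARIES"]`, which an observer could report as a structural finding.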

Least-Privilege Patterns

| Pattern | Strategy | When to use |
|---------|----------|-------------|
| Read-only observer | tools: ["read", "search"] | Auditing, reviewing, scanning |
| Edit-no-execute | tools: ["read", "edit", "search"] | Documentation updates, config changes, refactoring |
| Full actor | tools: ["read", "edit", "search", "execute"] | Implementation requiring test runs or builds |
| Scoped MCP | tools: ["read", "search", "playwright/screenshot"] | Agents that need one specific external capability |
| Subagent-only | tools: ["read", "search", "agent"] | Orchestrators that delegate all work |
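
One way to keep these profiles honest is to compare each agent's declared tools against the profile it claims. The sketch below copies the tool sets from the least-privilege patterns. The profile keys and the `excess_tools` helper are hypothetical names chosen for illustration.

```python
# Tool sets taken from the least-privilege patterns; the keys are illustrative.
PROFILES = {
    "read-only-observer": {"read", "search"},
    "edit-no-execute": {"read", "edit", "search"},
    "full-actor": {"read", "edit", "search", "execute"},
    "subagent-only": {"read", "search", "agent"},
}

def excess_tools(profile, declared_tools):
    """Return declared tools that the named profile does not allow, sorted."""
    allowed = PROFILES[profile]
    return sorted(set(declared_tools) - allowed)
```

For example, `excess_tools("read-only-observer", ["read", "search", "edit"])` returns `["edit"]`, flagging an observer that was quietly given write access.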

Runtime-Specific Tool Assignment

Use runtime to describe the Copilot execution environment of the agent file: VS Code, GitHub Copilot coding agent, CLI-backed coding flows, background agents, cloud agents, or an SDK-hosted flow that adopts one of those schemas. Use platform for OS scope such as Windows, WSL/Linux, or macOS.

When the same specialist role needs different frontmatter, tool namespaces, or delegation mechanics across runtimes, create an agent variant: a runtime-specific .agent.md wrapper over shared instructions.

Official Runtime Facts to Design Around

| Fact | Design implication |
|------|--------------------|
| target is officially vscode or github-copilot | Scope each variant to the runtime family it is meant for instead of assuming one file behaves identically everywhere |
| tools defaults to all tools when omitted; tools: [] disables all tools | Omit tools only when you genuinely want broad capability; otherwise whitelist aggressively |
| Unrecognized tool names are ignored | A mixed-runtime tool list can fail silently; audit tool names per runtime rather than trusting parse success |
| disable-model-invocation is the canonical control and infer is retired | Use the new field when deciding whether a variant is user-selectable, subagent-only, or both |
| VS Code exposes agents, argument-hint, handoffs, and richer tool discovery | Keep VS Code orchestration and guided transitions in the VS Code variant instead of leaking them into shared instructions |
| GitHub Copilot runtime supports mcp-servers and namespaced MCP tools | Put runtime-specific MCP wiring in the github-copilot variant, not in the shared behavior contract |

Agent Variant Pattern

Split each specialized agent into two layers:

  1. Shared instruction layer: persona, rules, workflow, artifacts, output contract, and tool-agnostic wording.
  2. Runtime variant layer: target, tools, agents, handoffs, mcp-servers, and any runtime-specific tool references.

The shared layer should say what capability is required, not which runtime-specific tool name to call. Write "read the file", "run tests", or "delegate to the reviewer" rather than embedding a runtime-specific token.

Tool Assignment by Runtime Family

Use the official target values in the variant frontmatter: vscode or github-copilot.

| Runtime family | Frontmatter focus | Tool assignment rule | Delegation model |
|----------------|-------------------|----------------------|------------------|
| VS Code | target: vscode, optional agents, argument-hint, handoffs, model | Use VS Code tool names, toolsets, extension tools, or MCP namespaced tools. If agents is specified, include the agent tool. | Explicit subagent wiring via agents |
| GitHub Copilot | target: github-copilot, optional mcp-servers, disable-model-invocation, user-invocable | Prefer official aliases such as read, edit, search, execute, agent, plus namespaced MCP tools like `playwright/*` or `github/*` | Delegation through model invocation and custom-agent/task tooling |
| Background or cloud agent flow | Usually inherits the VS Code custom-agent model | Reuse a VS Code-oriented variant unless the host removes or overrides tools | Same as host runtime |
| SDK-hosted flow | Treat as host-defined until proven otherwise | Do not assume .agent.md fields or tool names map 1:1; align the variant to the runtime actually consuming it | Host-dependent |

Assignment Heuristics

  1. Start from the minimum capability set needed for the role.
  2. Assign tools per runtime, not per persona. The reviewer persona may be read-only in every runtime, but the concrete tool names still differ.
  3. Put MCP server selection in the variant that can actually load it.
  4. Keep body text free of runtime-specific tool names unless the file is explicitly runtime-locked.
  5. Re-measure orchestration after tool changes. A variant that adds subagent access, shell execution, or MCP tools has changed its effective safety boundary.

Failure Modes to Audit

  • A VS Code-only tool name in a github-copilot variant silently disappears.
  • A shared instruction body mentions a tool token that only exists in one runtime.
  • A runtime variant inherits an all-tools default because tools was omitted unintentionally.
  • A subagent-only variant still allows model invocation because disable-model-invocation was not set.
  • A team treats Windows, Linux, or macOS as the runtime boundary when the real difference is the Copilot host.
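
Two of these failure modes (silently ignored tool names and an accidental all-tools default) can be caught from parsed frontmatter alone. The sketch below assumes the frontmatter is already loaded into a dictionary and that the caller supplies the tool names valid for the target runtime. `audit_tools`, the alias set, and the name `vscode-only-tool` are hypothetical.

```python
# Aliases named in this document for the github-copilot target; extend per runtime.
GITHUB_COPILOT_ALIASES = {"read", "edit", "search", "execute", "agent"}

def audit_tools(frontmatter, known_tools):
    """Return human-readable findings for one variant's parsed frontmatter dict."""
    if "tools" not in frontmatter:
        return ["tools omitted: variant inherits the all-tools default"]
    findings = []
    for tool in frontmatter["tools"]:
        # Assumption: names containing "/" are namespaced MCP tools, audited separately.
        if "/" in tool:
            continue
        if tool not in known_tools:
            findings.append(f"unrecognized tool {tool!r} would be silently ignored")
    return findings

# Hypothetical variant mixing a tool name from another runtime into the list.
VARIANT = {
    "target": "github-copilot",
    "tools": ["read", "vscode-only-tool", "playwright/screenshot"],
}
```

Running `audit_tools(VARIANT, GITHUB_COPILOT_ALIASES)` surfaces the stray name before the runtime drops it without warning. An explicit `tools: []` yields no findings because it is a deliberate choice.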

Prompt Engineering for Enforcement

Agent prompts in a harness serve a different purpose than general-purpose prompts. They are mechanical contracts, not creative guidance.

  • Be exhaustive about rules, terse about explanation. Don't explain why a rule exists in the agent prompt — put that in docs. The prompt should list exactly what to check.
  • Include remediation in the output format. When an observer reports a violation, it should include the fix instruction. This becomes context for the actor agent that fixes it.
  • Avoid open-ended judgment calls. "Review code quality" is too vague. "Check that every public function in src/server/ has a corresponding test in tests/server/" is enforceable.
  • Reference, don't inline. Agent prompts should point to ARCHITECTURE.md or docs/conventions/ for rule details rather than reproducing them. This prevents the prompt from going stale independently.

Strategy 3: Operational Audit Protocols

Harnesses usually fail in the operational details, not in the high-level design. Convert repo-specific review checklists into a reusable audit protocol with explicit passes and measurable outputs.

The Five Audit Passes

| Pass | Question | Typical checks |
|------|----------|----------------|
| Structural contract | Does each instruction or agent file contain the sections the workflow depends on? | Required sections present, numbered rules complete, one shared protocol block instead of duplicates, output contracts match referenced artifacts |
| Budget discipline | Is every prompt body below the runtime limit with safety margin? | Soft ceiling below hard max, frontmatter excluded from measurement, embedded agent body re-measured after every instruction change |
| Consistency | Do names, signals, paths, and schemas line up across files? | Exact agent names, exact delegation targets, one signal vocabulary, one path convention, no deprecated frontmatter keys |
| Ownership | Does every writable artifact have exactly one owner? | Planner-only artifacts stay planner-owned, shared progress artifacts are append-only, reviewers observe rather than overwrite |
| Distribution | Will the runtime load this deterministically? | Build-time embedding or bundling, no runtime instruction reads for CLI-only bodies, official plugin schema only |

Audit Output Contract

Every audit pass should report findings in a fixed shape:

| Field | Meaning |
|-------|---------|
| Artifact | File, folder, or workflow surface being checked |
| Invariant | Exact rule that failed |
| Evidence | Concrete mismatch: missing section, wrong name, over-budget body, conflicting owner |
| Remediation | Smallest change that restores consistency |
| Re-run trigger | What future edit should force this pass to run again |
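
A fixed shape is easiest to enforce when it is a type. The sketch below models the five contract fields as a frozen dataclass. The class name `Finding`, the `render` method, and the sample values are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    artifact: str
    invariant: str
    evidence: str
    remediation: str
    rerun_trigger: str

    def render(self):
        """Render the finding as one 'Label: value' line per contract field."""
        rows = [
            ("Artifact", self.artifact),
            ("Invariant", self.invariant),
            ("Evidence", self.evidence),
            ("Remediation", self.remediation),
            ("Re-run trigger", self.rerun_trigger),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows)

# Hypothetical finding from a budget-discipline pass.
EXAMPLE = Finding(
    artifact="agents/reviewer.agent.md",
    invariant="Prompt body stays under the soft ceiling",
    evidence="Body exceeds the configured soft limit",
    remediation="Move the inline rule explanations to docs/",
    rerun_trigger="Any edit to the agent body",
)
```

Because every field is required, an audit pass cannot emit a finding that omits the remediation or the re-run trigger.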

Compression Is a Harness Capability

Prompt compression is not cosmetic. It preserves budget for task context and reduces drift.

Always remove:

  • Motivational preambles before the real contract
  • Inline explanations of why a rule exists
  • Duplicated protocol blocks repeated per mode
  • Example output blocks that only restate an existing schema
  • Verbose decision trees better expressed as flat steps

Always preserve:

  • All numbered rules in the enforcement contract
  • Every supported mode name and its complete step list
  • Exact path patterns and naming conventions
  • Contract field names, required/optional status, and signal schema
  • Preflight gates and failure-mode handling

What to Measure

For every auditable harness, keep a small measurement set close at hand:

  1. Prompt body length after frontmatter stripping.
  2. Combined wrapper-plus-embedded body length for distributed agents.
  3. Count of active signal types and whether all files use the same schema.
  4. Artifact ownership table: one writer per artifact, explicit shared append-only logs.
  5. Revalidation triggers: which edits force which audit passes.
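
The first measurement, body length after frontmatter stripping, can be sketched as below. The 30,000-character hard limit is the agent figure quoted earlier in this document. The 90% soft ceiling and the helper names are assumptions to adjust for your runtime.

```python
def strip_frontmatter(text):
    """Remove a leading YAML frontmatter block delimited by '---' lines, if present."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[index + 1:])
    return text  # No closing delimiter: treat the whole file as body.

def budget_report(text, hard_limit=30_000, soft_percent=90):
    """Measure the body and compare it with the hard limit and a soft ceiling."""
    body_chars = len(strip_frontmatter(text))
    soft_limit = hard_limit * soft_percent // 100
    return {
        "body_chars": body_chars,
        "soft_limit": soft_limit,
        "over_soft": body_chars > soft_limit,
        "over_hard": body_chars > hard_limit,
    }

# Hypothetical agent file: only the 4-character body counts toward the budget.
SAMPLE = "---\ntarget: vscode\n---\nbody"
```

For distributed agents, the same function can be applied to the concatenated wrapper and embedded body to obtain the second measurement.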

Strategy 4: Entropy Management

Every agent-generated line of code can introduce drift. Entropy management is the discipline of detecting and correcting drift before it compounds.

Golden Principles Framework

Define a small set (5–10) of non-negotiable, mechanically verifiable rules:

| # | Principle | Verification method |
|---|-----------|---------------------|
| 1 | Shared utilities over hand-rolled helpers | Lint: flag duplicate utility patterns |
| 2 | Parse data at boundaries, never YOLO-probe | Lint: detect untyped API calls |
| 3 | Structured logging with correlation IDs | Lint: flag console.log / raw print |
| 4 | One module = one domain, no cross-domain imports | Structural test: import graph validation |
| 5 | Every public API has a test | Coverage check: map exports → test files |

Key insight: Golden principles must be verifiable by linters, structural tests, or agents — not just documented. If you can't automate the check, it's a guideline, not a golden principle.
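
As an illustration of a mechanically verifiable principle, here is a minimal sketch of the import graph validation behind principle 4. It assumes the import graph is already available as a dictionary and that the first path segment names the domain. The `shared` exemption and the helper name are hypothetical conventions.

```python
def cross_domain_imports(import_graph, shared_domains=frozenset({"shared"})):
    """Return (module, imported) pairs that cross a domain boundary, sorted."""
    violations = []
    for module in sorted(import_graph):
        source_domain = module.split("/")[0]
        for target in sorted(import_graph[module]):
            target_domain = target.split("/")[0]
            # Imports within a domain or from a shared domain are allowed.
            if target_domain != source_domain and target_domain not in shared_domains:
                violations.append((module, target))
    return violations

# Hypothetical import graph: billing reaches into auth.
GRAPH = {
    "billing/invoice": ["auth/session", "shared/types", "billing/tax"],
    "auth/session": ["shared/types"],
}
```

A structural test can assert that `cross_domain_imports(GRAPH)` is empty and print each offending pair together with its remediation when it is not.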

Garbage Collection Cadence

| Frequency | What to check | Agent type |
|-----------|---------------|------------|
| Per-commit (CI) | Lint rules, structural tests, doc cross-references | Deterministic (linters/tests) |
| Daily | Doc freshness, quality score drift, stale TODOs | Maintainer agent |
| Weekly | Golden principle deviations, pattern duplication, tech debt inventory | Maintainer agent, batch prompt |
| Per-sprint | Full quality scoring across all domains/layers | Observer agent + human review |

Quality Scoring Pattern

Maintain a versioned QUALITY_SCORE.md that grades each domain and layer:

| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth   | A     | A      | B       | B     | C    | B       |
| Billing| A     | B      | B       | C     | D    | C+      |
| Search | B     | B      | C       | D     | F    | D+      |

This gives both humans and agents a map of where debt lives. A maintainer agent can read this and prioritize: "Search.Docs is F — generate missing documentation for the Search domain."
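
The prioritization step can be approximated with a small parser over the table. This sketch assumes the exact pipe-table layout shown above and letter grades from A to F. The `debt_hotspots` helper and the default threshold of D are illustrative choices.

```python
SAMPLE = """
| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth   | A     | A      | B       | B     | C    | B       |
| Billing| A     | B      | B       | C     | D    | C+      |
| Search | B     | B      | C       | D     | F    | D+      |
"""

def parse_quality_table(markdown):
    """Parse a pipe table into one dict per data row, keyed by header cell."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]

def debt_hotspots(markdown, threshold="D"):
    """Return (domain, layer, grade) for every layer graded at or below threshold."""
    letters = "ABCDF"  # Best to worst; "+" suffixes are ignored for ranking.
    hotspots = []
    for row in parse_quality_table(markdown):
        for layer, grade in row.items():
            if layer in ("Domain", "Overall"):
                continue
            if letters.index(grade[0]) >= letters.index(threshold):
                hotspots.append((row["Domain"], layer, grade))
    return hotspots
```

On the sample table this yields Billing/Docs, Search/Tests, and Search/Docs, which a maintainer agent could take as its work queue.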

Entropy Detection Strategies

  1. Pattern divergence scan. Compare how a pattern is implemented across modules. If 8 out of 10 services use structured logging but 2 use console.log, flag the outliers.
  2. Doc/code freshness ratio. If src/billing/ changed 30 times this month but docs/billing.md changed 0 times, the docs are likely stale.
  3. Lint error message engineering. Write custom lint error messages that include remediation instructions. When a linter reports "Import from auth/ in billing/ violates boundary — move shared types to shared/types/", the agent can act on it directly. The error message is the agent's instruction.
  4. Snapshot diffing. Periodically generate a snapshot of the repo structure (file tree, import graph, export map) and diff against the previous snapshot. Large deltas without corresponding doc updates signal drift.
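
The pattern divergence scan (strategy 1) reduces to finding the dominant pattern and listing the modules that deviate from it. This is a rough sketch, assuming some earlier step has already classified each module. The 70% majority threshold and the `flag_outliers` name are assumptions.

```python
from collections import Counter

def flag_outliers(pattern_by_module, min_share=0.7):
    """Return (dominant_pattern, sorted outlier modules), or (None, []) if no clear majority."""
    if not pattern_by_module:
        return None, []
    counts = Counter(pattern_by_module.values())
    dominant, dominant_count = counts.most_common(1)[0]
    if dominant_count / len(pattern_by_module) < min_share:
        return None, []  # No established convention to diverge from.
    outliers = sorted(m for m, p in pattern_by_module.items() if p != dominant)
    return dominant, outliers

# Hypothetical classification: 8 of 10 services use structured logging.
SERVICES = {f"svc{i}": "structured" for i in range(8)}
SERVICES.update({"svc8": "console.log", "svc9": "console.log"})
```

`flag_outliers(SERVICES)` returns `("structured", ["svc8", "svc9"])`. When no pattern reaches the threshold, the function returns nothing to flag, since there is no convention yet.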

Strategy 5: Application Legibility

Agents can only verify what they can observe. Extending the agent's senses beyond static files dramatically increases harness leverage.

Legibility Layers

| Layer | What the agent can see | Copilot mechanism | Harness value |
|-------|------------------------|-------------------|---------------|
| Static files | Source code, docs, config | Built-in (read/search) | Baseline, always available |
| Build/test output | Compilation errors, test results, lint output | execute tool | Validates correctness |
| Browser state | DOM, screenshots, navigation, console errors | MCP (Playwright) | UI verification without manual testing |
| Runtime telemetry | Logs, metrics, traces | Custom MCP server | Performance and reliability validation |
| Repository state | Issues, PRs, CI status, branch state | MCP (GitHub) | Workflow-aware decisions |

Legibility Best Practices

  1. Make the app bootable per task. If an agent can't start the application in isolation, it can't validate its own work. Design for single-command startup (Docker Compose, Aspire, etc.).
  2. Wire browser automation for UI verification. An agent that edits a Blazor component should be able to screenshot the result. This closes the feedback loop without human eyes.
  3. Expose observability to the agent. If you have logs and metrics, make them queryable. An agent instruction like "ensure no error logs during startup" is only enforceable if the agent can read the logs.
  4. Ephemeral environments. Each agent task should run against an isolated instance. Shared environments introduce cross-task interference that agents can't reason about.

Strategy 6: Harness Evolution

A harness is not a one-time setup. It evolves with the codebase.

Maturity Model

| Level | Context | Constraints | Entropy | Legibility |
|-------|---------|-------------|---------|------------|
| 1: Ad-hoc | No instructions | No enforcement | No cleanup | Static files only |
| 2: Documented | copilot-instructions.md exists | Conventions documented but not enforced | Manual cleanup | Build/test output |
| 3: Scoped | Path-scoped instructions + skills | Linters catch some violations | Weekly batch prompts | Browser automation |
| 4: Enforced | Full three-tier context | Structural tests + CI gates block violations | Daily maintainer agents | Runtime telemetry |
| 5: Self-healing | Agent-maintained instructions | Agents detect and fix violations autonomously | Continuous GC with quality scoring | Full stack legibility |

Evolution Workflow

  1. Assess. Use the harness assessment from the harness-engineering skill. Rate each component 0–2.
  2. Target one level up. Don't jump from Level 1 to Level 5. Move one level at a time per component.
  3. Instrument before enforcing. Before adding a linter, add an observer agent to measure the current state. Quantify violations before blocking them.
  4. Encode human taste continuously. Every time a human reviews agent output and says "that's not how we do it here," capture the rule as a lint rule, instruction update, or golden principle. Never make the same correction twice.
  5. Retroactively harness brownfield code. Start with context engineering (cheapest, highest ROI), add constraints per-module as you touch them, and add entropy management last.

Revalidation Triggers

Tie harness maintenance to concrete change events instead of vague periodic reviews.

| Change | Re-run |
|--------|--------|
| Instruction body changed | Structural contract, budget discipline, consistency |
| Agent frontmatter or delegation changed | Consistency, distribution |
| Signal schema changed | Consistency across every agent and instruction |
| Workflow paths changed | Structural contract, ownership, consistency |
| Plugin or bundle pipeline changed | Distribution and budget discipline |

If a harness cannot tell you which checks to rerun after a change, it is still too implicit.
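
Making the mapping explicit can be as simple as a lookup table. The sketch below encodes the revalidation triggers listed in this section. The change keys and the `passes_to_rerun` helper are hypothetical names, and classifying a diff into change types is left out.

```python
# Audit passes in the order they are defined in Strategy 3.
PASS_ORDER = [
    "structural contract",
    "budget discipline",
    "consistency",
    "ownership",
    "distribution",
]

# Change type -> passes to re-run; keys are illustrative identifiers.
TRIGGERS = {
    "instruction-body": ["structural contract", "budget discipline", "consistency"],
    "agent-frontmatter-or-delegation": ["consistency", "distribution"],
    "signal-schema": ["consistency"],
    "workflow-paths": ["structural contract", "ownership", "consistency"],
    "plugin-or-bundle-pipeline": ["distribution", "budget discipline"],
}

def passes_to_rerun(changes):
    """Return the union of passes triggered by the given changes, in pass order."""
    needed = {audit_pass for change in changes for audit_pass in TRIGGERS[change]}
    return [audit_pass for audit_pass in PASS_ORDER if audit_pass in needed]
```

A pull request that touches both the signal schema and the bundle pipeline would get `passes_to_rerun(["signal-schema", "plugin-or-bundle-pipeline"])`, that is, budget discipline, consistency, and distribution.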

Scaling Across Repositories

For organizations with multiple repos:

  • Harness template repos. Create starter repo templates with pre-configured instruction files, standard agent fleet, and CI validation jobs. This parallels "golden path" service templates, optimized for agent-driven development.
  • Org-level agents. Define organization-wide enforcement agents (via .github-private repo) that apply everywhere. Keep repo-level agents for domain-specific concerns.
  • Shared skills as packages. Extract common skills (architecture validation, deployment runbooks) into a shared repo and publish to personal skill directories for cross-repo reuse.
  • Federated golden principles. Maintain a core set of golden principles organization-wide, with repo-specific extensions. Enforce the core set from org-level; let repos own their extensions.

Anti-Patterns

| Anti-pattern | Why it fails | Better strategy |
|--------------|--------------|-----------------|
| Monolithic instructions | Crowds out task context at 1000+ lines; rots fast | Three-tier context layering with 100-line Tier 1 |
| One omniscient agent | No tool boundaries; can't reason about everything at once | Agent fleet with Observer / Actor / Maintainer roles |
| Duplicate context | Same rule in instructions, skills, and agent prompts diverges over time | Single source of truth with cross-references |
| Verbal-only enforcement | "We always do X" isn't legible to agents | Encode in linters, tests, or observer agents |
| Big-bang cleanup | 20% of the week on "AI slop" doesn't scale | Continuous garbage collection at daily cadence |
| Static-only legibility | Agent can't verify its own UI changes or performance | Wire browser automation and observability MCP |
| Set-and-forget harness | Codebase evolves, harness doesn't | Treat instructions as code; CI-check freshness |
| Open-ended agent contracts | "Review code quality" produces inconsistent results | Precise enforcement contracts with numbered rules |
| Checklist-free maintenance | Review quality depends on memory and reviewer taste | Encode reusable audit passes with fixed outputs and triggers |