harness-engineering-copilot

Harness Engineering for GitHub Copilot

Strategies and patterns for building high-leverage harnesses through Copilot's customization system. A harness is the scaffolding — context layering, constraint enforcement, entropy management, and feedback loops — that makes agents reliably productive.

Related skills:

harness-engineering — generic methodology (tool-agnostic).
agent-customization — Copilot file formats, syntax, YAML frontmatter, tool aliases, and configuration rules. Defer all "how do I write this file" questions there.

Sources: OpenAI — Harness Engineering | Martin Fowler — Harness Engineering

Use this skill in two modes:

Design mode — decide how to layer context, split roles, and add enforcement.
Audit mode — review an existing Copilot harness with fixed passes, explicit budgets, and consistency checks.

Strategy 1: Context Layering

Context is a scarce resource. The harness must deliver the right context at the right time — not dump everything into every interaction.

The Three-Tier Context Model

Tier	Copilot primitive	When it loads	Budget	What belongs here
Always-on	`copilot-instructions.md`, path-scoped `.instructions.md`	Automatically: every interaction for repo-wide instructions; whenever `applyTo` matches for path-scoped files	~100 lines repo-wide; ~50 lines per path scope	Architecture map, boundary rules, pointer references
On-demand	Agent skills (`SKILL.md`)	When Copilot detects relevance	<500 lines body + unlimited references	Deep domain knowledge, schemas, runbooks
Task-scoped	Prompt files (`.prompt.md`), agent prompts	When user explicitly invokes	No hard limit (30K chars for agents)	Step-by-step workflows, checklists, batch operations
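
The line budgets in the table can be checked mechanically. The following is a minimal sketch, assuming the budgets are treated as soft ceilings and that the whole file (frontmatter included) is counted; the function names are hypothetical and not part of any Copilot tooling.

```python
from typing import Optional

# Line budgets taken from the tier table; treat them as soft ceilings.
TIER_BUDGETS = {
    "repo_instructions": 100,  # copilot-instructions.md
    "path_instructions": 50,   # *.instructions.md
    "skill_body": 500,         # SKILL.md
}


def classify(filename: str) -> Optional[str]:
    """Map a file name to a budget tier, or None if no budget applies."""
    if filename == "copilot-instructions.md":
        return "repo_instructions"
    if filename.endswith(".instructions.md"):
        return "path_instructions"
    if filename == "SKILL.md":
        return "skill_body"
    return None


def over_budget(filename: str, text: str) -> int:
    """Return how many lines the file exceeds its budget by (0 if within budget)."""
    tier = classify(filename)
    if tier is None:
        return 0
    # Simplification: counts every line, including any frontmatter.
    return max(0, len(text.splitlines()) - TIER_BUDGETS[tier])
```

A CI job could call `over_budget` for each harness file and fail when any result is non-zero.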

Cross-Check with `agent-customization` skill

Use the agent-customization decision flow to decide which primitive owns which layer instead of choosing files ad hoc.

Context layer	Preferred primitive	Why this fits	`agent-customization` check
Always-on	Workspace instructions	Auto-loaded context should only hold stable routing rules and boundary constraints	Keep `applyTo` narrow; avoid `applyTo: "**"` unless the rule truly applies everywhere
Always-on, path-specific	File instructions	Architectural rules should follow folder or module boundaries	One instruction file per boundary, with explicit `applyTo` globs matching repo structure
On-demand	Skills	Deep reference material should load only when the task is relevant	Put discovery phrases in `description`; keep body lean and offload depth to references/assets
Task-scoped	Prompts	Repeatable batch workflows should be explicit, not always resident in context	Use prompts for single focused tasks with parameters
Task-scoped with isolation	Custom agents	Multi-stage work or restricted tool use needs isolated context and an enforcement contract	Use an agent when you need delegation, context isolation, or different tool boundaries per stage

Before adding a new file, ask the agent-customization questions in this order:

Scope: Is this workspace-shared behavior or a personal preference?
Primitive: Is this always-on guidance, on-demand knowledge, a single focused prompt, or a delegated subagent workflow?
Discovery: Will Copilot find it from the description, or are you relying on file names and hope?
Load cost: Does this belong in auto-loaded context, or should it stay dormant until invoked?

If you cannot answer those four questions clearly, the layer boundary is still underspecified.

Context Layering Best Practices

Map, not manual. Always-on instructions should be a table of contents pointing to deeper sources elsewhere in the repo. If your copilot-instructions.md exceeds ~100 lines, you're overloading Tier 1.
One instruction file per architectural boundary. Use applyTo glob patterns that mirror your module/layer structure. When an agent edits src/server/**, it should receive server-layer rules — not frontend rules.
Skills as progressive disclosure. Move detailed reference material (DB schemas, API specs, deployment procedures) into skills with references/ folders. The SKILL.md body stays lean; the agent loads references/schema.md only when actually needed.
Honor the primitive boundary. If the content is multi-step workflow logic, it probably belongs in a prompt or custom agent, not in always-on instructions. If it is durable domain knowledge, it probably belongs in a skill, not a prompt.
Redundancy kills. Never duplicate content across tiers. If a convention is in copilot-instructions.md, don't repeat it in a skill. Use cross-references instead.
Pointer chains, not deep nesting. Every reference should be at most one hop from an entry point. If an agent needs to follow AGENTS.md → docs/index.md → docs/design-docs/index.md → docs/design-docs/auth.md, that's too deep. Flatten to AGENTS.md → docs/design-docs/auth.md.
Treat description as part of the harness. A skill or agent that cannot be discovered by its description is effectively missing from the harness, even if the file exists.
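
The one-hop rule for pointer chains can be verified with a breadth-first walk over the documentation link graph. This is a sketch under the assumption that links between files have already been extracted into a dictionary; `hop_depths` and `too_deep` are hypothetical helper names.

```python
from collections import deque


def hop_depths(links: dict, entry: str) -> dict:
    """Breadth-first distance (in hops) from the entry point to each reachable doc."""
    depths = {entry: 0}
    queue = deque([entry])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in depths:
                depths[target] = depths[current] + 1
                queue.append(target)
    return depths


def too_deep(links: dict, entry: str, max_hops: int = 1) -> list:
    """Docs that sit more than max_hops away from the entry point."""
    depths = hop_depths(links, entry)
    return sorted(doc for doc, depth in depths.items() if depth > max_hops)


# The nested chain from the text above.
nested = {
    "AGENTS.md": ["docs/index.md"],
    "docs/index.md": ["docs/design-docs/index.md"],
    "docs/design-docs/index.md": ["docs/design-docs/auth.md"],
}
# The flattened version recommended above.
flat = {"AGENTS.md": ["docs/design-docs/auth.md"]}
```

With the nested chain, `too_deep(nested, "AGENTS.md")` reports both design-doc files; with the flattened chain it reports nothing.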

Context Freshness Strategy

Stale context is worse than no context — agents confidently follow outdated rules.

CI validation: Add a job that checks cross-references between instruction files and actual repo paths. If an instruction references docs/RELIABILITY.md and the file doesn't exist, fail the build.
Git-blame-based staleness: Flag instruction files that have gone more than 90 days without an update while the code they govern keeps changing. If backend.instructions.md hasn't changed but src/server/ has had 50 commits, something is probably stale.
Changelogs as triggers: When a significant architectural change merges, add "update harness" to the PR checklist. Treat instructions as code that must be kept in sync.
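
The CI validation idea can be sketched as a small cross-reference check. The path regex below is an assumption: it only catches repo-relative paths that contain at least one slash and a file extension. In a real job, `existing_paths` could be built from the output of `git ls-files`.

```python
import re

# Assumed pattern: one or more directory segments followed by a file name with an extension.
PATH_PATTERN = re.compile(r"\b(?:[\w.-]+/)+[\w.-]+\.\w+\b")


def missing_references(instruction_text: str, existing_paths: set) -> list:
    """Return referenced paths that do not exist in the repository."""
    referenced = set(PATH_PATTERN.findall(instruction_text))
    return sorted(referenced - existing_paths)


text = "See docs/RELIABILITY.md and docs/conventions/logging.md for details."
repo_files = {"docs/RELIABILITY.md"}
```

Here `missing_references(text, repo_files)` flags `docs/conventions/logging.md`; a CI step would fail the build whenever the returned list is non-empty.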

Strategy 2: Multi-Agent Constraint Enforcement

A single omniscient agent is an anti-pattern. Design an agent fleet (also called an agent squad or agent swarm) where each agent has a narrow responsibility and minimal tool access.

The Observer / Actor / Maintainer Triad

Role	Purpose	Tool profile	Invocation
Observer	Detect violations, audit quality, report findings	`read` + `search` only	On-demand or scheduled
Actor	Implement changes within enforced boundaries	`read` + `edit` + `search` + `execute`	Task-driven
Maintainer	Fix drift, update docs, refactor toward golden principles	`read` + `edit` + `search`	Scheduled/periodic

Designing an Agent Fleet

Step 1: Identify enforcement surfaces. List every invariant you want to enforce (dependency direction, naming, logging, test coverage, doc freshness). Each surface maps to an Observer agent.

Step 2: Define tool boundaries. For each agent, ask: What's the minimum tool set needed? An architecture reviewer needs read + search, never edit. A test generator needs execute to run tests. Over-provisioning tools undermines the harness.

Step 3: Write agent prompts as enforcement contracts. The agent's markdown body is its enforcement contract — a precise specification of what to check, what to flag, and how to remediate. Structure every enforcement agent prompt as:

1. SCOPE: What files/modules to examine
2. RULES: Numbered invariants to check (reference docs/ for details)
3. PROCESS: Step-by-step verification procedure
4. OUTPUT: Exact format for findings (file, line, violation, remediation)
5. BOUNDARIES: What the agent must NOT do
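
Because the contract has a fixed shape, a structural check can confirm that an agent prompt contains all five sections. This sketch assumes each section starts a line as `NAME:`, optionally preceded by a list number or heading marker; that layout is an assumption, so adjust the pattern to match your prompts.

```python
import re

REQUIRED_SECTIONS = ["SCOPE", "RULES", "PROCESS", "OUTPUT", "BOUNDARIES"]

# Matches "SCOPE:", "1. SCOPE:", or "## SCOPE:" at the start of a line.
SECTION_PATTERN = re.compile(r"^\s*(?:\d+\.\s*|#+\s*)?([A-Z]+)\s*:", re.MULTILINE)


def missing_sections(prompt_body: str) -> list:
    """Return the required section names that the prompt body does not declare."""
    found = set(SECTION_PATTERN.findall(prompt_body))
    return [name for name in REQUIRED_SECTIONS if name not in found]


partial = (
    "1. SCOPE: src/server/**\n"
    "2. RULES: see docs/conventions/\n"
    "3. PROCESS: scan imports, then compare against rules\n"
)
```

For the partial prompt above, `missing_sections(partial)` returns the two sections that still need to be written.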

Step 4: Chain agents for complex workflows. Use subagent invocation (the `agent` tool alias) to compose:

Planner (read-only) → produces a plan
Implementor (full tools) → executes the plan
Reviewer (read-only) → validates the result

This mirrors the "Ralph Wiggum Loop" pattern: agents review each other's work in a feedback loop until all reviewers are satisfied.

Least-Privilege Patterns

Pattern	Strategy	When to use
Read-only observer	`tools: ["read", "search"]`	Auditing, reviewing, scanning
Edit-no-execute	`tools: ["read", "edit", "search"]`	Documentation updates, config changes, refactoring
Full actor	`tools: ["read", "edit", "search", "execute"]`	Implementation requiring test runs or builds
Scoped MCP	`tools: ["read", "search", "playwright/screenshot"]`	Agents that need one specific external capability
Subagent-only	`tools: ["read", "search", "agent"]`	Orchestrators that delegate all work
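
Over-provisioning can be caught by comparing an agent's declared tools with the profile for its role. The sketch below uses the Observer, Actor, and Maintainer profiles from the triad table; the role keys and the `excess_tools` helper are hypothetical.

```python
# Tool profiles copied from the Observer / Actor / Maintainer table.
ROLE_PROFILES = {
    "observer": {"read", "search"},
    "maintainer": {"read", "edit", "search"},
    "actor": {"read", "edit", "search", "execute"},
}


def excess_tools(role: str, tools: list) -> list:
    """Return tools that go beyond what the role's profile allows."""
    allowed = ROLE_PROFILES[role]
    return sorted(set(tools) - allowed)
```

A scoped MCP tool such as `playwright/screenshot` would also be reported here, which is intentional: every addition beyond the base profile should be a deliberate, reviewed exception.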

Runtime-Specific Tool Assignment

Use runtime to describe the Copilot execution environment of the agent file: VS Code, GitHub Copilot coding agent, CLI-backed coding flows, background agents, cloud agents, or an SDK-hosted flow that adopts one of those schemas. Use platform for OS scope such as Windows, WSL/Linux, or macOS.

When the same specialist role needs different frontmatter, tool namespaces, or delegation mechanics across runtimes, create an agent variant: a runtime-specific .agent.md wrapper over shared instructions.

Official Runtime Facts to Design Around

Fact	Design implication
`target` is officially `vscode` or `github-copilot`	Scope each variant to the runtime family it is meant for instead of assuming one file behaves identically everywhere
`tools` defaults to all tools when omitted; `tools: []` disables all tools	Omit `tools` only when you genuinely want broad capability; otherwise whitelist aggressively
Unrecognized tool names are ignored	A mixed-runtime tool list can fail silently; audit tool names per runtime rather than trusting parse success
`disable-model-invocation` is the canonical control and `infer` is retired	Use the new field when deciding whether a variant is user-selectable, subagent-only, or both
VS Code exposes `agents`, `argument-hint`, `handoffs`, and richer tool discovery	Keep VS Code orchestration and guided transitions in the VS Code variant instead of leaking them into shared instructions
GitHub Copilot runtime supports `mcp-servers` and namespaced MCP tools	Put runtime-specific MCP wiring in the `github-copilot` variant, not in the shared behavior contract

Agent Variant Pattern

Split each specialized agent into two layers:

Shared instruction layer: persona, rules, workflow, artifacts, output contract, and tool-agnostic wording.
Runtime variant layer: target, tools, agents, handoffs, mcp-servers, and any runtime-specific tool references.

The shared layer should say what capability is required, not which runtime-specific tool name to call. Write "read the file", "run tests", or "delegate to the reviewer" rather than embedding a runtime-specific token.

Tool Assignment by Runtime Family

Use the official target values in the variant frontmatter: vscode or github-copilot.

Runtime family	Frontmatter focus	Tool assignment rule	Delegation model
VS Code	`target: vscode`, optional `agents`, `argument-hint`, `handoffs`, `model`	Use VS Code tool names, toolsets, extension tools, or MCP namespaced tools. If `agents` is specified, include the `agent` tool.	Explicit subagent wiring via `agents`
GitHub Copilot	`target: github-copilot`, optional `mcp-servers`, `disable-model-invocation`, `user-invocable`	Prefer official aliases such as `read`, `edit`, `search`, `execute`, `agent`, plus namespaced MCP tools like `playwright/` or `github/`	Delegation through model invocation and custom-agent/task tooling
Background or cloud agent flow	Usually inherits the VS Code custom-agent model	Reuse a VS Code-oriented variant unless the host removes or overrides tools	Same as host runtime
SDK-hosted flow	Treat as host-defined until proven otherwise	Do not assume `.agent.md` fields or tool names map 1:1; align the variant to the runtime actually consuming it	Host-dependent

Assignment Heuristics

Start from the minimum capability set needed for the role.
Assign tools per runtime, not per persona. The reviewer persona may be read-only in every runtime, but the concrete tool names still differ.
Put MCP server selection in the variant that can actually load it.
Keep body text free of runtime-specific tool names unless the file is explicitly runtime-locked.
Re-measure orchestration after tool changes. A variant that adds subagent access, shell execution, or MCP tools has changed its effective safety boundary.

Failure Modes to Audit

A VS Code-only tool name in a github-copilot variant silently disappears.
A shared instruction body mentions a tool token that only exists in one runtime.
A runtime variant inherits an all-tools default because tools was omitted unintentionally.
A variant's invocation mode does not match its intent because disable-model-invocation or user-invocable was left at its default.
A team treats Windows, Linux, or macOS as the runtime boundary when the real difference is the Copilot host.
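
Several of these failure modes can be detected from parsed frontmatter alone. The sketch below assumes the frontmatter has already been loaded into a dictionary; the alias set is only the list named in this document, not a complete official registry, and `codebase` in the usage note stands in for any non-alias tool name.

```python
# Aliases named in this document; extend as needed for your runtime.
ALIASES = {"read", "edit", "search", "execute", "agent"}
VALID_TARGETS = {"vscode", "github-copilot"}


def audit_variant(frontmatter: dict) -> list:
    """Return human-readable findings for one agent variant's frontmatter."""
    findings = []
    target = frontmatter.get("target")
    if target not in VALID_TARGETS:
        findings.append(f"target {target!r} is not one of {sorted(VALID_TARGETS)}")
    if "tools" not in frontmatter:
        findings.append("tools omitted: variant inherits the all-tools default")
        return findings
    if target == "github-copilot":
        for tool in frontmatter["tools"]:
            # Namespaced MCP tools contain a slash, e.g. playwright/screenshot.
            if tool not in ALIASES and "/" not in tool:
                findings.append(
                    f"tool {tool!r} is not a known alias or namespaced MCP tool; "
                    "it may be silently ignored"
                )
    return findings
```

For example, `audit_variant({"target": "github-copilot", "tools": ["read", "codebase"]})` yields one finding for `codebase`.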

Prompt Engineering for Enforcement

Agent prompts in a harness serve a different purpose than general-purpose prompts. They are mechanical contracts, not creative guidance.

Be exhaustive about rules, terse about explanation. Don't explain why a rule exists in the agent prompt — put that in docs. The prompt should list exactly what to check.
Include remediation in the output format. When an observer reports a violation, it should include the fix instruction. This becomes context for the actor agent that fixes it.
Avoid open-ended judgment calls. "Review code quality" is too vague. "Check that every public function in src/server/ has a corresponding test in tests/server/" is enforceable.
Reference, don't inline. Agent prompts should point to ARCHITECTURE.md or docs/conventions/ for rule details rather than reproducing them. This prevents the prompt from going stale independently.

Strategy 3: Operational Audit Protocols

Harnesses usually fail in the operational details, not in the high-level design. Convert repo-specific review checklists into a reusable audit protocol with explicit passes and measurable outputs.

The Five Audit Passes

Pass	Question	Typical checks
Structural contract	Does each instruction or agent file contain the sections the workflow depends on?	Required sections present, numbered rules complete, one shared protocol block instead of duplicates, output contracts match referenced artifacts
Budget discipline	Is every prompt body below the runtime limit with safety margin?	Soft ceiling below hard max, frontmatter excluded from measurement, embedded agent body re-measured after every instruction change
Consistency	Do names, signals, paths, and schemas line up across files?	Exact agent names, exact delegation targets, one signal vocabulary, one path convention, no deprecated frontmatter keys
Ownership	Does every writable artifact have exactly one owner?	Planner-only artifacts stay planner-owned, shared progress artifacts are append-only, reviewers observe rather than overwrite
Distribution	Will the runtime load this deterministically?	Build-time embedding or bundling, no runtime instruction reads for CLI-only bodies, official plugin schema only

Audit Output Contract

Every audit pass should report findings in a fixed shape:

Field	Meaning
Artifact	File, folder, or workflow surface being checked
Invariant	Exact rule that failed
Evidence	Concrete mismatch: missing section, wrong name, over-budget body, conflicting owner
Remediation	Smallest change that restores consistency
Re-run trigger	What future edit should force this pass to run again
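
One way to keep the shape fixed is to encode it as a typed record so every audit pass emits the same fields. This is a minimal sketch; the class name, the snake_case field names, and the JSON serialization are illustrative choices rather than a prescribed format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    artifact: str       # file, folder, or workflow surface being checked
    invariant: str      # exact rule that failed
    evidence: str       # concrete mismatch
    remediation: str    # smallest change that restores consistency
    rerun_trigger: str  # future edit that should force this pass again


def to_report(findings: list) -> str:
    """Serialize findings as JSON so agents and CI can consume them."""
    return json.dumps([asdict(f) for f in findings], indent=2)


example = Finding(
    artifact="reviewer.agent.md",
    invariant="tools must be an explicit whitelist",
    evidence="tools key is missing from frontmatter",
    remediation='add tools: ["read", "search"]',
    rerun_trigger="any frontmatter change",
)
```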

Compression Is a Harness Capability

Prompt compression is not cosmetic. It preserves budget for task context and reduces drift.

Always remove:

Motivational preambles before the real contract
Inline explanations of why a rule exists
Duplicated protocol blocks repeated per mode
Example output blocks that only restate an existing schema
Verbose decision trees better expressed as flat steps

Always preserve:

All numbered rules in the enforcement contract
Every supported mode name and its complete step list
Exact path patterns and naming conventions
Contract field names, required/optional status, and signal schema
Preflight gates and failure-mode handling

What to Measure

For every auditable harness, keep a small measurement set close at hand:

Prompt body length after frontmatter stripping.
Combined wrapper-plus-embedded body length for distributed agents.
Count of active signal types and whether all files use the same schema.
Artifact ownership table: one writer per artifact, explicit shared append-only logs.
Revalidation triggers: which edits force which audit passes.
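
The first measurement, body length after frontmatter stripping, might look like the sketch below. The 30,000-character hard limit is the agent figure cited in the context-tier table; the soft ceiling value is a hypothetical safety margin that you should choose yourself.

```python
HARD_LIMIT = 30_000    # agent prompt character limit cited earlier in this document
SOFT_CEILING = 27_000  # hypothetical safety margin


def strip_frontmatter(text: str) -> str:
    """Drop a leading YAML frontmatter block delimited by '---' lines."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[i + 1:])
    return text  # no frontmatter, or an unterminated block: measure everything


def body_budget(text: str) -> tuple:
    """Return (body length in characters, status string)."""
    length = len(strip_frontmatter(text))
    if length > HARD_LIMIT:
        status = "over hard limit"
    elif length > SOFT_CEILING:
        status = "over soft ceiling"
    else:
        status = "ok"
    return length, status
```

For distributed agents, the same function can be applied to the combined wrapper-plus-embedded body after the build step has produced it.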

Strategy 4: Entropy Management

Every agent-generated line of code can introduce drift. Entropy management is the discipline of detecting and correcting drift before it compounds.

Golden Principles Framework

Define a small set (5–10) of non-negotiable, mechanically verifiable rules:

#	Principle	Verification method
1	Shared utilities over hand-rolled helpers	Lint: flag duplicate utility patterns
2	Parse data at boundaries, never YOLO-probe	Lint: detect untyped API calls
3	Structured logging with correlation IDs	Lint: flag `console.log` / raw `print`
4	One module = one domain, no cross-domain imports	Structural test: import graph validation
5	Every public API has a test	Coverage check: map exports → test files

Key insight: Golden principles must be verifiable by linters, structural tests, or agents — not just documented. If you can't automate the check, it's a guideline, not a golden principle.
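
As an illustration of a mechanically verifiable principle, the sketch below checks principle 4 (no cross-domain imports) against a pre-extracted import map and emits messages that already contain the remediation. It assumes the first path segment names the domain and that a `shared/` prefix is always allowed; both conventions and the file names in the usage note are hypothetical.

```python
def boundary_violations(imports: dict, shared_prefix: str = "shared/") -> list:
    """imports maps a source file path to the module paths it imports."""
    messages = []
    for source, targets in sorted(imports.items()):
        source_domain = source.split("/")[0]
        for target in targets:
            target_domain = target.split("/")[0]
            # Imports from the shared area or from the same domain are allowed.
            if target.startswith(shared_prefix) or target_domain == source_domain:
                continue
            messages.append(
                f"{source}: import from {target_domain}/ in {source_domain}/ "
                f"violates boundary; move shared types to {shared_prefix}types/"
            )
    return messages
```

Given `{"billing/invoice.ts": ["auth/session.ts", "shared/types/user.ts"]}`, only the `auth/` import is reported, and the message tells the agent how to fix it.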

Garbage Collection Cadence

Frequency	What to check	Agent type
Per-commit (CI)	Lint rules, structural tests, doc cross-references	Deterministic (linters/tests)
Daily	Doc freshness, quality score drift, stale TODOs	Maintainer agent
Weekly	Golden principle deviations, pattern duplication, tech debt inventory	Maintainer agent, batch prompt
Per-sprint	Full quality scoring across all domains/layers	Observer agent + human review

Quality Scoring Pattern

Maintain a versioned QUALITY_SCORE.md that grades each domain and layer:

| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth   | A     | A      | B       | B     | C    | B       |
| Billing| A     | B      | B       | C     | D    | C+      |
| Search | B     | B      | C       | D     | F    | D+      |

This gives both humans and agents a map of where debt lives. A maintainer agent can read this and prioritize: "Search.Docs is F — generate missing documentation for the Search domain."
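
A maintainer agent or script could derive that priority list directly from the table. The following sketch parses a pipe table like the one above and ranks cells worst-first; the grade ordering is an assumption, and any grade outside it raises a `ValueError`.

```python
GRADE_ORDER = ["F", "D", "D+", "C", "C+", "B", "B+", "A"]  # worst to best (assumed scale)

TABLE = """
| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth   | A     | A      | B       | B     | C    | B       |
| Billing| A     | B      | B       | C     | D    | C+      |
| Search | B     | B      | C       | D     | F    | D+      |
"""


def parse_scores(markdown: str) -> list:
    """Return (domain, layer, grade) triples, skipping the Overall column."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    triples = []
    for row in body:
        for layer, grade in zip(header[1:], row[1:]):
            if layer != "Overall":
                triples.append((row[0], layer, grade))
    return triples


def worst_first(markdown: str, limit: int = 3) -> list:
    """Lowest-graded cells first; ties keep table order."""
    return sorted(parse_scores(markdown), key=lambda t: GRADE_ORDER.index(t[2]))[:limit]
```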

Entropy Detection Strategies

Pattern divergence scan. Compare how a pattern is implemented across modules. If 8 out of 10 services use structured logging but 2 use console.log, flag the outliers.
Doc/code freshness ratio. If src/billing/ changed 30 times this month but docs/billing.md changed 0 times, the docs are likely stale.
Lint error message engineering. Write custom lint error messages that include remediation instructions. When a linter reports "Import from auth/ in billing/ violates boundary — move shared types to shared/types/", the agent can act on it directly. The error message is the agent's instruction.
Snapshot diffing. Periodically generate a snapshot of the repo structure (file tree, import graph, export map) and diff against the previous snapshot. Large deltas without corresponding doc updates signal drift.
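
The pattern divergence scan reduces to finding a dominant pattern and listing the modules that deviate from it. This sketch assumes some earlier step has already labelled each module with the pattern it uses; the 0.7 majority threshold is an arbitrary illustrative default.

```python
from collections import Counter


def divergent_modules(pattern_by_module: dict, min_majority: float = 0.7) -> list:
    """Return modules that deviate from the dominant pattern, if one exists."""
    if not pattern_by_module:
        return []
    counts = Counter(pattern_by_module.values())
    dominant, hits = counts.most_common(1)[0]
    if hits / len(pattern_by_module) < min_majority:
        return []  # no clear convention yet, so nothing counts as an outlier
    return sorted(m for m, p in pattern_by_module.items() if p != dominant)


# Eight services use structured logging, two use console.log.
services = {f"svc{i}": "structured" for i in range(8)}
services.update({"svc8": "console.log", "svc9": "console.log"})
```

For the ten services above, the scan flags `svc8` and `svc9`, matching the 8-out-of-10 example in the text.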

Strategy 5: Application Legibility

Agents can only verify what they can observe. Extending the agent's senses beyond static files dramatically increases harness leverage.

Legibility Layers

Layer	What the agent can see	Copilot mechanism	Harness value
Static files	Source code, docs, config	Built-in (read/search)	Baseline — always available
Build/test output	Compilation errors, test results, lint output	`execute` tool	Validates correctness
Browser state	DOM, screenshots, navigation, console errors	MCP (Playwright)	UI verification without manual testing
Runtime telemetry	Logs, metrics, traces	Custom MCP server	Performance and reliability validation
Repository state	Issues, PRs, CI status, branch state	MCP (GitHub)	Workflow-aware decisions

Legibility Best Practices

Make the app bootable per task. If an agent can't start the application in isolation, it can't validate its own work. Design for single-command startup (Docker Compose, Aspire, etc.).
Wire browser automation for UI verification. An agent that edits a Blazor component should be able to screenshot the result. This closes the feedback loop without human eyes.
Expose observability to the agent. If you have logs and metrics, make them queryable. An agent instruction like "ensure no error logs during startup" is only enforceable if the agent can read the logs.
Ephemeral environments. Each agent task should run against an isolated instance. Shared environments introduce cross-task interference that agents can't reason about.

Strategy 6: Harness Evolution

A harness is not a one-time setup. It evolves with the codebase.

Maturity Model

Level	Context	Constraints	Entropy	Legibility
1 — Ad-hoc	No instructions	No enforcement	No cleanup	Static files only
2 — Documented	`copilot-instructions.md` exists	Conventions documented but not enforced	Manual cleanup	Build/test output
3 — Scoped	Path-scoped instructions + skills	Linters catch some violations	Weekly batch prompts	Browser automation
4 — Enforced	Full three-tier context	Structural tests + CI gates block violations	Daily maintainer agents	Runtime telemetry
5 — Self-healing	Agent-maintained instructions	Agents detect and fix violations autonomously	Continuous GC with quality scoring	Full stack legibility

Evolution Workflow

Assess. Use the harness assessment from the harness-engineering skill. Rate each component 0–2.
Target one level up. Don't jump from Level 1 to Level 5. Move one level at a time per component.
Instrument before enforcing. Before adding a linter, add an observer agent to measure the current state. Quantify violations before blocking them.
Encode human taste continuously. Every time a human reviews agent output and says "that's not how we do it here," capture the rule as a lint rule, instruction update, or golden principle. Never make the same correction twice.
Retroactively harness brownfield code. Start with context engineering (cheapest, highest ROI), add constraints per-module as you touch them, and add entropy management last.

Revalidation Triggers

Tie harness maintenance to concrete change events instead of vague periodic reviews.

Change	Re-run
Instruction body changed	Structural contract, budget discipline, consistency
Agent frontmatter or delegation changed	Consistency, distribution
Signal schema changed	Consistency across every agent and instruction
Workflow paths changed	Structural contract, ownership, consistency
Plugin or bundle pipeline changed	Distribution and budget discipline

If a harness cannot tell you which checks to rerun after a change, it is still too implicit.
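
Making the trigger table explicit in code is one way to remove that implicitness. The sketch below encodes the table as a lookup; the change keys are hypothetical identifiers for the rows, and the pass names follow the five audit passes.

```python
# Each key stands for one row of the revalidation table.
TRIGGERS = {
    "instruction_body": ["structural contract", "budget discipline", "consistency"],
    "agent_frontmatter_or_delegation": ["consistency", "distribution"],
    "signal_schema": ["consistency"],
    "workflow_paths": ["structural contract", "ownership", "consistency"],
    "plugin_or_bundle_pipeline": ["distribution", "budget discipline"],
}

# Canonical order of the five audit passes.
PASS_ORDER = [
    "structural contract",
    "budget discipline",
    "consistency",
    "ownership",
    "distribution",
]


def passes_to_rerun(changes: list) -> list:
    """Union of the passes triggered by the given changes, in canonical order."""
    needed = {audit for change in changes for audit in TRIGGERS[change]}
    return [audit for audit in PASS_ORDER if audit in needed]
```

A pull request that touches both agent frontmatter and the bundle pipeline would rerun budget discipline, consistency, and distribution.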

Scaling Across Repositories

For organizations with multiple repos:

Harness template repos. Create starter repo templates with pre-configured instruction files, standard agent fleet, and CI validation jobs. This parallels "golden path" service templates, optimized for agent-driven development.
Org-level agents. Define organization-wide enforcement agents (via .github-private repo) that apply everywhere. Keep repo-level agents for domain-specific concerns.
Shared skills as packages. Extract common skills (architecture validation, deployment runbooks) into a shared repo and publish to personal skill directories for cross-repo reuse.
Federated golden principles. Maintain a core set of golden principles organization-wide, with repo-specific extensions. Enforce the core set from org-level; let repos own their extensions.

Anti-Patterns

Anti-pattern	Why it fails	Better strategy
Monolithic instructions	Crowds out task context at 1000+ lines; rots fast	Three-tier context layering with 100-line Tier 1
One omniscient agent	No tool boundaries; can't reason about everything at once	Agent fleet with Observer / Actor / Maintainer roles
Duplicate context	Same rule in instructions, skills, and agent prompts diverges over time	Single source of truth with cross-references
Verbal-only enforcement	"We always do X" isn't legible to agents	Encode in linters, tests, or observer agents
Big-bang cleanup	Spending 20% of the week cleaning up "AI slop" doesn't scale	Continuous garbage collection at daily cadence
Static-only legibility	Agent can't verify its own UI changes or performance	Wire browser automation and observability MCP
Set-and-forget harness	Codebase evolves, harness doesn't	Treat instructions as code; CI-check freshness
Open-ended agent contracts	"Review code quality" produces inconsistent results	Precise enforcement contracts with numbered rules
Checklist-free maintenance	Review quality depends on memory and reviewer taste	Encode reusable audit passes with fixed outputs and triggers

Repository

arisng/github-copilot-fc

First Seen

Mar 9, 2026
