harness-engineering-copilot
Harness Engineering for GitHub Copilot
Strategies and patterns for building high-leverage harnesses through Copilot's customization system. A harness is the scaffolding — context layering, constraint enforcement, entropy management, and feedback loops — that makes agents reliably productive.
Related skills:
harness-engineering— generic methodology (tool-agnostic).agent-customization— Copilot file formats, syntax, YAML frontmatter, tool aliases, and configuration rules. Defer all "how do I write this file" questions there.
Sources: OpenAI — Harness Engineering | Martin Fowler — Harness Engineering
Use this skill in two modes:
- Design mode — decide how to layer context, split roles, and add enforcement.
- Audit mode — review an existing Copilot harness with fixed passes, explicit budgets, and consistency checks.
Strategy 1: Context Layering
Context is a scarce resource. The harness must deliver the right context at the right time — not dump everything into every interaction.
The Three-Tier Context Model
| Tier | Copilot primitive | When it loads | Budget | What belongs here |
|---|---|---|---|---|
| Always-on | copilot-instructions.md, path-scoped .instructions.md |
Every interaction (auto) | ~100 lines repo-wide; ~50 lines per path scope | Architecture map, boundary rules, pointer references |
| On-demand | Agent skills (SKILL.md) |
When Copilot detects relevance | <500 lines body + unlimited references | Deep domain knowledge, schemas, runbooks |
| Task-scoped | Prompt files (.prompt.md), agent prompts |
When user explicitly invokes | No hard limit (30K chars for agents) | Step-by-step workflows, checklists, batch operations |
Cross-Check with agent-customization skill
Use the agent-customization decision flow to decide which primitive owns which layer instead of choosing files ad hoc.
| Context layer | Preferred primitive | Why this fits | agent-customization check |
|---|---|---|---|
| Always-on | Workspace instructions | Auto-loaded context should only hold stable routing rules and boundary constraints | Keep applyTo narrow; avoid applyTo: "**" unless the rule truly applies everywhere |
| Always-on, path-specific | File instructions | Architectural rules should follow folder or module boundaries | One instruction file per boundary, with explicit applyTo globs matching repo structure |
| On-demand | Skills | Deep reference material should load only when the task is relevant | Put discovery phrases in description; keep body lean and offload depth to references/assets |
| Task-scoped | Prompts | Repeatable batch workflows should be explicit, not always resident in context | Use prompts for single focused tasks with parameters |
| Task-scoped with isolation | Custom agents | Multi-stage work or restricted tool use needs isolated context and an enforcement contract | Use an agent when you need delegation, context isolation, or different tool boundaries per stage |
Before adding a new file, ask the agent-customization questions in this order:
- Scope: Is this workspace-shared behavior or a personal preference?
- Primitive: Is this always-on guidance, on-demand knowledge, a single focused prompt, or a delegated subagent workflow?
- Discovery: Will Copilot find it from the
description, or are you relying on file names and hope? - Load cost: Does this belong in auto-loaded context, or should it stay dormant until invoked?
If you cannot answer those four questions clearly, the layer boundary is still underspecified.
Context Layering Best Practices
- Map, don't manual. Always-on instructions should be a table of contents pointing to deeper sources elsewhere in the repo. If your
copilot-instructions.mdexceeds ~100 lines, you're overloading Tier 1. - One instruction file per architectural boundary. Use
applyToglob patterns that mirror your module/layer structure. When an agent editssrc/server/**, it should receive server-layer rules — not frontend rules. - Skills as progressive disclosure. Move detailed reference material (DB schemas, API specs, deployment procedures) into skills with
references/folders. The SKILL.md body stays lean; the agent loadsreferences/schema.mdonly when actually needed. - Honor the primitive boundary. If the content is multi-step workflow logic, it probably belongs in a prompt or custom agent, not in always-on instructions. If it is durable domain knowledge, it probably belongs in a skill, not a prompt.
- Redundancy kills. Never duplicate content across tiers. If a convention is in
copilot-instructions.md, don't repeat it in a skill. Use cross-references instead. - Pointer chains, not deep nesting. Every reference should be at most one hop from an entry point. If an agent needs to follow
AGENTS.md → docs/index.md → docs/design-docs/index.md → docs/design-docs/auth.md, that's too deep. Flatten toAGENTS.md → docs/design-docs/auth.md. - Treat
descriptionas part of the harness. A skill or agent that cannot be discovered by itsdescriptionis effectively missing from the harness, even if the file exists.
Context Freshness Strategy
Stale context is worse than no context — agents confidently follow outdated rules.
- CI validation: Add a job that checks cross-references between instruction files and actual repo paths. If an instruction references
docs/RELIABILITY.mdand the file doesn't exist, fail the build. - Git-blame-based staleness: Flag instruction files not updated in >90 days alongside the code they govern. If
backend.instructions.mdhasn't changed butsrc/server/has had 50 commits, something is probably stale. - Changelogs as triggers: When a significant architectural change merges, add "update harness" to the PR checklist. Treat instructions as code that must be kept in sync.
Strategy 2: Multi-Agent Constraint Enforcement
A single omniscient agent is an anti-pattern. Design an agent fleet (agent squad/agent swarm) where each agent has a narrow responsibility and minimal tool access.
The Observer / Actor / Maintainer Triad
| Role | Purpose | Tool profile | Invocation |
|---|---|---|---|
| Observer | Detect violations, audit quality, report findings | read + search only |
On-demand or scheduled |
| Actor | Implement changes within enforced boundaries | read + edit + search + execute |
Task-driven |
| Maintainer | Fix drift, update docs, refactor toward golden principles | read + edit + search |
Scheduled/periodic |
Designing an Agent Fleet
Step 1: Identify enforcement surfaces. List every invariant you want to enforce (dependency direction, naming, logging, test coverage, doc freshness). Each surface maps to an Observer agent.
Step 2: Define tool boundaries. For each agent, ask: What's the minimum tool set needed? An architecture reviewer needs read + search, never edit. A test generator needs execute to run tests. Over-provisioning tools undermines the harness.
Step 3: Write agent prompts as enforcement contracts. The agent's markdown body is its enforcement contract — a precise specification of what to check, what to flag, and how to remediate. Structure every enforcement agent prompt as:
1. SCOPE: What files/modules to examine
2. RULES: Numbered invariants to check (reference docs/ for details)
3. PROCESS: Step-by-step verification procedure
4. OUTPUT: Exact format for findings (file, line, violation, remediation)
5. BOUNDARIES: What the agent must NOT do
Step 4: Chain agents for complex workflows. Use subagent invocation (agent tool alias) to compose:
- Planner (read-only) → produces a plan
- Implementor (full tools) → executes the plan
- Reviewer (read-only) → validates the result
This mirrors the "Ralph Wiggum Loop" pattern: agents review each other's work in a feedback loop until all reviewers are satisfied.
Least-Privilege Patterns
| Pattern | Strategy | When to use |
|---|---|---|
| Read-only observer | tools: ["read", "search"] |
Auditing, reviewing, scanning |
| Edit-no-execute | tools: ["read", "edit", "search"] |
Documentation updates, config changes, refactoring |
| Full actor | tools: ["read", "edit", "search", "execute"] |
Implementation requiring test runs or builds |
| Scoped MCP | tools: ["read", "search", "playwright/screenshot"] |
Agents that need one specific external capability |
| Subagent-only | tools: ["read", "search", "agent"] |
Orchestrators that delegate all work |
Runtime-Specific Tool Assignment
Use runtime to describe the Copilot execution environment of the agent file: VS Code, GitHub Copilot coding agent, CLI-backed coding flows, background agents, cloud agents, or an SDK-hosted flow that adopts one of those schemas. Use platform for OS scope such as Windows, WSL/Linux, or macOS.
When the same specialist role needs different frontmatter, tool namespaces, or delegation mechanics across runtimes, create an agent variant: a runtime-specific .agent.md wrapper over shared instructions.
Official Runtime Facts to Design Around
| Fact | Design implication |
|---|---|
target is officially vscode or github-copilot |
Scope each variant to the runtime family it is meant for instead of assuming one file behaves identically everywhere |
tools defaults to all tools when omitted; tools: [] disables all tools |
Omit tools only when you genuinely want broad capability; otherwise whitelist aggressively |
| Unrecognized tool names are ignored | A mixed-runtime tool list can fail silently; audit tool names per runtime rather than trusting parse success |
disable-model-invocation is the canonical control and infer is retired |
Use the new field when deciding whether a variant is user-selectable, subagent-only, or both |
VS Code exposes agents, argument-hint, handoffs, and richer tool discovery |
Keep VS Code orchestration and guided transitions in the VS Code variant instead of leaking them into shared instructions |
GitHub Copilot runtime supports mcp-servers and namespaced MCP tools |
Put runtime-specific MCP wiring in the github-copilot variant, not in the shared behavior contract |
Agent Variant Pattern
Split each specialized agent into two layers:
- Shared instruction layer: persona, rules, workflow, artifacts, output contract, and tool-agnostic wording.
- Runtime variant layer:
target,tools,agents,handoffs,mcp-servers, and any runtime-specific tool references.
The shared layer should say what capability is required, not which runtime-specific tool name to call. Write "read the file", "run tests", or "delegate to the reviewer" rather than embedding a runtime-specific token.
Tool Assignment by Runtime Family
Use the official target values in the variant frontmatter: vscode or github-copilot.
| Runtime family | Frontmatter focus | Tool assignment rule | Delegation model |
|---|---|---|---|
| VS Code | target: vscode, optional agents, argument-hint, handoffs, model |
Use VS Code tool names, toolsets, extension tools, or MCP namespaced tools. If agents is specified, include the agent tool. |
Explicit subagent wiring via agents |
| GitHub Copilot | target: github-copilot, optional mcp-servers, disable-model-invocation, user-invocable |
Prefer official aliases such as read, edit, search, execute, agent, plus namespaced MCP tools like playwright/* or github/* |
Delegation through model invocation and custom-agent/task tooling |
| Background or cloud agent flow | Usually inherits the VS Code custom-agent model | Reuse a VS Code-oriented variant unless the host removes or overrides tools | Same as host runtime |
| SDK-hosted flow | Treat as host-defined until proven otherwise | Do not assume .agent.md fields or tool names map 1:1; align the variant to the runtime actually consuming it |
Host-dependent |
Assignment Heuristics
- Start from the minimum capability set needed for the role.
- Assign tools per runtime, not per persona. The reviewer persona may be read-only in every runtime, but the concrete tool names still differ.
- Put MCP server selection in the variant that can actually load it.
- Keep body text free of runtime-specific tool names unless the file is explicitly runtime-locked.
- Re-measure orchestration after tool changes. A variant that adds subagent access, shell execution, or MCP tools has changed its effective safety boundary.
Failure Modes to Audit
- A VS Code-only tool name in a
github-copilotvariant silently disappears. - A shared instruction body mentions a tool token that only exists in one runtime.
- A runtime variant inherits an all-tools default because
toolswas omitted unintentionally. - A subagent-only variant still allows model invocation because
disable-model-invocationwas not set. - A team treats Windows, Linux, or macOS as the runtime boundary when the real difference is the Copilot host.
Prompt Engineering for Enforcement
Agent prompts in a harness serve a different purpose than general-purpose prompts. They are mechanical contracts, not creative guidance.
- Be exhaustive about rules, terse about explanation. Don't explain why a rule exists in the agent prompt — put that in docs. The prompt should list exactly what to check.
- Include remediation in the output format. When an observer reports a violation, it should include the fix instruction. This becomes context for the actor agent that fixes it.
- Avoid open-ended judgment calls. "Review code quality" is too vague. "Check that every public function in
src/server/has a corresponding test intests/server/" is enforceable. - Reference, don't inline. Agent prompts should point to
ARCHITECTURE.mdordocs/conventions/for rule details rather than reproducing them. This prevents the prompt from going stale independently.
Strategy 3: Operational Audit Protocols
Harnesses usually fail in the operational details, not in the high-level design. Convert repo-specific review checklists into a reusable audit protocol with explicit passes and measurable outputs.
The Five Audit Passes
| Pass | Question | Typical checks |
|---|---|---|
| Structural contract | Does each instruction or agent file contain the sections the workflow depends on? | Required sections present, numbered rules complete, one shared protocol block instead of duplicates, output contracts match referenced artifacts |
| Budget discipline | Is every prompt body below the runtime limit with safety margin? | Soft ceiling below hard max, frontmatter excluded from measurement, embedded agent body re-measured after every instruction change |
| Consistency | Do names, signals, paths, and schemas line up across files? | Exact agent names, exact delegation targets, one signal vocabulary, one path convention, no deprecated frontmatter keys |
| Ownership | Does every writable artifact have exactly one owner? | Planner-only artifacts stay planner-owned, shared progress artifacts are append-only, reviewers observe rather than overwrite |
| Distribution | Will the runtime load this deterministically? | Build-time embedding or bundling, no runtime instruction reads for CLI-only bodies, official plugin schema only |
Audit Output Contract
Every audit pass should report findings in a fixed shape:
| Field | Meaning |
|---|---|
| Artifact | File, folder, or workflow surface being checked |
| Invariant | Exact rule that failed |
| Evidence | Concrete mismatch: missing section, wrong name, over-budget body, conflicting owner |
| Remediation | Smallest change that restores consistency |
| Re-run trigger | What future edit should force this pass to run again |
Compression Is a Harness Capability
Prompt compression is not cosmetic. It preserves budget for task context and reduces drift.
Always remove:
- Motivational preambles before the real contract
- Inline explanations of why a rule exists
- Duplicated protocol blocks repeated per mode
- Example output blocks that only restate an existing schema
- Verbose decision trees better expressed as flat steps
Always preserve:
- All numbered rules in the enforcement contract
- Every supported mode name and its complete step list
- Exact path patterns and naming conventions
- Contract field names, required/optional status, and signal schema
- Preflight gates and failure-mode handling
What to Measure
For every auditable harness, keep a small measurement set close at hand:
- Prompt body length after frontmatter stripping.
- Combined wrapper-plus-embedded body length for distributed agents.
- Count of active signal types and whether all files use the same schema.
- Artifact ownership table: one writer per artifact, explicit shared append-only logs.
- Revalidation triggers: which edits force which audit passes.
Strategy 4: Entropy Management
Every agent-generated line of code can introduce drift. Entropy management is the discipline of detecting and correcting drift before it compounds.
Golden Principles Framework
Define a small set (5–10) of non-negotiable, mechanically verifiable rules:
| # | Principle | Verification method |
|---|---|---|
| 1 | Shared utilities over hand-rolled helpers | Lint: flag duplicate utility patterns |
| 2 | Parse data at boundaries, never YOLO-probe | Lint: detect untyped API calls |
| 3 | Structured logging with correlation IDs | Lint: flag console.log / raw print |
| 4 | One module = one domain, no cross-domain imports | Structural test: import graph validation |
| 5 | Every public API has a test | Coverage check: map exports → test files |
Key insight: Golden principles must be verifiable by linters, structural tests, or agents — not just documented. If you can't automate the check, it's a guideline, not a golden principle.
Garbage Collection Cadence
| Frequency | What to check | Agent type |
|---|---|---|
| Per-commit (CI) | Lint rules, structural tests, doc cross-references | Deterministic (linters/tests) |
| Daily | Doc freshness, quality score drift, stale TODOs | Maintainer agent |
| Weekly | Golden principle deviations, pattern duplication, tech debt inventory | Maintainer agent, batch prompt |
| Per-sprint | Full quality scoring across all domains/layers | Observer agent + human review |
Quality Scoring Pattern
Maintain a versioned QUALITY_SCORE.md that grades each domain and layer:
| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth | A | A | B | B | C | B |
| Billing| A | B | B | C | D | C+ |
| Search | B | B | C | D | F | D+ |
This gives both humans and agents a map of where debt lives. A maintainer agent can read this and prioritize: "Search.Docs is F — generate missing documentation for the Search domain."
Entropy Detection Strategies
- Pattern divergence scan. Compare how a pattern is implemented across modules. If 8 out of 10 services use structured logging but 2 use
console.log, flag the outliers. - Doc/code freshness ratio. If
src/billing/changed 30 times this month butdocs/billing.mdchanged 0 times, the docs are likely stale. - Lint error message engineering. Write custom lint error messages that include remediation instructions. When a linter reports "Import from
auth/inbilling/violates boundary — move shared types toshared/types/", the agent can act on it directly. The error message is the agent's instruction. - Snapshot diffing. Periodically generate a snapshot of the repo structure (file tree, import graph, export map) and diff against the previous snapshot. Large deltas without corresponding doc updates signal drift.
Strategy 5: Application Legibility
Agents can only verify what they can observe. Extending the agent's senses beyond static files dramatically increases harness leverage.
Legibility Layers
| Layer | What the agent can see | Copilot mechanism | Harness value |
|---|---|---|---|
| Static files | Source code, docs, config | Built-in (read/search) | Baseline — always available |
| Build/test output | Compilation errors, test results, lint output | execute tool |
Validates correctness |
| Browser state | DOM, screenshots, navigation, console errors | MCP (Playwright) | UI verification without manual testing |
| Runtime telemetry | Logs, metrics, traces | Custom MCP server | Performance and reliability validation |
| Repository state | Issues, PRs, CI status, branch state | MCP (GitHub) | Workflow-aware decisions |
Legibility Best Practices
- Make the app bootable per task. If an agent can't start the application in isolation, it can't validate its own work. Design for single-command startup (Docker Compose, Aspire, etc.).
- Wire browser automation for UI verification. An agent that edits a Blazor component should be able to screenshot the result. This closes the feedback loop without human eyes.
- Expose observability to the agent. If you have logs and metrics, make them queryable. An agent instruction like "ensure no error logs during startup" is only enforceable if the agent can read the logs.
- Ephemeral environments. Each agent task should run against an isolated instance. Shared environments introduce cross-task interference that agents can't reason about.
Strategy 6: Harness Evolution
A harness is not a one-time setup. It evolves with the codebase.
Maturity Model
| Level | Context | Constraints | Entropy | Legibility |
|---|---|---|---|---|
| 1 — Ad-hoc | No instructions | No enforcement | No cleanup | Static files only |
| 2 — Documented | copilot-instructions.md exists |
Conventions documented but not enforced | Manual cleanup | Build/test output |
| 3 — Scoped | Path-scoped instructions + skills | Linters catch some violations | Weekly batch prompts | Browser automation |
| 4 — Enforced | Full three-tier context | Structural tests + CI gates block violations | Daily maintainer agents | Runtime telemetry |
| 5 — Self-healing | Agent-maintained instructions | Agents detect and fix violations autonomously | Continuous GC with quality scoring | Full stack legibility |
Evolution Workflow
- Assess. Use the harness assessment from the
harness-engineeringskill. Rate each component 0–2. - Target one level up. Don't jump from Level 1 to Level 5. Move one level at a time per component.
- Instrument before enforcing. Before adding a linter, add an observer agent to measure the current state. Quantify violations before blocking them.
- Encode human taste continuously. Every time a human reviews agent output and says "that's not how we do it here," capture the rule — as a lint rule, instruction update, or golden principle. Never rely on the same correction twice.
- Retroactively harness brownfield code. Start with context engineering (cheapest, highest ROI), add constraints per-module as you touch them, and add entropy management last.
Revalidation Triggers
Tie harness maintenance to concrete change events instead of vague periodic reviews.
| Change | Re-run |
|---|---|
| Instruction body changed | Structural contract, budget discipline, consistency |
| Agent frontmatter or delegation changed | Consistency, distribution |
| Signal schema changed | Consistency across every agent and instruction |
| Workflow paths changed | Structural contract, ownership, consistency |
| Plugin or bundle pipeline changed | Distribution and budget discipline |
If a harness cannot tell you which checks to rerun after a change, it is still too implicit.
Scaling Across Repositories
For organizations with multiple repos:
- Harness template repos. Create starter repo templates with pre-configured instruction files, standard agent fleet, and CI validation jobs. This parallels "golden path" service templates, optimized for agent-driven development.
- Org-level agents. Define organization-wide enforcement agents (via
.github-privaterepo) that apply everywhere. Keep repo-level agents for domain-specific concerns. - Shared skills as packages. Extract common skills (architecture validation, deployment runbooks) into a shared repo and publish to personal skill directories for cross-repo reuse.
- Federated golden principles. Maintain a core set of golden principles organization-wide, with repo-specific extensions. Enforce the core set from org-level; let repos own their extensions.
Anti-Patterns
| Anti-pattern | Why it fails | Better strategy |
|---|---|---|
| Monolithic instructions | Crowds out task context at 1000+ lines; rots fast | Three-tier context layering with 100-line Tier 1 |
| One omniscient agent | No tool boundaries; can't reason about everything at once | Agent fleet with Observer / Actor / Maintainer roles |
| Duplicate context | Same rule in instructions, skills, and agent prompts diverges over time | Single source of truth with cross-references |
| Verbal-only enforcement | "We always do X" isn't legible to agents | Encode in linters, tests, or observer agents |
| Big-bang cleanup | 20% of the week on "AI slop" doesn't scale | Continuous garbage collection at daily cadence |
| Static-only legibility | Agent can't verify its own UI changes or performance | Wire browser automation and observability MCP |
| Set-and-forget harness | Codebase evolves, harness doesn't | Treat instructions as code; CI-check freshness |
| Open-ended agent contracts | "Review code quality" produces inconsistent results | Precise enforcement contracts with numbered rules |
| Checklist-free maintenance | Review quality depends on memory and reviewer taste | Encode reusable audit passes with fixed outputs and triggers |