devpilot-harness-engineering
Harness Engineering
Overview
An agent harness is everything around a coding agent except the model itself: the guides that steer it before it acts, the sensors that catch drift after it acts, the context it reads, the sub-agents it delegates to, and the automation that maintains code quality in the background.
Core principle: When agents write most of the code, the engineering team's primary job shifts from writing code to building the harness that lets agents write good code. Constraints you'd normally postpone until hundreds of engineers are onboard become day-one prerequisites.
Operating definition (Mitchell Hashimoto): every time the agent makes a mistake, engineer the harness so it cannot make that mistake again. The harness is never "done" — it grows from observed failures, not speculation.
Synthesized from OpenAI's "Harness engineering: leveraging Codex in an agent-first world" (a ~5-month experiment by a team growing from 3 to 7 engineers, producing roughly 1M lines of code and ~1,500 merged PRs with no manually-written code), Martin Fowler's harness-engineering article, and HumanLayer's "Skill Issue" writeup.
When to Use
Use when:
- Starting a repo where agents will author most code
- Agent output quality is decaying (style drift, inconsistent patterns, duplicated utilities)
- You're spending more time reviewing agent PRs than the agents spent writing them
- Deciding where to invest: AGENTS.md, skills, sub-agents, linters, fitness functions, background refactor jobs
- Retrofitting guardrails onto a codebase that now has agents loose in it
Don't use when:
- You only need style rules for a specific language → use devpilot-google-go-style or similar
- You need PR review mechanics → use devpilot-pr-review
- The task is a one-off script, not an ongoing codebase
Core Model
A harness has two control types applied across three regulation categories:
| Control | Direction | Examples |
|---|---|---|
| Guide | Feedforward — prevents bad output before it happens | AGENTS.md, skills, architecture docs, example-driven style guides |
| Sensor | Feedback — detects bad output after it happens | Linters, type checkers, tests, fitness functions, review agents |
| Category | What it regulates | Maturity |
|---|---|---|
| Maintainability | Style, structure, consistency | Most mature — reuse existing tooling |
| Architecture fitness | Performance, module boundaries, dependencies | Medium — needs fitness functions |
| Behavior | Functional correctness | Least mature — tests + manual |
Controls are either computational (deterministic, ms–s, cheap) or inferential (LLM-based, slower, richer feedback). Shift cheap ones left (pre-commit), expensive ones right (CI, background jobs).
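As a concrete illustration of the split, here is a minimal sketch of a sensor registry that tags each check as computational or inferential and runs only the cheap ones at pre-commit, deferring the rest to CI or background jobs. The names, commands, and stages are hypothetical placeholders, not part of the source.

```python
from dataclasses import dataclass
from subprocess import run
from typing import Literal

@dataclass
class Sensor:
    name: str
    kind: Literal["computational", "inferential"]     # deterministic vs LLM-based
    stage: Literal["pre-commit", "ci", "background"]   # where it runs
    command: list[str]                                 # what to execute

# Hypothetical registry: the commands stand in for your own tooling.
SENSORS = [
    Sensor("lint", "computational", "pre-commit", ["ruff", "check", "."]),
    Sensor("typecheck", "computational", "pre-commit", ["mypy", "src"]),
    Sensor("arch-fitness", "computational", "ci", ["pytest", "tests/architecture"]),
    Sensor("taste-review", "inferential", "background", ["./scripts/review-agent.sh"]),
]

def run_stage(stage: str) -> bool:
    """Run every sensor registered for a stage; return False if any fails."""
    ok = True
    for sensor in (s for s in SENSORS if s.stage == stage):
        result = run(sensor.command, capture_output=True, text=True)
        if result.returncode != 0:
            ok = False
            # Emit output the agent can act on, not just a human-readable log.
            print(f"[{sensor.name}] failed; fix before re-running:\n{result.stdout}{result.stderr}")
    return ok

if __name__ == "__main__":
    import sys
    sys.exit(0 if run_stage(sys.argv[1] if len(sys.argv) > 1 else "pre-commit") else 1)
```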
Quick Reference — Where to Invest First
| Symptom | Add this |
|---|---|
| Agent doesn't know project conventions | AGENTS.md / CLAUDE.md at repo root (keep it short — HumanLayer recommends under ~60 lines; treat ~100 as a hard split point) |
| Agent repeats the same task awkwardly | A skill with SKILL.md + references/ |
| Agent floods its own context with noise | Delegate to sub-agents; let them cite, not dump |
| Style keeps drifting | Custom linter rules + pre-commit hook |
| Module boundaries get violated | ArchUnit-style structural tests |
| Quality decays PR-over-PR | Encode "golden principles" (taste/structure, not style rules — those belong in linters) + background GC refactor agent |
| Agent "finishes" but build is broken | Post-tool hooks that run build/typecheck and feed errors back (sketched after this table) |
| Too many MCP tools; context blown | Prune tool descriptions; scope tools per task. Prefer CLIs the model already knows from training over custom MCP wrappers (HumanLayer) |
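One concrete shape for the post-tool hook mentioned above: a small script the agent runtime invokes after each file-editing tool call, whose non-zero exit and stderr are fed back into the conversation. This is a minimal sketch; the runtime behavior and the `npm run typecheck` command are assumptions, not prescriptions.

```python
#!/usr/bin/env python3
"""Post-edit hook sketch: run the repo's typecheck and surface errors to the agent.

Assumes the agent runtime executes this script after file-editing tool calls and
feeds a non-zero exit code plus stderr back into the model's context.
"""
import subprocess
import sys

CHECK = ["npm", "run", "typecheck"]  # placeholder: substitute your build/typecheck command

result = subprocess.run(CHECK, capture_output=True, text=True)
if result.returncode != 0:
    # Phrase the failure as an instruction the agent can act on directly.
    sys.stderr.write(
        "Typecheck failed after your last edit. Fix these errors before declaring the task done:\n"
        + result.stdout + result.stderr
    )
    sys.exit(2)  # non-zero so the harness treats the edit as incomplete
sys.exit(0)
```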
Day-One Order of Operations
Work top-to-bottom; each step makes the next cheaper.
1. Architectural constraints — decide module boundaries, error model, dependency surface before the agent adds a tenth variant of each. See references/architecture.md.
2. Agent context — write a tight AGENTS.md and identify which specialized knowledge belongs in skills vs sub-agents vs tool descriptions. See references/agents.md.
3. Guides + sensors — pair each important rule with a mechanical check (linter, test, fitness function). See references/guides-and-sensors.md; a minimal fitness-function sketch follows this list.
4. Depth-first decomposition + plan index — size each block so one block = one context window = one reviewable PR, and track what's in flight in a plans index. See references/depth-first-decomposition.md for sizing, and references/plans.md for the PLANS.md + docs/exec-plans/ format and the OpenSpec-vs-exec-plan decision.
5. Golden principles + GC loop — encode taste and run a background refactor agent to counter compounding drift. See references/golden-principles.md.
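As one example of the mechanical checks in step 3, here is a hedged sketch of an ArchUnit-style fitness function written as a pytest test. The rule enforced (`core` must not import from `api`) and the paths are illustrative assumptions, not from the source.

```python
"""Architecture fitness function sketch: core/ must not import from api/.

The package names are hypothetical; adapt the rule to your own module boundaries.
"""
import ast
from pathlib import Path

FORBIDDEN_PREFIX = "api"        # layer that core/ must never depend on
CORE_DIR = Path("src/core")     # adjust to your layout

def imported_modules(path: Path) -> set[str]:
    """Collect every module name imported by a Python file."""
    tree = ast.parse(path.read_text())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names

def test_core_does_not_import_api():
    violations = [
        f"{path}: imports {mod}"
        for path in CORE_DIR.rglob("*.py")
        for mod in imported_modules(path)
        if mod == FORBIDDEN_PREFIX or mod.startswith(FORBIDDEN_PREFIX + ".")
    ]
    assert not violations, "Module boundary violated:\n" + "\n".join(violations)
```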
Recommended Repo Layout
A harness-ready repo typically settles into this shape. Root-level .md files are always-on or frequently-loaded context; docs/ subtrees are on-demand reference the agent reads when a task pulls it in.
```
AGENTS.md           # always-on preamble (see references/agents.md)
ARCHITECTURE.md     # on-demand map + invariants (see references/architecture.md)
PLANS.md            # index of active exec plans (see references/plans.md)
DESIGN.md           # how we design — taste for non-obvious calls
FRONTEND.md         # UI conventions (only if the repo has a frontend)
PRODUCT_SENSE.md    # product bar, user-visible quality rules
QUALITY_SCORE.md    # how we grade PRs / output
RELIABILITY.md      # error budgets, on-call, failure-mode posture
SECURITY.md         # threat model + non-negotiables
docs/
├── design-docs/          # durable design rationale; paired with exec-plans
│   ├── index.md
│   └── <topic>.md
├── exec-plans/
│   ├── active/           # one file per in-flight plan
│   ├── completed/        # archived when done; preserves decision trail
│   └── tech-debt-tracker.md
├── generated/            # machine-generated refs (db-schema, API shape) — never hand-edit
├── product-specs/
│   ├── index.md
│   └── <feature>.md
└── references/           # third-party *-llms.txt dumps (design-system, uv, nixpacks, …)
```
What each root file answers:
| File | Question the agent needs answered |
|---|---|
| AGENTS.md | What should I remember on every task in this repo? |
| ARCHITECTURE.md | What's the shape, and what invariants must I not break? |
| DESIGN.md | When the code could go two ways, which does this team prefer and why? |
| FRONTEND.md | What UI primitives / patterns exist; what's banned? |
| PLANS.md | What are we currently trying to ship, and what's next? |
| PRODUCT_SENSE.md | What makes a change user-worthy here? |
| QUALITY_SCORE.md | How will this PR be judged? |
| RELIABILITY.md | What happens when this breaks in prod; what's my budget? |
| SECURITY.md | What must I never do, regardless of how convenient? |
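A cheap sensor can keep the harness itself honest. The sketch below reuses this guide's own recommendations (the always-on root docs and the ~100-line AGENTS.md split point); the exact required set and budget are assumptions to adapt.

```python
"""Harness self-check sketch: required root docs exist and AGENTS.md stays small.

The 100-line budget mirrors the split point recommended above; adjust to taste.
"""
from pathlib import Path

REQUIRED = ["AGENTS.md", "ARCHITECTURE.md", "PLANS.md"]  # minimum set for this sketch
AGENTS_LINE_BUDGET = 100

def test_required_root_docs_exist():
    missing = [name for name in REQUIRED if not Path(name).exists()]
    assert not missing, f"Missing root docs: {missing}"

def test_agents_md_stays_within_budget():
    lines = Path("AGENTS.md").read_text().splitlines()
    assert len(lines) <= AGENTS_LINE_BUDGET, (
        f"AGENTS.md is {len(lines)} lines; split detail into skills (budget: {AGENTS_LINE_BUDGET})"
    )
```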
Per-file authoring guides (in references/):
Available now: agents.md, architecture.md, plans.md.
Not yet written: design.md, frontend.md, product-sense.md, quality-score.md, reliability.md, security.md — add each one when you've observed agents getting that doc wrong, following Hashimoto's rule: the harness grows from observed failures, not speculation.
Not every repo needs every file. FRONTEND.md is pointless without a frontend; RELIABILITY.md is noise for a CLI tool. But AGENTS.md and ARCHITECTURE.md are mandatory for any repo with more than trivial agent activity.
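For calibration, a minimal AGENTS.md might look like the sketch below. Every line of it is illustrative: the commands, paths, and rules are assumptions standing in for your repo's real ones; the point is the size and shape, not the specifics.

```markdown
# AGENTS.md (illustrative sketch)

## Build & test
- `make check` runs lint + typecheck + unit tests; it must pass before you declare a task done.

## Conventions
- New modules live under `src/<domain>/`; never add top-level packages.
- Errors are returned, not thrown; see ARCHITECTURE.md for the error model.

## When in doubt
- Read PLANS.md to see what is in flight before starting new work.
- For UI work, load FRONTEND.md; for anything touching auth, load SECURITY.md.
```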
Red Flags
- Agent PRs are large and require deep human review → decomposition missing, sensors too weak
- Same mistake appears across unrelated PRs → guide missing, promote the lesson into AGENTS.md / skill
- AGENTS.md has grown past ~100 lines and is full of "if X then Y" → move into skills with progressive disclosure
- You keep adding MCP tools "just in case" → tool descriptions are burning context on every turn
- Background quality is dropping but no single PR caused it → add a drift-detection / golden-principles sweep (a rough sketch follows this list)
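For that last flag, drift can often be made visible with a scheduled computational sweep before reaching for an LLM reviewer. A rough sketch, with hypothetical paths, metrics, and thresholds chosen purely for illustration:

```python
"""Drift sweep sketch: crude repo-level metrics that tend to grow with agent drift.

Thresholds and metrics are illustrative; track the numbers over time rather than
treating any single value as truth.
"""
from collections import Counter
from pathlib import Path

SRC = Path("src")
FILE_LINE_BUDGET = 500  # hypothetical ceiling before a file "needs splitting"

def oversized_files() -> list[str]:
    return [
        str(p) for p in SRC.rglob("*.py")
        if len(p.read_text().splitlines()) > FILE_LINE_BUDGET
    ]

def duplicated_function_names() -> list[str]:
    """The same function name defined in many files often means duplicated utilities."""
    counts = Counter()
    for path in SRC.rglob("*.py"):
        for line in path.read_text().splitlines():
            if line.lstrip().startswith("def "):
                counts[line.split("(")[0].split()[-1]] += 1
    return [name for name, n in counts.items() if n >= 3]

if __name__ == "__main__":
    print("Oversized files:", oversized_files())
    print("Possibly duplicated utilities:", duplicated_function_names())
```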
Common Mistakes
- Treating AGENTS.md as a dumping ground. It's injected into every prompt. Keep it tight; push detail into on-demand skills.
- Only feedforward, no feedback. Docs alone don't catch the agent's 1% failures. Pair every important rule with a sensor.
- Only feedback, no feedforward. Letting the agent fail and retry burns tokens and time. Prevent what you can cheaply.
- Human-readable sensor output. Linter messages should include correction instructions the agent can act on, not just human-targeted prose (see the sketch after this list).
- Ignoring compounding drift. Without a GC loop, each PR adds a little mess; in a month the repo is unrecognizable.
- Speculative configuration. Installing skills, MCP servers, or hooks "just in case" before observing a real failure bloats context and hides real signal. Build the harness from observed mistakes (Hashimoto), not from imagined ones.
- Ignoring context rot. Long sessions degrade even before hitting hard limits. Compaction, sub-agent delegation, and aggressive pruning are maintenance, not optimization.
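On the sensor-output point, the difference is mostly in how the message is phrased. A sketch built around a hypothetical custom lint rule; the message template, not the rule, is the part worth copying:

```python
"""Sketch of agent-actionable sensor output for a hypothetical custom lint rule."""
from dataclasses import dataclass

@dataclass
class Finding:
    path: str
    line: int
    rule: str
    detail: str
    fix: str  # the correction instruction the agent should follow

def format_for_agent(f: Finding) -> str:
    # State what is wrong, where, and exactly what to do about it.
    return (
        f"{f.path}:{f.line} [{f.rule}] {f.detail}\n"
        f"  FIX: {f.fix}\n"
        f"  Then re-run the linter and do not proceed until this finding is gone."
    )

# Usage: a human-oriented "unused import" message becomes an executable instruction.
print(format_for_agent(Finding(
    path="src/core/orders.py", line=3, rule="no-unused-import",
    detail="`json` is imported but never used.",
    fix="Delete the `import json` line; do not suppress the warning.",
)))
```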
The Bottom Line
Speed at agent scale comes from constraints, not freedom. A good harness makes the boring right thing easy and the drifty wrong thing impossible — so humans spend their attention on the decisions only humans can make.