# Cavekit Methodology

## Core Principle: Specify Before Building
Always define what you want before telling agents how to build it. Go through a cavekit stage — never jump straight from raw requirements to implementation.
Cavekit is a methodology for building software with AI coding agents that puts kits at the center of the development process — code is derived from them, not the other way around. Whether starting from scratch or modernizing an existing system, the principle is the same:
- Greenfield projects: reference material → kits → code
- Rewrites: old code → kits → new code
In both cases, the kits become a living contract that agents consume to continuously build, validate, and refine the application.
## Why Kits Are the First-Class Citizen
| Property | Benefit |
|---|---|
| Structured | Organized as a navigable tree, enabling agents to load only what they need |
| Human-legible | Engineers can audit requirements at a higher level than code |
| Stack-independent | Decoupled from any single framework or language |
| Independently evolvable | Kits can be refined without touching implementation |
| Verifiable | Every requirement includes acceptance criteria agents can check |
**Key Insight:** Well-written kits with strong validation make your application reproducible — any agent can rebuild it from the kits alone. Think of it as continuous regeneration.
## The Scientific Method Analogy
LLMs are inherently non-deterministic — like running an experiment, each individual call may yield different results. But through the right methodology — clear hypotheses, controlled conditions, and repeated trials — we extract reliable, reproducible outcomes from a stochastic process.
Cavekit applies the scientific method to software construction — hypothesize, test, observe, refine.
| Layer | Analogy | What It Does |
|---|---|---|
| LLM calls | Individual experiments | Each run may produce different results; no single output is authoritative |
| Kits | Hypotheses | Define what you expect to observe — the predicted behavior |
| Validation gates | Controlled conditions | Ensure reproducibility by constraining what counts as a valid outcome |
| Convergence loops | Repeated trials | Build statistical confidence through successive passes |
| Implementation tracking | Lab notebook | Record what was tried, what worked, and what failed |
| Revision | Revising the hypothesis | When results contradict expectations, update the theory upstream |
The outcome: a disciplined, repeatable engineering process layered on top of probabilistic generation.
## The 5 Hunt Phases

The Hunt is the five-phase lifecycle: Draft, Architect, Build, Inspect, Monitor. Each phase has dedicated prompts that drive it.
| Phase | Input | Output | AI Role | Human Role |
|---|---|---|---|---|
| Draft | Source materials, domain knowledge, existing systems | Implementation-agnostic kits | Extract requirements, structure knowledge | Verify kits capture intent accurately |
| Architect | Kits + framework research | Framework-specific implementation plans | Design architecture, break down work, order steps | Approve architectural choices |
| Build | Plans + kits | Working code + tests + tracking docs | Write code, run tests, check against kits | Watch for drift and blockers |
| Inspect | Failed validations, gaps, manual fixes | Updated kits/plans via revision | Identify root causes, propagate fixes upstream | Evaluate outcomes, set priorities |
| Monitor | Running application, git history | Issues, anomalies, progress reports | Scan for regressions, surface metrics | Interpret reports, guide next steps |
### Phase Transitions
Each phase has gate conditions that must be met before moving to the next:
- Draft → Architect: All domains have kits with testable acceptance criteria. Human has reviewed for completeness.
- Architect → Build: Plans reference kits, define implementation sequence, and include test strategies. Architecture decisions validated.
- Build → Inspect: Code builds, tests pass at current coverage level, implementation tracking is up to date.
- Inspect → Monitor: Convergence detected (changes decreasing iteration-over-iteration). Remaining changes are trivial.
- Monitor → Draft (cycle): Gap found or new requirement identified. Revise kits and restart the cycle.
The Inspect phase is where the human serves as reviewer and decision-maker, not hands-on coder. You monitor the process, request changes as needed, and make systemic improvements to kits and prompts.
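The gate conditions above lend themselves to mechanical checks an agent can run before advancing. A minimal sketch in Python; the `PhaseState` fields and function names are illustrative assumptions, not part of Cavekit itself:

```python
from dataclasses import dataclass


@dataclass
class PhaseState:
    # Hypothetical snapshot of project state; real gates would derive
    # these flags from kits, test runs, and tracking documents.
    kits_have_acceptance_criteria: bool
    human_reviewed: bool
    tests_pass: bool
    tracking_up_to_date: bool


def gate_draft_to_architect(s: PhaseState) -> bool:
    # Draft -> Architect: domains have testable criteria and a human signed off.
    return s.kits_have_acceptance_criteria and s.human_reviewed


def gate_build_to_inspect(s: PhaseState) -> bool:
    # Build -> Inspect: code passes tests and tracking docs are current.
    return s.tests_pass and s.tracking_up_to_date
```

Encoding gates as predicates keeps the decision auditable: the agent can report exactly which flag blocked a transition instead of judging readiness informally.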
For the full Hunt phase reference, see `references/hunt-phases.md`.
## Decision Matrix: When to Use Cavekit

### Full Cavekit
Use when the project has significant scope, evolving requirements, or needs autonomous agent execution.
| Indicator | Threshold |
|---|---|
| Codebase size | 50+ source files |
| Requirements | Evolving, multi-domain |
| Agent coordination | Multi-agent or multi-prompt pipelines |
| Environment | Production, security-sensitive, brownfield |
| Team structure | Multi-team or cross-team |
| Execution mode | Long-running autonomous work (overnight, unattended) |
What you get: Full Hunt lifecycle, context directory with kits/plans/impl tracking, prompt pipeline, convergence loops, revision, validation gates.
### Lightweight Cavekit
Use when scope is moderate — too complex for ad-hoc but not worth a full pipeline.
| Indicator | Threshold |
|---|---|
| Codebase size | 5-50 files |
| Requirements | Mostly clear, focused |
| Agent coordination | Single agent, possibly with sub-agents |
| Execution mode | Interactive with occasional iteration loops |
What you do:

- Write a focused `context/kits/cavekit-task.md` capturing requirements
- Add a `context/plans/plan-task.md` sequencing the implementation
- Skip the full Hunt — just run an iteration loop against the plan
This is the "Cavekit floor" — most of the benefit without the overhead of a full multi-phase pipeline.
### Skip Cavekit
Use when the task is trivially small.
| Indicator | Threshold |
|---|---|
| Codebase size | Less than 5 files |
| Task type | One-off tools, simple bug fixes, exploratory prototypes |
| Implementation | Fits comfortably in one agent session without needing external references |
Heuristic: If the whole task fits in one context window with room to spare, full Cavekit adds more overhead than value.
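The sizing heuristic can be captured in a few lines. The function name and the handling of the fuzzy 50-file boundary are assumptions layered on the thresholds in the tables above:

```python
def cavekit_tier(source_file_count: int) -> str:
    """Map codebase size to a Cavekit tier using the thresholds above.

    Size is only one indicator; evolving requirements, multi-agent
    coordination, or unattended execution can bump a project up a tier.
    """
    if source_file_count < 5:
        return "skip"          # trivially small: one agent session suffices
    if source_file_count < 50:
        return "lightweight"   # one kit + one plan + an iteration loop
    return "full"              # full Hunt lifecycle and prompt pipeline
```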
### Growth Path
Start with lightweight Cavekit even if the project is small. If the scope expands, you already have the structure in place to scale up. It is much harder to retrofit kits onto a large codebase than to grow a cavekit directory from the beginning.
## The CI Pipeline Analogy
Cavekit mirrors a build pipeline — each stage transforms input into validated output, with feedback loops that propagate corrections upstream:
Traditional CI/CD:

```
Code → Build → Test → Deploy
```

Cavekit AI Pipeline:

```
Cavekit Change
  → Generate Plans (iteration loop)
  → Generate Implementation (iteration loop)
  → Validate (Tests + Review)
  → Human Audit (Monitor & Steer)
      → [Gap Found] → Revise → Cavekit Change (cycle repeats)
```
Every stage can run as an iteration loop — the same prompt executed repeatedly until output stabilizes. The iteration loop is what transforms nondeterministic LLM output into predictable, validated software.
## The Iteration Loop
The iteration loop is the fundamental execution unit in Cavekit. Execute the same prompt against the same codebase multiple times until the delta between runs approaches zero.
Mechanics:

1. Execute a prompt against the current codebase
2. The agent inspects git history and tracking documents to understand what has already been done
3. The agent applies changes and commits its progress
4. Return to step 1
Convergence signal: A shrinking volume of modifications across successive passes — the diff gets smaller each time until only cosmetic changes remain. You are looking for diminishing returns, not absolute zero.
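The loop and its convergence check can be sketched as a small driver. Here `run_prompt` is a stand-in for whatever harness executes the agent and returns the number of lines it changed; the `window` and `floor` thresholds are illustrative, not prescribed by Cavekit:

```python
def converged(diff_sizes, window=3, floor=5):
    # Diminishing returns: the last `window` passes each touched fewer
    # than `floor` lines. We look for a plateau, not absolute zero.
    recent = diff_sizes[-window:]
    return len(recent) == window and all(n < floor for n in recent)


def iteration_loop(run_prompt, max_passes=10):
    # run_prompt() executes the prompt once against the current codebase;
    # the agent reads git history and tracking docs, applies changes, and
    # commits. We record how large each pass's diff was.
    sizes = []
    for _ in range(max_passes):
        sizes.append(run_prompt())
        if converged(sizes):
            break
    return sizes
```

With simulated diff sizes of 120, 40, 12, 3, 2, 1 the driver stops after the sixth pass, once three consecutive diffs fall under the floor.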
If the diff is not shrinking between runs, the problem is upstream — fix the inputs (specs, validation, coordination), not the iteration count. Common causes:

- Kits are ambiguous (agents interpret them differently each time)
- Validation criteria are too loose (the agent has no way to confirm it got things right)
- Multiple agents are overwriting each other's work (ownership boundaries are unclear)
## Cross-References to Sub-Skills
Cavekit is composed of techniques that work together. This methodology skill is the index — each sub-skill below is self-contained but cross-references others.
### Foundation Skills

| Skill | Purpose | When to Use |
|---|---|---|
| `ck:cavekit-writing` | Write implementation-agnostic kits with testable acceptance criteria | Draft phase — always the first step |
| `ck:context-architecture` | Organize context for progressive disclosure | Project setup and ongoing maintenance |
| `ck:impl-tracking` | Track implementation progress, dead ends, test health | Build and Inspect phases |
| `ck:validation-first` | Design validation gates agents can execute | All phases — validation is continuous |
### Pipeline Skills

| Skill | Purpose | When to Use |
|---|---|---|
| `ck:prompt-pipeline` | Design numbered prompt pipelines for the Hunt | Setting up automation |
| `ck:revision` | Trace bugs back to kits and fix at the source | Inspect phase — after finding gaps |
| `cavekit:brownfield-adoption` | Adopt Cavekit on existing codebases | Starting Cavekit on legacy projects |
### Advanced Skills

| Skill | Purpose | When to Use |
|---|---|---|
| `ck:peer-review` | Use a second agent to challenge the first | Quality gates, architecture review |
| `cavekit:speculative-pipeline` | Stagger pipeline stages for parallelism | Optimizing long pipelines |
| `ck:convergence-monitoring` | Detect convergence vs ceiling | Monitoring iteration loops |
| `cavekit:documentation-inversion` | Turn documentation into agent-consumable skills | Library/module documentation |
## Integration with Existing Skills

Cavekit works with existing skills, not as a replacement:

| Existing Skill | Cavekit Integration |
|---|---|
| `superpowers:brainstorming` | Use during cavekit generation to explore requirements |
| `superpowers:writing-plans` | Use during plan generation for structured planning |
| `superpowers:test-driven-development` | TDD-within-Cavekit: cavekit acceptance criteria become failing tests |
| `superpowers:verification-before-completion` | Use for gate validation in every phase |
| `superpowers:executing-plans` | Use during implementation phase |
| `superpowers:dispatching-parallel-agents` | Use for agent team coordination |
## Quick Start

### For a New Project (Greenfield)
1. Set up the context directory:

   ```
   context/
   ├── refs/     # Source materials (PRDs, language specs, research)
   ├── kits/     # Implementation-agnostic kits
   ├── plans/    # Framework-specific implementation plans
   ├── impl/     # Living implementation tracking
   └── prompts/  # Hunt pipeline prompts
   ```

2. Write kits from your reference materials (see `ck:cavekit-writing`)
3. Generate plans from kits (see `ck:prompt-pipeline`)
4. Implement with validation gates (see `ck:validation-first`)
5. Track progress in implementation documents (see `ck:impl-tracking`)
6. Iterate — when gaps are found, revise kits (see `ck:revision`)
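Step 1 is easy to script. A sketch using only the standard library; the directory names come from the layout above, and `scaffold_context` is a hypothetical helper name:

```python
from pathlib import Path


def scaffold_context(root="context"):
    # Create the five context subdirectories; idempotent on re-run.
    subdirs = ["refs", "kits", "plans", "impl", "prompts"]
    created = []
    for name in subdirs:
        d = Path(root) / name
        d.mkdir(parents=True, exist_ok=True)
        created.append(d)
    return created
```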
### For an Existing Project (Brownfield)

1. Set up the context directory (same structure as above)
2. Designate the existing codebase as reference material
3. Generate kits from code (see `cavekit:brownfield-adoption`)
4. Validate kits match behavior — run tests against generated kits
5. Proceed with the normal Hunt — future changes flow through kits first
## Summary
Cavekit is not a tool — it is a methodology. The core loop is simple:
- Describe what you want (kits with testable criteria)
- Let agents build it (plans → implementation → validation)
- Fix the kits, not the code (revision)
- Repeat until converged (iteration loops)
Agents become more capable the more precisely you constrain them — clear kits, automated validation, and structured iteration loops let them operate with increasing autonomy. None of this eliminates the need for software engineers. Your judgment on architecture, your ability to write precise kits, and your instinct for what "done" looks like are the inputs that make the whole system function. Cavekit is a force multiplier: one engineer's clarity of thought, scaled across an entire implementation pipeline.