ghost
Overview
Generate a ghost package (spec + tests + install prompt) from an existing repo.
Preserve behavior, not prose:
- `tests.yaml` is the behavior contract (operation cases and/or scenarios)
- source tests and/or captured traces are the primary evidence
- code/docs/examples only fill gaps (never contradict evidence)
The output is language-agnostic so it can be implemented in any target language or harness.
Scenario testing frame (for agentic systems / tool loops):
- Given an initial world state + tool surface + user goal
- When the agent runs under realistic constraints and noise
- Then it reaches an acceptable outcome without violating invariants (safety, security, cost, latency, policy)
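Sketched as data, one such scenario might look like this (the tool names, goal, and oracle fields here are hypothetical illustrations, not a fixed schema):

```yaml
# Hypothetical scenario case; every field name is illustrative only.
scenario_id: refund.create_ticket_with_guardrails
initial_state:
  orders: [{id: "o-1", status: "delivered", total: 42.50}]
tools:
  - name: orders.lookup      # read-only
  - name: tickets.create     # side effect; requires confirmation
goal: "Refund order o-1 and open a support ticket."
dynamics:
  tickets.create: {first_call: {error: "timeout"}, then: {ok: true}}
oracles:
  final_state: {tickets: 1}
  trace_invariants:
    - no_side_effect_before_confirmation
    - max_steps: 12
```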
Fit / limitations
This approach works best when the system’s behavior can be expressed as deterministic data:
- pure-ish operations (input -> output or error)
- a runnable test suite covering the public API
It also works for agentic systems when behavior can be expressed as controlled, replayable scenarios:
- a tool sandbox (stubs/record-replay/simulator)
- machine-checkable oracles (state assertions + trace invariants)
- a deterministic debug mode plus a production-like reliability mode (pass rates)
It gets harder (but is still possible) when the contract depends on time, randomness, IO, concurrency, global state, or platform details. In those cases, make assumptions explicit in SPEC.md + VERIFY.md, and normalize nondeterminism into explicit inputs/outputs.
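For example, "no implicit now" usually means lifting the clock into an explicit input so the contract can pin it. A minimal sketch (the function names are invented for illustration):

```python
from datetime import datetime, timezone

# Nondeterministic: depends on the wall clock at call time,
# so no fixed expected output can be written for it.
def is_expired_implicit(deadline_iso: str) -> bool:
    return datetime.now(timezone.utc) > datetime.fromisoformat(deadline_iso)

# Deterministic: "now" is an explicit input, so a test case can fix it.
def is_expired(deadline_iso: str, now_iso: str) -> bool:
    return datetime.fromisoformat(now_iso) > datetime.fromisoformat(deadline_iso)

assert is_expired("2024-01-01T00:00:00+00:00", "2024-06-01T00:00:00+00:00") is True
assert is_expired("2024-01-01T00:00:00+00:00", "2023-06-01T00:00:00+00:00") is False
```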
Hard rules (MUST / MUST NOT)
- MUST treat upstream tests (and for agentic systems: captured traces/eval runs) as authoritative; if docs/examples disagree, prefer evidence and record the discrepancy.
- MUST normalize nondeterminism in the environment/tool surface into explicit inputs/outputs (no implicit "now", random seeds, locale surprises, unordered iteration).
- MUST make model/agent stochasticity explicit and test it as reliability: gate on pass rates + invariant-violation-free runs (not exact-text goldens).
- MUST keep the ghost repo language-agnostic: ship no implementation code, adapter runner, or build tooling.
- MUST paraphrase upstream docs; do not copy text verbatim.
- MUST preserve upstream license files verbatim as `LICENSE*`.
- MUST produce a verification signal and document it in `VERIFY.md` (adapter runner preferred; sampling fallback allowed).
- MUST document provenance and regeneration in `VERIFY.md` (upstream repo + revision, how artifacts were produced, and how to rerun verification).
- MUST choose a `tests.yaml` contract shape that matches the system style (functional API vs protocol/CLI vs scenario) and keep it consistent across `SPEC.md`, `INSTALL.md`, and `VERIFY.md`.
- MUST document the `tests.yaml` harness schema when it is non-trivial (callbacks, mutation steps, warnings, multi-step protocol setup, etc.).
  - Recommended artifact: `TESTS_SCHEMA.md`. `INSTALL.md` MUST reference it when present.
- MUST minimize `skip` cases; only skip when deterministic setup is currently infeasible, and record why.
- MUST assert stable machine-interface fields explicitly (required keys, lengths/counts, and state effects), not only loose partial matches.
- MUST treat human-readable warning/error messages as unstable unless tests prove they are part of the public contract.
  - Prefer structured fields (codes) or substring assertions for message checks.
- MUST capture cross-operation state transitions when behavior depends on prior calls (for example session, instance, history, or tool-loop continuity).
- MUST include executable end-to-end loop coverage for each primary stateful workflow (for example create -> act -> persist -> follow-up) with explicit pre/post state assertions.
- MUST treat a stateful workflow as incomplete if only isolated operation cases exist; add scenario coverage in `tests.yaml` and verification proof before calling extraction done.
- MUST include trace-level invariants for agentic scenarios (for example permission boundaries, confirmation-before-side-effects, injection resistance, budget/step limits).
- MUST prefer oracles that score behavior via state + trace (tool calls, side effects) over brittle final-text matching.
- MUST produce a machine-checkable evidence bundle under `verification/evidence/` and fail extraction unless it passes `uv run --with pyyaml -- python scripts/verify_evidence.py --bundle <ghost-repo>/verification/evidence`.
- MUST keep `verification/evidence/inventory.json` synchronized with `tests.yaml`: `public_operations` must match non-workflow operation ids and `primary_workflows` must match workflow/scenario ids (`coverage_mode` defaults to `exhaustive`; when `sampled`, include `sampled_case_ids`).
- MUST ensure every required case id appears in `traceability.csv` and has at least one baseline (`mutated=false`) `pass` row in `adapter_results.jsonl` (all `tests.yaml` cases for `exhaustive`; `inventory.json.sampled_case_ids` for `sampled`).
- MUST enforce fail-closed verification thresholds: 100% mapped public operations, 100% mapped primary workflows, and 100% mapped required case ids (all tests for `exhaustive`; sampled ids for `sampled`), plus mutation sensitivity and independent regeneration parity passes.
- MUST declare verification coverage mode in `VERIFY.md`: default `exhaustive`; `sampled` is allowed only when full adapter execution is infeasible and must list sampled case ids plus rationale (including `inventory.json.sampled_case_ids`).
- MUST define and enforce conformance profiles in generated artifacts: `Core Conformance`, `Extension Conformance`, and `Real Integration Profile`.
- MUST include `Conformance Profile`, `Validation Matrix`, and `Definition of Done` sections in `SPEC.md`.
- MUST include `Summary`, `Regenerate`, `Validation Matrix`, `Traceability Matrix`, `Mutation Sensitivity`, `Regeneration Parity`, and `Limitations` sections in `VERIFY.md`.
- MUST include typed failure classes for extraction/verification failures (for example missing artifacts, parse failures, and contract mismatches).
- MUST require stateful/scenario ghost specs to include lifecycle structure sections in `SPEC.md`: `State Model`, `Transition Triggers`, `Recovery/Idempotency`, and `Reference Algorithm`.
- MUST run the evidence verifier in strict mode by default; legacy bypass is break-glass only (`--legacy-allow --legacy-reason "<rationale>"`) and never default.
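The inventory-synchronization rule above is mechanically checkable; here is a sketch of such a check (the dict shapes mirror the artifacts described here, but the helper itself is hypothetical, not part of any shipped verifier):

```python
def check_inventory_sync(inventory: dict, tests: dict) -> list:
    """Return mismatch messages between inventory.json and tests.yaml ids."""
    errors = []
    ops_in_tests = set(tests.get("operations", {}))
    workflows_in_tests = set(tests.get("workflows", {}))
    if set(inventory.get("public_operations", [])) != ops_in_tests:
        errors.append("public_operations out of sync")
    if set(inventory.get("primary_workflows", [])) != workflows_in_tests:
        errors.append("primary_workflows out of sync")
    mode = inventory.get("coverage_mode", "exhaustive")  # default per the rules
    if mode == "sampled" and not inventory.get("sampled_case_ids"):
        errors.append("sampled mode requires sampled_case_ids")
    return errors

inv = {"public_operations": ["module.foo"], "primary_workflows": ["wf.loop"]}
tests = {"operations": {"module.foo": []}, "workflows": {"wf.loop": []}}
assert check_inventory_sync(inv, tests) == []
```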
Conformance profiles (required)
Core Conformance:
- deterministic contract extraction requirements that every ghost package must satisfy
- strict evidence gates and fail-closed verification

Extension Conformance:
- optional behaviors implemented by an extraction for stronger fidelity or ergonomics
- must be explicitly labeled as optional and tested if claimed

Real Integration Profile:
- environment-dependent checks that validate production-like behavior
- may be skipped only with explicit rationale in `VERIFY.md`

Profile usage rules:
- `SPEC.md` and `VERIFY.md` must state which profile each validation requirement belongs to.
- `Validation Matrix` and `Definition of Done` must align with the selected profile labels.
- Stateful/scenario workflows must include lifecycle sections regardless of profile.
Inputs
- Source repo path (git working tree)
- Output repo name/location (default: sibling directory `<repo-name>-ghost`)
- Upstream identity + revision (remote URL if available; tag/commit SHA)
- Public surface if ambiguous:
- library: functions/classes/modules
- agentic system: tool names/schemas, permissions, and side-effect boundaries
- Source language/runtime + how to run upstream tests
- Any required runtime assumptions (timezone, locale, units, encoding)
For scenario-heavy (agentic) extractions, also collect:
- scenario catalog (top user goals + failure modes)
- tool error/latency behaviors (timeouts, 500s, malformed payloads)
- explicit invariants (security, safety, cost, latency, policy)
Conventions
Operation ids
tests.yaml organizes cases by operation ids (stable identifiers for public API entries). Use a naming scheme that survives translation across languages:
- `foo` (top-level function)
- `module.foo` (namespaced function)
- `Class#method` (instance method)
- `Class.method` (static/class method)
Avoid language-specific spellings in ids (e.g., avoid snake_case vs camelCase wars). Prefer the canonical name used by the source library’s docs.
For agentic scenario suites, operation ids SHOULD match tool names as the agent sees them (e.g. `orders.lookup`, `tickets.create`).
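One way to keep ids translation-stable is a small lint pass over the id grammar; this checker is a hypothetical sketch matching the four forms above, not a prescribed tool:

```python
import re

# Accepts: foo, module.foo, Class#method, Class.method (segments joined by . or #).
ID_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*([.#][A-Za-z_][A-Za-z0-9_]*)*$")

def is_valid_operation_id(op_id: str) -> bool:
    # At most one '#' so ids stay in the Class#method shape.
    return bool(ID_RE.match(op_id)) and op_id.count("#") <= 1

assert is_valid_operation_id("module.foo")
assert is_valid_operation_id("Class#method")
assert not is_valid_operation_id("foo bar")
```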
Scenario ids
When using scenario testing, keep scenario ids stable and descriptive:
- `refund.create_ticket_with_guardrails`
- `calendar.reschedule_with_rate_limits`
- `security.prompt_injection_from_tool_output`
Case ids
Every executable case SHOULD carry a stable case_id and use it as the primary key across evidence artifacts.
- Prefer `<operation-id>.<behavior>` for operation cases.
- For single-case workflow/scenario targets, reusing the workflow/scenario id as `case_id` is acceptable.
- `traceability.csv` and `adapter_results.jsonl` MUST use the same `case_id` tokens.
Contract shape
Pick one schema and stay consistent:
- Functional API layout: operation ids at top-level with `{name, input, output|error}` cases.
- Protocol/CLI layout: top-level `meta` + `operations`, where operation ids live under `operations` and cases include command/state assertions.
- Scenario layout (agentic systems): top-level `meta` + `scenarios`, where scenario ids live under `scenarios` and each scenario defines environment + tools + goal + oracles.
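A functional-layout fragment, for instance, might look like this (the operation id, fields, and values are invented for illustration, not harvested from any real repo):

```yaml
version: "2.3.1"            # upstream release the cases were harvested from
module.parse_date:
  - case_id: module.parse_date.basic_iso
    name: parses an ISO date
    input: {text: "2024-06-01"}
    output: {year: 2024, month: 6, day: 1}
  - case_id: module.parse_date.rejects_garbage
    name: rejects non-date text
    input: {text: "not-a-date"}
    error: true
```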
tests.yaml version
tests.yaml MUST include a source version identifier that ties cases to upstream evidence.
- If the upstream library has a release version (SemVer/tag), use it.
- Otherwise, use an immutable source revision identifier (e.g., `git:<short-sha>` or `git describe`).
- Functional layout: use top-level `version`.
- Protocol/CLI layout: keep `meta.version` for test schema version and include `meta.source_version` for upstream evidence version.
- Scenario layout: keep `meta.version` for schema version and include `meta.source_version` for upstream evidence version.
Workflow (tests-first)
0) Define scope and contract
- Write a one-line problem statement naming the upstream repo/revision and target ghost output path.
- Choose one `tests.yaml` layout (functional, protocol/CLI, or scenario) and keep it consistent across `SPEC.md`, `INSTALL.md`, and `VERIFY.md`.
- Set success criteria: deterministic cases for every public operation, executable loop coverage for primary stateful workflows, and a recorded verification signal in `VERIFY.md`.
For agentic systems, define success criteria as:
- critical scenarios expressed in a controlled tool sandbox
- hard oracles + trace-level invariants (no critical violations)
- reliability gates (pass rate thresholds) for production-like runs
1) Scope the source
- Locate the test suite(s), examples, and primary docs (README, API docs, docs site).
- Identify the public API and map each public operation to an operation id.
- Use export/visibility cues to confirm what’s public:
- JS/TS: package entrypoints + exports/re-exports
- Python: top-level module + `__all__`
- Rust: `pub` items re-exported from `lib.rs`
- Zig: `build.zig` module graph (`root_source_file`, `addModule`, `pub usingnamespace`) is source of truth; defaults are often `src/root.zig` (library) and `src/main.zig` (exe) but repos vary; treat C ABI `export` as public only if documented
- C/C++: installed public headers + exported symbols; include macros/constants only if documented as API
- Go: exported identifiers (Capitalized)
- Java/C#: `public` types/members in the target package/namespace
- Other: use the language's visibility/export mechanism + published package entrypoints
- Confirm which functions/classes are in scope:
- public API + tests covering it
- exclude internal helpers unless tests prove they are part of the contract
- Identify primary user-facing workflows (especially stateful loops) and map each workflow to required operation sequences and state boundaries.
For agentic systems:
- Identify the tool surface (names, schemas, permissions, rate limits).
- Identify the environment/state (what changes when tools are called).
- Identify invariants (safety/security/cost/latency/policy) that must hold across the full trace.
- Build a coverage matrix (functional, robustness, safety/security/abuse, cost/latency).
- Decide the output directory as a new sibling repo unless the user overrides.
2) Harvest behavior evidence
- Extract test cases and expected outputs (or scenario traces); treat evidence as authoritative.
- When tests are silent, read code/docs to infer behavior and record the inference.
- Note all boundary values, rounding rules, encoding rules, and error cases.
- If the API promises "copy"/"detached" behavior, harvest mutation-isolation evidence (including nested structure mutation, not just top-level fields).
- For stateful APIs, harvest continuity evidence across steps (persisted ids, history chains, context/tool carry-forward, and reset semantics).
- Normalize environment assumptions:
- eliminate dependency on current time (use explicit timestamps)
- force timezone/locale rules if relevant
- remove nondeterminism (random seeds, unordered iteration)
For scenario suites, also harvest:
- realistic tool failures (timeouts/500s/malformed JSON/partial results) and backoff/retry behavior
- prompt-injection-like tool outputs and required refusal/ignore behavior
- stop conditions (max steps, budget) and graceful halts
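A record-replay style stub is one way to make those tool failures deterministic; here is a sketch (the class, tool name, and payloads are invented for illustration):

```python
class ReplayTool:
    """Deterministic tool stub: replays a scripted sequence of responses."""
    def __init__(self, name, responses):
        self.name = name
        self._responses = list(responses)
        self.calls = []  # recorded trace, usable by oracles later

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        # Clamp to the last scripted response so extra calls stay deterministic.
        idx = min(len(self.calls) - 1, len(self._responses) - 1)
        status, payload = self._responses[idx]
        if status != 200:
            raise RuntimeError(f"{self.name} failed with HTTP {status}")
        return payload

# First call fails with a 500; the retry succeeds — identical every run.
lookup = ReplayTool("orders.lookup", [(500, None), (200, {"id": "o-1", "status": "delivered"})])
try:
    lookup(order_id="o-1")
except RuntimeError:
    pass
assert lookup(order_id="o-1") == {"id": "o-1", "status": "delivered"}
assert len(lookup.calls) == 2
```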
3) Write SPEC.md (strict, language-agnostic)
- Include `Conformance Profile`, `Validation Matrix`, and `Definition of Done` sections.
- Describe types abstractly (number/string/object/timestamp/bytes/etc.).
- For bytes/buffers, define a canonical encoding (hex or base64) and use it consistently in `tests.yaml`.
- Define normalization rules (e.g., timestamp parsing, string trimming, unicode, case folding).
- Specify error behavior precisely (conditions), but keep the mechanism language-idiomatic.
- Include typed failure classes in the spec surface (machine-checkable failure names/codes where possible).
- Specify every public operation with inputs, outputs, rules, and edge cases.
- When an operation yields both a "prepared" value and a "persisted delta" (or similar), define the delta derivation mechanically (slice/filter/identity rules) and test it.
- Specify cross-operation invariants for primary workflows (state transitions, required ordering, and continuity guarantees).
- For scenarios, specify:
- state model and transition triggers
- recovery/idempotency behavior
- reference algorithm overview (language-agnostic)
- environment state model and reset semantics
- tool surface contracts (schemas, permissions, rate limits)
- invariants as explicit, testable rules (trace-level)
- Paraphrase source docs; do not copy text verbatim.
- Use `references/templates.md` for structure.
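A trace-level invariant such as "confirmation before side effects" can be stated as a plain predicate over the tool-call trace; the event shape and tool names below are assumptions for the sketch, not a prescribed format:

```python
SIDE_EFFECT_TOOLS = {"tickets.create", "orders.refund"}  # hypothetical tool names

def confirmed_before_side_effects(trace) -> bool:
    """Every side-effecting call must be preceded by a user confirmation event."""
    confirmed = False
    for event in trace:
        if event["type"] == "user_confirmation":
            confirmed = True
        elif event["type"] == "tool_call" and event["tool"] in SIDE_EFFECT_TOOLS:
            if not confirmed:
                return False
    return True

ok = [{"type": "user_confirmation"}, {"type": "tool_call", "tool": "tickets.create"}]
bad = [{"type": "tool_call", "tool": "orders.refund"}]
assert confirmed_before_side_effects(ok)
assert not confirmed_before_side_effects(bad)
```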
4) Generate tests.yaml (exhaustive)
- Convert each source test into a YAML case under its operation id.
- Include the source version identifier (`version` or `meta.source_version`).
- Schema is intentionally strict and portable; choose the contract shape from Conventions:
  - Functional layout:
    - each case has `name`, `input`, and a stable `case_id` (recommended)
    - each case has exactly one of `output` or `error: true`
  - Protocol/CLI layout:
    - top-level `meta` + `operations`
    - each case has `case_id`, `name`, `input`, and deterministic expected outcomes (for example `exit_code`, machine-readable stdout assertions, and state assertions)
  - keep to a portable YAML subset (no anchors/tags/binary) so it is easy to parse in many languages
  - quote ambiguous scalars (`yes`, `no`, `on`, `off`, `null`) to avoid parser disagreements
- Normalize inputs to deterministic values (avoid "now"; use explicit timestamps).
- Keep or improve coverage across all public operations and failure modes.
- Add scenario cases for primary stateful workflows so the contract proves end-to-end loop behavior, not only per-operation correctness.
- For agentic systems, prefer the scenario layout and define each scenario as:
- initial state (what the agent knows + world state)
- tool sandbox (stubs/record-replay/simulator) and permissions
- dynamics (how the world responds to tool calls, including failures/delays)
- success criteria (final state and/or required tool side effects)
- oracles (hard assertions + trace invariants; optional rubric judge)
- Prefer exact/value-complete assertions for stable output fields; use partial assertions only when fields are intentionally volatile.
- If assertions use path lookups, define path resolver semantics in `TESTS_SCHEMA.md` (root object, dot segments, `[index]` arrays, and "missing path fails assertion").
- For warning/error message checks, prefer substring assertions unless the exact wording is itself part of the upstream contract.
- If `tests.yaml` includes harness directives beyond basic `{name, input, output|error}` (e.g. callbacks by label, mutation steps, warning sinks, setup scripts), document them in `TESTS_SCHEMA.md`.
- Keep `skip` rare; every skip must include a concrete reason and be accounted for in `VERIFY.md`.
- If the source returns floats, prefer defining stable rounding/formatting rules so `output` is exact.
- Follow the format in `references/templates.md`.
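The path resolver semantics mentioned above (root object, dot segments, `[index]` arrays, missing path fails) might be sketched as follows; this is one plausible reading, not a mandated implementation:

```python
import re

def resolve_path(root, path):
    """Resolve 'a.b[0].c' against root; raise KeyError when any segment is missing."""
    current = root
    for segment in path.split("."):
        m = re.fullmatch(r"(\w+)((?:\[\d+\])*)", segment)
        if not m:
            raise KeyError(f"bad segment: {segment!r}")
        key, indexes = m.group(1), re.findall(r"\[(\d+)\]", m.group(2))
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"missing path at {key!r}")
        current = current[key]
        for i in indexes:
            try:
                current = current[int(i)]
            except (IndexError, TypeError):
                raise KeyError(f"missing index [{i}] at {key!r}")
    return current

doc = {"result": {"items": [{"id": "a"}, {"id": "b"}]}}
assert resolve_path(doc, "result.items[1].id") == "b"
```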
5) Add INSTALL.md + README.md + VERIFY.md + LICENSE*
- `INSTALL.md`: a short prompt for implementing the library in any language, referencing `SPEC.md` and `tests.yaml`.
- `README.md`: explain what the ghost library is, list operations, and describe the included files.
- `TESTS_SCHEMA.md` (when needed): define the `tests.yaml` harness schema and any callback catalogs or side-effect capture requirements.
- `VERIFY.md`: describe provenance + how the ghost artifacts were produced and verified against the source library (adapter-first; sampling fallback).
  - include `Summary`, `Regenerate`, `Validation Matrix`, `Traceability Matrix`, `Mutation Sensitivity`, `Regeneration Parity`, and `Limitations` sections
  - include upstream repo identity + exact revision (tag or commit)
  - include the exact commands used to produce each artifact (or a single deterministic regeneration recipe)
  - include the exact commands used to run verification and the resulting pass/skip counts
  - include any environment normalization assumptions
  - include a summary of `verification/evidence/` and the verifier command/result
  - if legacy verifier bypass is used, include explicit break-glass rationale and follow-up remediation plan
- `LICENSE*`: preserve the upstream repo's license files verbatim.
  - copy common files like `LICENSE`, `LICENSE.md`, `COPYING*`
  - if no license file exists upstream, include a `LICENSE` file stating that no upstream license was found
6) Verify fidelity (must do)
- Ensure `tests.yaml` parses and case counts match or exceed the source tests covering the public API.
- Ensure every operation id has at least one executable (non-`skip`) case unless infeasible, and list any exceptions in `VERIFY.md`.
- Preferred: create a temporary adapter runner in the source language to run `tests.yaml` against the upstream system (library or agent).
  - if the source language has weak YAML tooling, parse YAML externally and dispatch into the library via a tiny CLI/FFI shim
  - assert expected outcomes match exactly (outputs/errors for functional layout; exit/status/payload/state assertions for protocol layout)
  - for stateful workflows, execute end-to-end loop scenarios and assert continuity/persistence effects across steps
  - delete the adapter afterward; do not ship it in the ghost repo
  - summarize how to run it (and results) in `VERIFY.md`
- Build a fail-closed evidence bundle in `verification/evidence/`:
  - `inventory.json` (public operations + primary workflows, including reset requirements; optional `coverage_mode`, and `sampled_case_ids` when `coverage_mode=sampled`)
  - `traceability.csv` (operation/workflow -> case ids -> proof artifact -> adapter run id)
  - `workflow_loops.json` (loop cases + continuity assertions + reset assertions when required)
  - `adapter_results.jsonl` (case-level results with `run_id`, `case_id`, `status`, and mutation marker)
  - `mutation_check.json` (required mutation count + detected failures + pass/fail)
  - `parity.json` (independent regeneration parity verdict + diff count)
- Run `uv run --with pyyaml -- python scripts/verify_evidence.py --bundle <ghost-repo>/verification/evidence`; non-zero exit means extraction is incomplete.
- Strict mode is default and fail-closed. Use `--legacy-allow --legacy-reason "<rationale>"` only for explicit manual break-glass migrations.
- For stochastic agentic systems:
  - run scenarios in two modes:
    - deterministic debug mode (stable tool outputs; fixed seed when possible)
    - production-like mode (real sampling settings)
  - run each critical scenario N times and record pass rate + cost/latency distributions
  - release gates: no critical invariant violations and pass rate meets threshold
- If a full adapter is infeasible:
  - run a representative sample across all operation ids (typical + boundary + error)
  - document the limitation clearly in `VERIFY.md`
- Use `references/verification.md` for a checklist and `VERIFY.md` template.
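The "baseline pass row per required case" gate can be checked with a short scan of adapter_results.jsonl; the record fields mirror those listed above, while the function itself is a sketch rather than the shipped verifier:

```python
import json

def missing_baseline_passes(required_case_ids, results_jsonl_text):
    """Return required case ids lacking a baseline (mutated=false) pass row."""
    passed = set()
    for line in results_jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        if row.get("status") == "pass" and not row.get("mutated", False):
            passed.add(row["case_id"])
    return sorted(set(required_case_ids) - passed)

rows = "\n".join([
    json.dumps({"run_id": "r1", "case_id": "module.foo.basic", "status": "pass", "mutated": False}),
    json.dumps({"run_id": "r1", "case_id": "module.foo.error", "status": "pass", "mutated": True}),
])
# The mutated pass does not count as baseline evidence, so foo.error is flagged.
assert missing_baseline_passes(["module.foo.basic", "module.foo.error"], rows) == ["module.foo.error"]
```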
Reproducibility and regen policy
- The ghost repo must be reproducible: a future developer should be able to point at the upstream revision and rerun the extraction + verification.
- Do not add regeneration scripts as tracked files unless the user explicitly asks; put the recipe in `VERIFY.md` instead.
Output
Produce only these artifacts in the ghost repo:
- `README.md`
- `SPEC.md`
- `tests.yaml`
- `TESTS_SCHEMA.md` (optional; include when tests.yaml has non-trivial harness semantics)
- `INSTALL.md`
- `VERIFY.md`
- `verification/evidence/inventory.json`
- `verification/evidence/traceability.csv`
- `verification/evidence/workflow_loops.json`
- `verification/evidence/adapter_results.jsonl`
- `verification/evidence/mutation_check.json`
- `verification/evidence/parity.json`
- `verification/evidence/structure_contract.json` (optional, recommended for explicit structure policy)
- `LICENSE*` (copied from upstream)
- `.gitignore` (optional, minimal)
Notes
- Prefer precision over verbosity; rules should be unambiguous and testable.
- Keep the ghost repo free of implementation code and packaging scaffolding.
Zig notes
- Running upstream tests: prefer `zig build test` (if `build.zig` defines tests); otherwise `zig test path/to/file.zig` for the library root and any test entrypoints.
- Operation ids for methods: treat a first parameter named `self` of type `T`/`*T` as an instance method (`T#method`); otherwise use `T.method`.
- `comptime` parameters: record allowed values in `SPEC.md`, and represent them as ordinary fields in `tests.yaml` inputs.
- Allocators/buffers: if the API takes `std.mem.Allocator` or caller-provided buffers, specify ownership and mutation rules; assume allocations succeed unless tests cover OOM.
- Errors:
  - Functional layout: keep `tests.yaml` strict (`error: true` only); in a Zig adapter, treat "any error return" as a passing error case and rely on `SPEC.md` for exact conditions.
  - Protocol/CLI layout: prefer explicit machine-readable error payload assertions plus exit codes.
- YAML tooling: Zig stdlib has JSON but not YAML; for adapters/implementations it's fine to convert `tests.yaml` to JSON (or JSONL) as an intermediate and have a Zig runner parse it via `std.json`.
Resources
- `references/templates.md` (artifact outlines and YAML format)
- `references/verification.md` (verification checklist + `VERIFY.md` template)