dec-bench-evals
DEC Bench Evals
Use this skill when the user wants to create or extend a DEC Bench scenario. The goal is a deterministic, runnable evaluation, not a vague benchmark idea.
Quick Start
Default authoring loop:
dec-bench create --name <id> --domain <domain> --tier <tier>
dec-bench validate --scenario <id>
dec-bench build --scenario <id> --harness <harness> --agent <agent> --model <model> --version <version>
dec-bench run --scenario <id> --harness <harness> --persona naive --mode no-plan
dec-bench results --latest --scenario <id>
dec-bench audit open --scenario <id> --run-id <run-id>
dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>
Rules:
- Run `dec-bench validate` before `build` or `run`.
- Treat `build` and `run` as separate checks: `build` verifies the image path, `run` verifies scoring behavior.
- Use `results` to inspect the latest run before opening the audit UI.
- Use `audit open` for the browser view, or `audit export` if you only need the bundle.
- If the workspace is not a DEC Bench repo, stop and ask whether the user wants a DEC Bench scenario scaffold or only a scenario design proposal.
Before You Scaffold
Decide these first:
- Scenario ID: lowercase, hyphenated, specific to the task.
- Domain: one of `foo-bar`, `b2b-saas`, `b2c-saas`, `ugc`, `e-commerce`, `advertising`, `consumption-based-infra`.
- Tier: `tier-1`, `tier-2`, or `tier-3`.
- Starting state: broken/incomplete or clean/greenfield.
- Primary competency: the main reasoning skill the eval is testing.
- Harness: `base-rt`, `classic-de`, `olap-for-swe`, or a justified custom harness.
- Success criteria: concrete pass/fail checks, not subjective judgments.
Prefer the smallest tier that still exercises the intended competency. Keep the starting state deterministic and easy to reset.
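The decisions above can be checked mechanically before scaffolding. The following is a minimal sketch, not part of the `dec-bench` CLI: the function name `checkScaffoldInputs`, the `ScaffoldInputs` shape, and the exact ID pattern are assumptions made for illustration. The domain and tier values are the ones listed above.

```typescript
// Hypothetical pre-flight check; dec-bench itself may validate differently.
const DOMAINS: string[] = [
  "foo-bar", "b2b-saas", "b2c-saas", "ugc",
  "e-commerce", "advertising", "consumption-based-infra",
];
const TIERS: string[] = ["tier-1", "tier-2", "tier-3"];

// Assumed reading of "lowercase, hyphenated":
// lowercase alphanumeric words joined by single hyphens.
const ID_PATTERN = /^[a-z0-9]+(-[a-z0-9]+)*$/;

interface ScaffoldInputs {
  id: string;
  domain: string;
  tier: string;
}

// Returns a list of human-readable problems; an empty list means the inputs look usable.
function checkScaffoldInputs(input: ScaffoldInputs): string[] {
  const problems: string[] = [];
  if (!ID_PATTERN.test(input.id)) {
    problems.push("id must be lowercase and hyphenated: " + input.id);
  }
  if (DOMAINS.indexOf(input.domain) === -1) {
    problems.push("unknown domain: " + input.domain);
  }
  if (TIERS.indexOf(input.tier) === -1) {
    problems.push("unknown tier: " + input.tier);
  }
  return problems;
}
```

A call such as `checkScaffoldInputs({ id: "late-orders-dedup", domain: "e-commerce", tier: "tier-1" })` returns an empty list, while an ID with uppercase letters or underscores is reported as a problem.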
What Good Evals Look Like
Good DEC Bench scenarios:
- test one clear workflow or failure mode
- use realistic but compact seed data
- make the agent resolve observable constraints
- score behavior with deterministic assertions
- keep setup reproducible across repeat runs
Avoid:
- LLM-as-judge scoring
- vague tasks like "improve the pipeline"
- hidden state that changes between runs
- prompts that move the goalposts between personas
- assertions with side effects unless the gate explicitly needs rerun behavior
Scaffold Output
dec-bench create generates a scenario directory with:
- `prompts/naive.md`
- `prompts/savvy.md`
- `init/`
- `assertions/functional.ts`
- `assertions/correct.ts`
- `assertions/robust.ts`
- `assertions/performant.ts`
- `assertions/production.ts`
- `scenario.json`
- `supervisord.conf`
Work through those files in that order.
Prompt Rules
Both prompts must target the same acceptance criteria.
- `naive.md`: plain language, minimal implementation hints, no named tools unless the task would naturally mention them.
- `savvy.md`: explicit tools, schemas, paths, constraints, and operational details.
- Do not make the savvy prompt easier by changing the required outcome.
- Keep prompts specific enough that assertions feel inevitable rather than surprising.
Infrastructure Rules
Use init/ and supervisord.conf to create the starting state.
- Broken/incomplete start: seed healthy-enough infrastructure plus one or more diagnosable defects.
- Clean/greenfield start: seed healthy infrastructure and realistic source data, then let the agent build the missing solution.
- Keep data deterministic.
- Expose connection settings through environment variables that both the agent and assertions can consume.
- Start only the services the scenario needs.
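One way to keep seed data deterministic is to derive it from a fixed-seed generator instead of `Math.random()` or the wall clock. The sketch below uses a simple linear congruential generator with the common Numerical Recipes constants. The `OrderRow` shape, the seed value, and the function names are illustrative assumptions, not DEC Bench APIs.

```typescript
// Hypothetical row shape for illustration only.
interface OrderRow {
  orderId: number;
  customerId: number;
  amountCents: number;
  createdAt: string;
}

// Linear congruential generator: the same non-negative integer seed
// always yields the same sequence of values in [0, 1).
function makeLcg(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Builds `count` rows whose contents depend only on `count` and `seed`.
function buildSeedOrders(count: number, seed: number): OrderRow[] {
  const next = makeLcg(seed);
  const baseMs = Date.UTC(2024, 0, 1); // fixed epoch, never Date.now()
  const rows: OrderRow[] = [];
  for (let i = 0; i < count; i++) {
    rows.push({
      orderId: i + 1,
      customerId: 1 + Math.floor(next() * 50),
      amountCents: 100 + Math.floor(next() * 9900),
      createdAt: new Date(baseMs + i * 60000).toISOString(),
    });
  }
  return rows;
}
```

Because every value comes from the seed and the loop index, re-running the init step reproduces the same rows, which keeps assertions stable across repeat runs.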
In supervisord.conf:
- use explicit programs and startup order
- keep `autorestart=false`
- avoid incidental background services
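The `autorestart=false` rule is easy to check before a build. The sketch below scans the text of a `supervisord.conf` and reports every `[program:...]` section that does not set `autorestart=false` explicitly. The function name is hypothetical and the parsing is deliberately simplified (no includes, no environment expansion), so treat it as an illustration and not as a DEC Bench validator.

```typescript
// Returns the names of [program:x] sections that do not set autorestart=false.
// Sections that omit the key are reported too, because supervisord's
// default for autorestart is "unexpected", not "false".
function programsMissingAutorestartFalse(conf: string): string[] {
  const offenders: string[] = [];
  let current: string | null = null; // name of the program section being read
  let sawFalse = false;

  const flush = () => {
    if (current !== null && !sawFalse) offenders.push(current);
  };

  for (const rawLine of conf.split(/\r?\n/)) {
    // Drop inline "; comment" suffixes, then surrounding whitespace.
    const line = rawLine.replace(/\s;.*$/, "").trim();
    if (line === "" || line.charAt(0) === ";" || line.charAt(0) === "#") continue;

    const section = /^\[(.+)\]$/.exec(line);
    if (section) {
      flush();
      const name = section[1];
      current = name.indexOf("program:") === 0 ? name.slice("program:".length) : null;
      sawFalse = false;
      continue;
    }
    if (current !== null && /^autorestart\s*=\s*false$/i.test(line)) {
      sawFalse = true;
    }
  }
  flush();
  return offenders;
}
```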
Assertion Rules
Assertions are the core of the eval. Write scenario assertions only; the framework provides universal core assertions.
- Each exported async function should test one thing.
- Function names become assertion keys in the scoring output.
- Return `AssertionResult` with `passed` plus actionable `message` and useful `details`.
- Keep assertions deterministic, fast, and side-effect free unless rerun behavior is the point.
- Put helper functions like `queryRows<T>()` inside the same gate file.
- Prefer database and artifact checks over log-text heuristics.
Use the framework context:
- `ctx.clickhouse` for ClickHouse queries
- `ctx.postgres` for Postgres queries
- `ctx.env()` for connection settings and other environment variables
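The rules above could translate into a gate file along the following lines. This is a sketch under stated assumptions: the `AssertionResult` fields follow the names mentioned above, but the `SqlClient` and `AssertionContext` interfaces are hypothetical stand-ins for the framework context, whose real client API may differ (see guide.md for the actual contract). The `orders` table and `order_id` column are invented for illustration.

```typescript
// Assumed result shape, based on the fields named in the rules above.
interface AssertionResult {
  passed: boolean;
  message: string;
  details?: Record<string, unknown>;
}

// Hypothetical stand-ins for the framework context.
interface SqlClient {
  query(sql: string): Promise<unknown[]>;
}
interface AssertionContext {
  clickhouse: SqlClient;
}

// Helper kept inside the same gate file, as recommended.
async function queryRows<T>(client: SqlClient, sql: string): Promise<T[]> {
  const rows = await client.query(sql);
  return rows as T[];
}

// One exported async function per check; its name becomes the assertion key.
export async function orders_table_has_no_duplicate_ids(
  ctx: AssertionContext
): Promise<AssertionResult> {
  const rows = await queryRows<{ dupes: number }>(
    ctx.clickhouse,
    "SELECT count() - uniqExact(order_id) AS dupes FROM orders"
  );
  const dupes = rows.length > 0 ? Number(rows[0].dupes) : NaN;
  const passed = dupes === 0;
  return {
    passed,
    message: passed
      ? "orders contains no duplicate order_id values"
      : `orders contains ${dupes} duplicate order_id rows; deduplicate before loading`,
    details: { dupes },
  };
}
```

The check is read-only and tests a single property, so it can be re-run without changing the state that later gates inspect.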
Gate model:
- Functional: it runs
- Correct: it is right
- Robust: it handles messy or repeated execution
- Performant: it meets runtime or query thresholds
- Production: you would ship it
A gate only counts if earlier gates pass. Scenario assertions must clear the 80% gate threshold together with the framework's core assertions.
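The gate ordering and threshold can be illustrated with a short sketch. It assumes that the 80% threshold is a pass ratio over all assertions in a gate (core and scenario combined) and that a gate with no assertions does not pass. Both are readings of the text above, not confirmed framework behavior, and the type and function names are hypothetical.

```typescript
type Gate = "functional" | "correct" | "robust" | "performant" | "production";

const GATE_ORDER: Gate[] = ["functional", "correct", "robust", "performant", "production"];
const GATE_THRESHOLD = 0.8; // assumed to be a pass ratio per gate

// Pass/fail outcomes of every assertion (core plus scenario) in each gate.
type GateResults = Record<Gate, boolean[]>;

function gatePasses(outcomes: boolean[]): boolean {
  if (outcomes.length === 0) return false; // assumption: an empty gate does not pass
  const passedCount = outcomes.filter((ok) => ok).length;
  return passedCount / outcomes.length >= GATE_THRESHOLD;
}

// Walks the gates in order and stops at the first failure,
// so a later gate never counts unless every earlier gate passed.
function highestGateCleared(results: GateResults): Gate | null {
  let highest: Gate | null = null;
  for (const gate of GATE_ORDER) {
    if (!gatePasses(results[gate])) break;
    highest = gate;
  }
  return highest;
}
```

Under these assumptions, a run that clears Functional but fails Correct is scored at Functional even if every Robust assertion passes.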
scenario.json Rules
Populate at least these fields:
- `id`
- `title`
- `description`
- `tier`
- `domain`
- `harness`
- `tasks`
- `personaPrompts`
- `infrastructure`
- `tags`
- `baselineMetrics`
- `referenceMetrics`
Important details:
- `personaPrompts` should point to `prompts/naive.md` and `prompts/savvy.md`.
- `tasks[]` should be concrete and categorized.
- `infrastructure.services` and `infrastructure.description` should describe the actual starting state.
- Baseline and reference metrics should be plausible, not aspirational.
For the full contract, enum values, and worked examples, see guide.md.
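As a rough illustration of the minimum field set, the sketch below checks a draft scenario object for the twelve top-level fields listed above. Only the top-level field names, the two prompt paths, and the `infrastructure.services` / `infrastructure.description` keys come from this document. Every value, the nested key names inside `tasks` and `personaPrompts`, and the helper name `missingScenarioFields` are hypothetical; guide.md remains the authority on the real schema.

```typescript
const REQUIRED_FIELDS: string[] = [
  "id", "title", "description", "tier", "domain", "harness",
  "tasks", "personaPrompts", "infrastructure", "tags",
  "baselineMetrics", "referenceMetrics",
];

// Returns the required top-level fields that are absent from a draft scenario.
function missingScenarioFields(scenario: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter(
    (field) => scenario[field] === undefined || scenario[field] === null
  );
}

// Hypothetical draft; values and nested key names are assumptions.
const draftScenario: Record<string, unknown> = {
  id: "late-orders-dedup",
  title: "Deduplicate late-arriving orders",
  description: "Orders arrive twice after a retry; the agent must deduplicate them.",
  tier: "tier-1",
  domain: "e-commerce",
  harness: "classic-de",
  tasks: [{ category: "data-quality", description: "Remove duplicate order rows" }],
  personaPrompts: { naive: "prompts/naive.md", savvy: "prompts/savvy.md" },
  infrastructure: {
    services: ["clickhouse"],
    description: "ClickHouse with a seeded orders table containing duplicates",
  },
  tags: ["dedup"],
  // baselineMetrics and referenceMetrics intentionally left out of this draft.
};

const missing = missingScenarioFields(draftScenario);
```

Here `missing` lists `baselineMetrics` and `referenceMetrics`, which is the cue to fill in plausible metrics before running `dec-bench validate`.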
Harness Guidance
Use the built-in harnesses unless the toolchain requirements are truly new.
- `base-rt`: base infrastructure plus common runtime tools
- `classic-de`: dbt and heavier data engineering tooling
- `olap-for-swe`: MooseStack-focused workflows
Create a custom harness only when the scenario genuinely needs additional packages or outbound policy changes.
Publishing Flow
To contribute an authored scenario upstream:
dec-bench registry add --scenario scenarios/<id>
dec-bench registry publish --id <id>
Use registry publish only after the scenario validates and runs locally.
Expected Output
When this skill activates, produce one of these:
- a concrete scenario proposal with domain, tier, starting state, competency, harness, and assertion plan
- direct edits to scaffolded scenario files
- a targeted extension plan for an existing scenario
Do not stop at a list of ideas. Convert the user request into runnable scenario files or a file-by-file implementation plan.
Additional Resource
Read guide.md for:
- full `scenario.json` schema and enum values
- complete assertion examples for all five gates
- naive vs. savvy prompt examples
- harness selection details
- registry publish flags and review checklist
- skills.sh-compatible installation notes for this skill