EDD — Eval-Driven Development
TDD is for code. EDD is for context.
Write behavioral assertions about how your agent should behave. Engineer context until assertions pass. Never ship a harness change without evidence it helped.
The Boundary Rule
Will this context/prompt run more than ~20 times? → Use EDD.
Still figuring out what "good" means? → Don't use it yet. Explore first.
EDD applies to harness artifacts — system prompts, tool definitions, retrieval strategies, instruction structures, few-shot examples. It does NOT apply to feature code (use TDD) or one-off prompts (use your eyes).
When It Shines
- Iterating on a reusable harness — every change validated, regressions caught
- "It used to work" — diff scores against last known-good state
- Competing approaches — "few-shot vs. detailed instructions?" Run both, compare scores
- Handoffs — assertions are executable documentation of what the harness does
- Safety/compliance — "must never expose PII" is a natural assertion
- Diminishing returns — score curve tells you when to stop tweaking
When It Doesn't
- One-off tasks — writing assertions costs more than reading the output
- Exploratory phase — you don't know what good looks like yet
- Creative/subjective output — quality is taste, not measurable
- Rapid prototyping — eval friction kills exploration speed
- Simple mechanical prompts — if you can eyeball correctness in 2 seconds, skip it
The EDD Loop
digraph edd {
rankdir=TB;
node [shape=box];
define [label="1. DEFINE\nWrite 5-15 assertions\nacross categories"];
cases [label="2. TEST CASES\n3-5 representative inputs"];
baseline [label="3. BASELINE\nRun current harness\nagainst assertions\n(3-5 runs per case)"];
change [label="4. CHANGE ONE THING\nModify one context variable"];
eval [label="5. EVAL\nRun against assertions"];
regressed [label="Regressed?" shape=diamond];
improved [label="Improved?" shape=diamond];
allpass [label="All passing?" shape=diamond];
simplify [label="6. SIMPLIFY\nRemove non-load-bearing\ncontext (assertions\nstill pass without it?)"];
suite [label="7. ADD TO SUITE\nAssertions join\nregression suite"];
revert [label="Revert change\ntry different approach"];
fix [label="Fix regression\nbefore continuing"];
define -> cases -> baseline -> change -> eval -> regressed;
regressed -> fix [label="yes"];
fix -> eval;
regressed -> improved [label="no"];
improved -> revert [label="no"];
revert -> change;
improved -> allpass [label="yes"];
allpass -> change [label="no"];
allpass -> simplify [label="yes"];
simplify -> suite;
}
Key Discipline
- Step 4 is "change ONE thing" — not three. Otherwise you can't attribute improvement.
- Step 5 checks regressions first — a change that improves 3 assertions but breaks 2 is not progress.
- Step 6 is the pruning step — once green, remove context. If assertions still pass without it, it was deadwood.
- Multiple runs per eval — LLM output is stochastic. A single run proves nothing.
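The decision gates in steps 5-7 reduce to a small amount of logic. A minimal sketch in Python (the per-assertion pass rates and the 0.8 threshold come from the Confidence section below; everything else is a hypothetical stand-in for your own tooling, not a prescribed API):

```python
def gate_change(baseline: dict[str, float], scores: dict[str, float],
                threshold: float = 0.8) -> str:
    """Decision gates for one single-variable change (steps 5 onward).

    baseline/scores: per-assertion pass rates in [0, 1], each already
    aggregated over 3-5 runs per case. How you produce them is up to
    your tooling; this sketch only encodes the control flow.
    """
    regressions = [a for a, s in scores.items() if s < baseline[a]]
    if regressions:
        # A change that improves 3 assertions but breaks 2 is not progress.
        return f"regressed on {regressions}: fix before continuing"
    if sum(scores.values()) <= sum(baseline.values()):
        return "no improvement: revert and try a different approach"
    if all(s >= threshold for s in scores.values()):
        return "all passing: run the simplification pass"
    return "improved: change the next single variable"
```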
Assertion Taxonomy
Assertions are the core of EDD. Bad assertions give false confidence. Good assertions catch real regressions.
Behavioral — What the agent DOES
What actions, tool calls, or decisions the agent makes given the context.
"Agent calls search_codebase before generating code"
"Agent asks for clarification when the task is ambiguous"
"Agent creates a test file before the implementation file"
Safety — What the agent must NOT do
Hard boundaries that should never be crossed regardless of input.
"Agent never exposes API keys in output"
"Agent refuses to modify production database without confirmation"
"Agent does not hallucinate tool names that don't exist"
Structural — What the output LOOKS like
Format, structure, and organization of the output.
"Response includes a code block with the solution"
"Output follows the project's naming conventions"
"Generated files are placed in the correct directory"
Quality — How GOOD the output is
Precision, accuracy, and depth of the agent's work.
"Generated code handles the edge case described in the harness"
"Agent uses internal terminology, not generic alternatives"
"Solution addresses the root cause, not just the symptom"
Efficiency — How much WASTE is avoided
Work that the context should eliminate.
"Agent doesn't ask questions already answered in the harness"
"Agent takes fewer tool calls than baseline to reach same outcome"
"Agent doesn't generate-then-discard incorrect approaches"
Writing Good Assertions
| Do | Don't |
|---|---|
| Specific and verifiable | Vague ("agent should be helpful") |
| Observable from output | Requires reading the agent's mind |
| Discriminating (fails without harness) | Passes regardless of context |
| Independent (one thing per assertion) | Compound ("does X AND Y AND Z") |
Litmus test: Can a reviewer grade this as PASS/FAIL in under 30 seconds by reading the output? If not, make it more specific.
Confidence and Stochasticity
LLM outputs are non-deterministic. A single run is anecdote, not evidence.
How Many Runs?
| Context | Minimum runs per eval |
|---|---|
| Quick iteration (exploring) | 3 |
| Confident change (shipping) | 5 |
| High-stakes (safety/compliance) | 10+ |
What Counts as "Passing"?
An assertion passes for a given eval when it passes in at least 80% of runs (e.g., 4/5, 8/10). Adjust threshold based on stakes:
- Convenience harness: 70% may be acceptable
- Safety assertion: 100% or it's not passing
Detecting Flaky Assertions
An assertion is flaky if it passes in 40-60% of runs consistently. Flaky assertions are either:
- Poorly written — the assertion is ambiguous, so grading varies. Fix the assertion.
- At the model's capability boundary — the context helps sometimes but not reliably. Accept the flakiness or redesign the context approach.
Don't ignore flaky assertions. Diagnose them.
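The thresholds above reduce to a few lines. A sketch (the thresholds are the ones from this section; note the flaky check here is per-eval, so confirm the 40-60% rate holds across several cycles before acting on it):

```python
def classify(passes: int, runs: int, threshold: float = 0.8) -> str:
    """Grade one assertion over several runs of the same eval.

    threshold: 0.8 default, 0.7 for convenience harnesses,
    1.0 for safety assertions.
    """
    rate = passes / runs
    if 0.4 <= rate <= 0.6:
        # Flaky: either the assertion is ambiguous (fix the wording)
        # or the context sits at the model's capability boundary.
        return "flaky: diagnose before trusting this result"
    return "pass" if rate >= threshold else "fail"

assert classify(4, 5) == "pass"                  # 80% clears the default bar
assert classify(9, 10, threshold=1.0) == "fail"  # safety: 100% or nothing
assert classify(5, 10).startswith("flaky")       # 50% of runs, consistently
```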
Grading
Deterministic Grading (Preferred)
When possible, grade assertions automatically:
- String/regex matching on output
- Tool call sequence verification
- File existence / content checks
- Structured output validation
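Two of these sketched concretely. `search_codebase` is the tool named in the behavioral example above; `generate_code` and the key pattern are illustrative assumptions, so adapt both to what your harness actually logs:

```python
import re

# Illustrative secret pattern: adjust to the key formats you actually use.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}|api[_-]?key\s*[:=]\s*\S+",
                         re.IGNORECASE)

def grade_no_api_keys(output: str) -> bool:
    """Safety: fail on anything that looks like a leaked credential."""
    return KEY_PATTERN.search(output) is None

def grade_search_before_code(tool_calls: list[str]) -> bool:
    """Behavioral: search_codebase must precede any code generation."""
    if "generate_code" not in tool_calls:
        return True   # nothing generated, nothing to violate
    if "search_codebase" not in tool_calls:
        return False  # generated code without searching first
    return tool_calls.index("search_codebase") < tool_calls.index("generate_code")

assert grade_no_api_keys("Here is the fix: use env vars.")
assert not grade_search_before_code(["read_file", "generate_code"])
```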
LLM-as-Judge
When assertions require judgment ("uses appropriate terminology", "handles edge case correctly"):
- Use a separate LLM call with the assertion and the output
- Provide explicit grading criteria, not just the assertion text
- Be aware: LLM judges are lenient by default. Add "be skeptical — surface-level compliance is a FAIL"
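A sketch of a judge call with the skepticism instruction baked in. The prompt wording is illustrative; `call_model` is whatever function wraps your model API, passed in so the sketch stays stack-agnostic:

```python
JUDGE_TEMPLATE = """You are grading an agent's output against one assertion.

Assertion: {assertion}

Grading criteria:
{criteria}

Agent output:
---
{output}
---

Be skeptical: surface-level compliance is a FAIL. The output must
actually satisfy the assertion, not merely mention related words.
Answer with exactly one word: PASS or FAIL."""

def judge(assertion: str, criteria: str, output: str, call_model) -> bool:
    """call_model: any callable taking a prompt string and returning text."""
    reply = call_model(JUDGE_TEMPLATE.format(
        assertion=assertion, criteria=criteria, output=output))
    return reply.strip().upper().startswith("PASS")
```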
Human Review
For high-stakes assertions or when you distrust automated grading:
- Present output + assertion to the developer
- Binary PASS/FAIL, no partial credit
- Batch reviews to reduce friction
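A minimal sketch of a batched review loop at the terminal. The presentation is an assumption; the discipline it encodes (binary verdicts, graded in one sitting) is the point:

```python
def batch_review(items: list[tuple[str, str]]) -> dict[str, bool]:
    """Grade (assertion, output) pairs in one sitting, PASS/FAIL only."""
    verdicts = {}
    for assertion, output in items:
        print(f"\nASSERTION: {assertion}")
        print("-" * 60)
        print(output)
        print("-" * 60)
        answer = ""
        while answer not in ("p", "f"):
            answer = input("pass or fail? [p/f] ").strip().lower()
        verdicts[assertion] = (answer == "p")
    return verdicts
```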
Integration with context-eval
EDD is the methodology. context-eval is the measurement engine.
| Concern | EDD | context-eval |
|---|---|---|
| When to write assertions | Yes — before any harness change | No — takes assertions as input |
| How to structure the eval loop | Yes — change one thing, check regressions | No — runs a single comparison |
| How to measure delta | No — delegates this | Yes — pass rates, benefit-per-kilotoken |
| How to grade outputs | No — delegates this | Yes — grading protocol |
| When to stop iterating | Yes — diminishing returns detection | No — reports results, doesn't advise |
The handoff: EDD defines what to measure and when. context-eval does the measurement. EDD interprets the results and decides next steps.
To run an eval cycle, use context-eval's eval loop (steps 2-7) with EDD's assertions and test cases as input. EDD adds the outer loop: which variable to change, regression checking, and the simplification pass.
The Simplification Pass
Once all assertions pass, EDD borrows from the bonsai discipline: remove context to verify what's load-bearing.
For each section/instruction in the harness:
1. Remove it temporarily
2. Run the eval suite
3. Did any assertion regress?
   - Yes → it's load-bearing. Keep it.
   - No → it's deadwood. Cut it permanently.
This is the highest-leverage step in the loop. Most harnesses carry 20-40% deadwood — context that costs tokens but doesn't change behavior. Cutting it improves latency, reduces cost, and often improves output quality by reducing noise.
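A sketch of this pass, treating the harness as an ordered list of sections; `run_suite` stands in for whatever builds a harness from the given sections, executes the eval suite (multiple runs per case), and returns per-assertion pass rates:

```python
from typing import Callable

# run_suite: your eval runner. Its signature here is an assumption.
RunSuite = Callable[[list[str]], dict[str, float]]

def prune_deadwood(sections: list[str], run_suite: RunSuite,
                   threshold: float = 0.8) -> list[str]:
    """Drop each section in turn; keep it only if removal regresses."""
    kept = list(sections)
    for section in list(kept):
        trial = [s for s in kept if s is not section]
        if all(rate >= threshold for rate in run_suite(trial).values()):
            kept = trial   # deadwood: every assertion still passes
        # else: load-bearing, it stays
    return kept
```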
Managing the Eval Suite Over Time
When to Add Assertions
- New capability added to the harness → add assertions for the new behavior
- Bug found in production → add a regression assertion before fixing
- User reports unexpected behavior → encode the expectation as an assertion
When to Retire Assertions
- The harness no longer claims to address that behavior
- The assertion hasn't failed in 10+ consecutive eval cycles (consider demoting it to a spot-check)
- The assertion is non-discriminating (passes with and without harness)
Suite Hygiene
- Review the full suite every ~10 harness iterations
- Flag and fix flaky assertions — they erode trust in the suite
- Keep the suite runnable in under 5 minutes for quick iteration; maintain a separate "full suite" for pre-ship validation
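The non-discrimination check above is mechanical once you have pass rates with and without the harness. A sketch (the 0.1 margin is an illustrative default, not a standard):

```python
def non_discriminating(with_harness: dict[str, float],
                       without_harness: dict[str, float],
                       margin: float = 0.1) -> list[str]:
    """Flag assertions that pass about as often without the harness."""
    return [a for a, rate in with_harness.items()
            if rate >= 0.8 and without_harness.get(a, 0.0) >= rate - margin]

flagged = non_discriminating(
    {"search-first": 0.9, "uses-markdown": 1.0},
    {"search-first": 0.2, "uses-markdown": 0.95},
)
assert flagged == ["uses-markdown"]  # passes regardless of context: retire it
```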
Anti-Patterns
| Anti-pattern | Symptom | Fix |
|---|---|---|
| Vibes-driven iteration | "It seems better" without evidence | Run the eval. Numbers don't lie. |
| Changing multiple variables | Can't tell which change helped | One change per cycle. Always. |
| Assertion-free shipping | Harness changes go out without eval | No commit without a green eval run. |
| Testing theater | Assertions that always pass | Check discrimination — does it fail without the harness? |
| Over-specifying | Assertions so rigid they break on valid variations | Assert behavior, not exact wording. |
| Ignoring regressions | "That assertion wasn't important anyway" | All regressions are blockers until explicitly retired. |
| Skipping the simplification pass | Harness grows monotonically | Prune after every green cycle. |
| Eval suite rot | Suite hasn't been updated in months | Review every ~10 iterations. Retire stale assertions. |
Quick Reference
EDD in 30 seconds:
1. Write assertions (what should the agent do / not do?)
2. Baseline (how does it score now?)
3. Change ONE thing in the harness
4. Eval (did the score improve? any regressions?)
5. Repeat 3-4 until all assertions pass
6. Simplify (remove context that isn't load-bearing)
7. Add assertions to regression suite