# Test Oracle Generator
An oracle answers: "is this output correct?" For `add(2, 3)`, the oracle is trivial: `== 5`. For `optimize_route(cities)`, what's the right answer? The oracle is the hard part.
## Oracle types — pick the cheapest that works
| Oracle type | How it works | When available |
|---|---|---|
| Known value | Hardcoded expected output | Small inputs you can compute by hand |
| Reference implementation | Compare against a trusted other implementation | A slow/simple version exists |
| Inverse function | `decode(encode(x)) == x` | There's a round-trip |
| Invariant / property | Output satisfies a predicate | You know properties, not values |
| Metamorphic relation | Multiple runs relate in a known way | → metamorphic-test-generator |
| Differential | N implementations should agree | Multiple implementations exist |
| Regression (golden) | Output matches a saved previous output | You trust the current behavior |
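To make the property-style rows concrete, here is a minimal metamorphic check (with a hypothetical `total` standing in for the function under test): it asserts a relation between two runs rather than an exact expected value.

```python
import random

def total(xs):
    # Stand-in for the function under test.
    return sum(xs)

def check_permutation_invariance(xs):
    # Metamorphic relation: running on a shuffled copy of the input
    # must produce the same output as running on the original.
    shuffled = xs[:]
    random.shuffle(shuffled)
    return total(xs) == total(shuffled)

assert check_permutation_invariance([3, 1, 4, 1, 5])
```

No exact answer is hardcoded anywhere; the relation between the two runs is the oracle.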
## Decision flow
```
Can you compute the answer by hand for small inputs?
│
├─ YES → Known-value oracle for those inputs.
│        Still need something for random/large inputs — continue ↓
│
Is there a simpler/slower implementation that's obviously correct?
│
├─ YES → Reference implementation oracle.
│        assert fast(x) == slow_obvious(x) for random x.
│
Is there an inverse? (encode/decode, compress/decompress, serialize/parse)
│
├─ YES → Round-trip oracle. assert decode(encode(x)) == x
│
Do you know properties the output must have, even if not the exact value?
│
├─ YES → Invariant oracle.
│        result = sort(x); assert is_sorted(result) and is_permutation(result, x)
│
None of the above?
│
└─ Regression oracle (golden files) as a last resort.
   Captures current behavior — not correctness.
```
## Worked example — reference implementation
Under test: `fast_median(nums)` — O(n) quickselect-based.

Oracle: the obvious O(n log n) version:
```python
def slow_median(nums):
    s = sorted(nums)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
```
```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers(), min_size=1))
def test_fast_matches_slow(nums):
    assert fast_median(nums) == slow_median(nums)
```
`slow_median` is three lines and obviously correct. `fast_median` is 40 lines of partitioning. Any disagreement is a bug in `fast_median`.
## Worked example — invariant oracle
Under test: `schedule(tasks, workers)` — assigns tasks to workers, minimizing makespan. NP-hard; you can't compute the optimal answer.
What you can check:
```python
@given(tasks=task_lists(), workers=st.integers(1, 10))
def test_schedule_is_valid(tasks, workers):
    assignment = schedule(tasks, workers)

    # Every task assigned exactly once
    assigned = [t for w in assignment.values() for t in w]
    assert sorted(assigned) == sorted(tasks)

    # No worker index out of range
    assert all(0 <= w < workers for w in assignment)

    # Makespan is no worse than the trivial round-robin
    # (not optimal — but if we're worse than round-robin, something's very wrong)
    trivial = makespan(round_robin(tasks, workers))
    assert makespan(assignment) <= trivial
```
Three invariants. None of them is "the answer is X." All of them catch real bugs.
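The test above leans on `round_robin` and `makespan` helpers. A minimal sketch, assuming each task is simply its duration (the real task representation may differ):

```python
def round_robin(tasks, workers):
    # Deal tasks out in turn: task i goes to worker i % workers.
    assignment = {w: [] for w in range(workers)}
    for i, task in enumerate(tasks):
        assignment[i % workers].append(task)
    return assignment

def makespan(assignment):
    # Makespan = finish time of the busiest worker,
    # where each task contributes its duration.
    return max(sum(tasks) for tasks in assignment.values())

rr = round_robin([5, 3, 8, 2], 2)
assert makespan(rr) == max(5 + 8, 3 + 2)  # worker 0: [5, 8], worker 1: [3, 2]
```

Note the oracle bound is deliberately loose: round-robin is easy to trust precisely because it makes no attempt to be clever.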
## Worked example — inverse
Under test: `serialize(obj) -> bytes` and `deserialize(bytes) -> obj`.
```python
@given(arbitrary_objects())
def test_roundtrip(obj):
    assert deserialize(serialize(obj)) == obj
```
One line. Tests both functions against each other. Catches: field dropped in `serialize`, wrong type on `deserialize`, encoding mismatches.
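A self-contained version of the same round-trip pattern, using JSON as a stand-in serializer (an assumption for illustration; the real `serialize`/`deserialize` pair is whatever you ship):

```python
import json

def serialize(obj):
    # Stand-in serializer: stable key order makes output deterministic.
    return json.dumps(obj, sort_keys=True).encode("utf-8")

def deserialize(data):
    return json.loads(data.decode("utf-8"))

# Round-trip oracle over a handful of representative values.
for obj in [{"a": 1}, [1, 2, 3], "text", None, {"nested": {"x": [True, False]}}]:
    assert deserialize(serialize(obj)) == obj
```

The caveat from the invariant section applies here too: a round-trip only proves the pair is consistent, not that the wire format is what the spec demands. Pair it with a few known-value checks on the bytes if the format matters.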
## Regression (golden) — the weakest oracle
When nothing else works: run once, save the output, assert future runs match.
```python
def test_render_matches_golden():
    output = render(template, data)
    golden_path = Path("test/golden/render.txt")
    if UPDATE_GOLDEN:
        golden_path.write_text(output)
    assert output == golden_path.read_text()
```
This tests stability, not correctness. The first golden might be wrong. Use only when:
- You've manually verified the golden is correct, once.
- The function is mostly-frozen (change = deliberate).
- Better oracles are genuinely unavailable.
## Do not
- Do not use the code-under-test as its own oracle. `assert f(x) == f(x)` tests nothing. Even subtler: reference implementations that share a buggy helper with the fast version.
- Do not default to golden files. They test "same as yesterday," not "correct." A bug that was always there stays there.
- Do not write invariants so weak they never fail. `assert len(result) >= 0` — always true, useless.
- Do not forget that reference implementations can be wrong too. `slow_median` above — does it handle the even-length case right? (It does. But check.)
## Output format
```
## Under test
<function — and why the oracle is non-trivial>

## Oracle type
<known-value | reference | inverse | invariant | metamorphic | differential | golden>

## Why this oracle
<decision-flow reasoning — what cheaper oracles were unavailable>

## Oracle implementation
<code — the reference impl, the invariant predicates, the round-trip, etc>

## Test
<code — uses the oracle>

## Oracle validity
<why you trust the oracle — it's simpler, it's from a different codebase, the invariant is from the spec>
```