Hegel: Property-Based Testing
Hegel is a family of property-based testing libraries supporting multiple languages, powered by Hypothesis. Tests integrate with standard language test runners. Hegel generates random inputs for your code and automatically shrinks failing cases to minimal counterexamples.
Even when property-based tests (PBTs) add only modest line coverage over unit tests, their value lies in exercising combinations and boundary conditions that humans don't think to write by hand.
Code examples in this file use Python-like pseudocode to illustrate concepts. For exact API and syntax, load the language-specific reference (see step 1 of the workflow).
Workflow
Follow these steps when writing property-based tests.
1. Load the Language Reference
Determine the project language and load the corresponding reference from references/<language>/reference.md for API details and idiomatic patterns.
2. Explore the Code Under Test
Before writing any test, understand what you're testing:
- Read the source code of the function/module under test
- Read existing tests to understand expected behavior and edge cases
- Read docstrings, comments, and type signatures for documented contracts
- Read usage sites to see how callers use the code and what they expect
The goal is to find evidence for properties, not to invent them.
3. Identify Valuable Properties
Look for properties that are:
- Grounded in evidence from the code, docs, or usage patterns
- Non-trivial — they test real behavior, not tautologies, and do not duplicate the code being tested
- Falsifiable — a buggy implementation could actually violate them
Write one test per property. Don't cram multiple properties into one test.
See the Property Catalogue below for a taxonomy of what to look for.
4. Check for Existing Tests to Evolve or Port
Before writing tests from scratch, check what already exists.
Existing PBTs in another framework (proptest, quickcheck, rapid, gopter, etc.) should be ported to hegel. Load the language-specific porting reference (references/<language>/porting.md). Key things to know about hegel when porting:
- Hegel is imperative. Most PBT libraries declare what to generate in a function signature or strategy combinator. In hegel, your test receives a test case handle and calls tc.draw() whenever it needs a value. You can draw conditionally, in loops, and have later draws depend on earlier values without needing flat_map.
- Shrinking is automatic. Hegel's shrinking is handled server-side by Hypothesis. You don't implement shrink logic or define shrinking strategies.
- Standard assertions. Use the language's normal assertion mechanism. No special prop_assert! or return-a-bool pattern needed.
- Broaden your generators. Many existing PBTs use narrow input ranges because shrinking was slow or unreliable. Hegel's shrinking is more robust, so try broader generators than the originals.
Unit tests and example-based tests can often be evolved into PBTs. Tests with hardcoded seeds, parameterized examples, or multiple similar test cases are prime candidates. Load references/evolving-tests.md for detailed guidance on recognizing what property a unit test is hiding. If you can't immediately see the right property, start by parameterizing the test — replace concrete values with generated ones and keep a simple oracle. You can refine the property later.
Tests that use rand with fixed seeds are especially good candidates — the randomness should come from hegel instead so failures produce shrinkable counterexamples.
When you evolve an existing test, modify the existing test file rather than creating a new one. Property-based tests are tests like any other and belong with the code they're testing. Do not create a separate file for hegel tests.
5. Write the Tests
For each property:
- Add tests to the appropriate existing test file. Only create a new file if no relevant test file exists.
- Choose the simplest possible generators — see Generator Discipline below.
- Draw values, run the code under test, and assert the property.
6. Run and Reflect
Run the tests. When a test fails, ask:
- Is this a real bug? If the code violates its own contract, flag the bug to the user and ask what to do, or fix the code if instructed to do so.
- Is the property unsound? If you asserted something the code never promised, fix the test.
- Is the generator too broad? Only if the failing input is genuinely outside the function's domain, add constraints. Investigate before constraining.
When NOT to Write PBTs
Property-based tests aren't always the right tool. Prefer unit tests when:
- The test checks exact output. assert render(doc) == "<html>..." depends on a specific output format, so there's no general property to check.
- Complex setup dominates. Tests requiring database state, network mocks, or elaborate fixtures are hard to parameterize.
- The test checks specific error messages. Exact error string checks are a unit test concern. PBTs are better for testing that errors are raised, not what they say.
- No property is apparent. If you can't find a meaningful property after reading the code, don't force it. A good unit test beats a contrived PBT.
Property Catalogue
Use this catalogue to identify what to test. Not every category applies to every function — pick the ones supported by evidence from the code.
The first five patterns are ordered by how often they've found real bugs in practice.
Tier 1: High-Value Patterns
Model tests — For any data structure, the highest-value first test is a stateful model test: define rules for each operation (insert, remove, get, etc.), run them against both the library under test and a known-good reference (the "model"), and assert they agree after every operation. Use hegel's stateful testing support (see the language reference) rather than hand-rolling the operation loop.
The exact syntax varies significantly by language — check the language reference for the stateful testing API. Conceptually, a model test looks like:
state_machine MyMapTest:
    subject = MyMap()
    model = HashMap()

    rule insert():
        k = tc.draw(integers())
        v = tc.draw(integers())
        subject.insert(k, v)
        model.insert(k, v)

    rule remove():
        k = tc.draw(integers())
        subject.remove(k)
        model.remove(k)

    rule get():
        k = tc.draw(integers())
        assert subject.get(k) == model.get(k)

    invariant agrees:
        assert subject == model
Choose the right model: Vec for sequential containers, HashMap for hash maps, BTreeMap/sorted map for ordered maps, HashSet/set for unordered sets.
Idempotence tests — Any normalization, case conversion, or formatting function should satisfy f(f(x)) == f(x). Use full Unicode text generators (not ASCII-only) because Unicode edge cases like ß -> SS and combining characters are where bugs hide.
s = tc.draw(text())
once = normalize(s)
twice = normalize(once)
assert once == twice
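As a concrete illustration of why full Unicode text matters here, consider the following plain-Python sketch. It does not use hegel, and normalize_tag and MAX_LEN are hypothetical names invented for the example. The function truncates before upper-casing, which is idempotent on ASCII but not on ß, because Python's str.upper() maps ß to the two-character SS.

```python
MAX_LEN = 3  # hypothetical length limit, chosen only for this sketch


def normalize_tag(s: str) -> str:
    """Hypothetical normalizer: truncate to MAX_LEN, then upper-case.

    Bug: upper-casing can lengthen the string ('ß' becomes 'SS'), so the
    result can exceed MAX_LEN and a second pass truncates it again.
    """
    return s[:MAX_LEN].upper()


def is_idempotent_on(s: str) -> bool:
    """Check f(f(s)) == f(s) for a single input."""
    once = normalize_tag(s)
    return normalize_tag(once) == once


ascii_ok = is_idempotent_on("abc")    # ASCII never exposes the bug
unicode_ok = is_idempotent_on("aßb")  # 'ASSB' on the first pass, 'ASS' on the second
```

An ASCII-only generator would report this function as idempotent forever; a full-text generator can reach the failing input.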
Parse robustness — Parsers (from_str, parse, decode) should handle all input without panicking. The property is simple: it should never crash, even on garbage input.
s = tc.draw(text())
_ = MyType.parse(s) # should return an error, never panic
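The kind of input that breaks a parser is often one the author believed was already filtered out. The sketch below is plain Python with hypothetical function names, and it assumes a contract of "return None on invalid input, never raise". The buggy version validates with str.isdigit(), which accepts characters such as the superscript ² that int() rejects.

```python
from typing import Optional


def parse_count_buggy(s: str) -> Optional[int]:
    """Hypothetical parser: validates with isdigit(), then converts."""
    if not s.isdigit():
        return None
    return int(s)  # raises ValueError on '²', which isdigit() accepted


def parse_count(s: str) -> Optional[int]:
    """Same contract, but isdecimal() is stricter than isdigit()."""
    if not s.isdecimal():
        return None
    try:
        return int(s)
    except ValueError:
        # Covers any remaining input that int() still refuses.
        return None


def crashes(parser, s: str) -> bool:
    """The robustness property for one input: the parser must not raise."""
    try:
        parser(s)
    except Exception:
        return True
    return False


buggy_crashes = crashes(parse_count_buggy, "²")
fixed_crashes = crashes(parse_count, "²")
```

In a property-based test, the single input "²" would be replaced by arbitrary drawn text.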
Roundtrip tests — parse(format(x)) == x for any serialize/deserialize pair. Test with the full input domain. Bugs hide at zero (scientific notation edge cases), large integers (precision loss through f64 for values > 2^53), and unusual string content.
n = tc.draw(integers())
s = format(n)
assert parse(s) == n
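The precision-loss case can be reproduced in a few lines of plain Python. In this sketch, to_wire and from_wire are hypothetical stand-ins for a serialize/deserialize pair that routes integers through a 64-bit float. The roundtrip holds at 2^53 and fails at 2^53 + 1.

```python
def to_wire(n: int) -> str:
    """Hypothetical serializer that converts the integer to a float first."""
    return repr(float(n))


def from_wire(s: str) -> int:
    """Hypothetical deserializer for the format above."""
    return int(float(s))


small = 2**53      # largest power of two where every integer below is exact
large = 2**53 + 1  # not representable as a 64-bit float

small_roundtrips = from_wire(to_wire(small)) == small
large_roundtrips = from_wire(to_wire(large)) == large
```

A generator bounded to a "reasonable" range such as 0 to 1,000,000 would never reach the failing value.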
Boundary value tests — Integer boundary values (MIN, MAX, 0) are where overflow bugs hide. Don't add bounds to avoid them — they ARE the test. Negating MIN overflows, intermediate products overflow, GCD/LCM computations overflow on boundary inputs.
a = tc.draw(integers()) # includes MIN, MAX, 0
b = tc.draw(integers())
tc.assume(b != 0)
result = my_numeric_op(a, b) # should not overflow/panic
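Python integers do not overflow, so the sketch below models 32-bit wrapping arithmetic by hand to show why MIN is special. wrap_i32 and abs_i32 are hypothetical helpers written for this example, not part of any library. The property "abs(x) is never negative" holds for every value except MIN.

```python
I32_MIN = -2**31
I32_MAX = 2**31 - 1


def wrap_i32(x: int) -> int:
    """Reduce a Python int to the value a wrapping 32-bit signed integer holds."""
    return (x - I32_MIN) % 2**32 + I32_MIN


def abs_i32(x: int) -> int:
    """Hypothetical abs() as fixed-width code computes it: negate when negative."""
    return wrap_i32(-x) if x < 0 else x


# The boundary values named in the text, plus their immediate neighbours.
results = {x: abs_i32(x) for x in (I32_MIN, -1, 0, 1, I32_MAX)}
violations = [x for x, r in results.items() if r < 0]
```

Here violations contains only I32_MIN, because negating it wraps back to itself.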
Tier 2: General Property Categories
| Category | Description | Example |
|---|---|---|
| Commutativity | order of operations doesn't matter | a + b == b + a or f(g(x)) == g(f(x)) |
| Invariant preservation | an operation maintains a structural property | insert into BST preserves ordering |
| Oracle / reference impl | compare against a known-correct implementation | my_sort(xs) == std_sort(xs) |
| Monotonicity | more input means more (or equal) output | len(xs ++ ys) >= len(xs) |
| Bounds / contracts | output stays within documented limits | clamp(x, lo, hi) is in [lo, hi] |
| No-crash / robustness | function handles all valid inputs without panicking | parse(arbitrary_string) doesn't crash |
| Equivalence | two implementations produce the same result | iterative_fib(n) == recursive_fib(n) |
| Consistency | related APIs in the same library agree | string_width(s) == sum(char_width(c) for c in s) |
| Large input sizes | exercise deep structure paths that small inputs miss | draw size separately, force 50-200+ elements for trees/tries |
| Feature flag testing | non-default features are often less tested | enable SIMD, nightly, or experimental features and run tests |
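To make the oracle row concrete, here is a plain-Python sketch in which my_sort is a hypothetical buggy implementation. Exhaustive enumeration of small inputs stands in for the values a property-based testing library would draw, which keeps the example free of any assumed hegel API.

```python
from itertools import product


def my_sort(xs):
    """Hypothetical buggy sort: going through a set silently drops duplicates."""
    return sorted(set(xs))


def first_counterexample(max_len=3, values=range(3)):
    """Compare my_sort against the built-in sorted() on every small input."""
    for n in range(max_len + 1):
        for xs in product(values, repeat=n):
            if my_sort(list(xs)) != sorted(xs):
                return list(xs)
    return None


counterexample = first_counterexample()  # shortest input with a duplicate
```

The oracle needs no knowledge of why my_sort is wrong; disagreement with the reference is enough.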
Bug Patterns by Category
| Category | What to look for |
|---|---|
| Integer overflow | Boundary values (MIN, MAX, 0) in arithmetic, GCD, negation, display |
| Idempotence failure | Case conversion / normalization with Unicode (ß -> SS), word splitting on case transitions |
| Precision loss | Numbers routed through f64 lose precision for integers > 2^53 |
| Roundtrip failure | Format/parse on edge cases: zero, empty strings, unusual path components |
| Parse panic | from_str delegates to a constructor that panics instead of returning Err |
| Stale state | Update operations that modify one index but don't clean up the old entry in another |
| Unicode line breaks | \u{85} (NEL), \u{2028} (LS), \u{2029} (PS) treated inconsistently as line breaks |
| SIMD divergence | SIMD code path produces different results than the scalar fallback |
| Deep structure bugs | Traversal that only fails when data structure has multiple internal levels (50-200+ elements) |
Choosing Properties
Properties must be evidence-based. Find evidence in:
- Names and type signatures: A function merge(a: List, b: List) -> List implies the output length might equal the sum of input lengths.
- Docstrings and comments: "Returns a sorted list" directly gives you an invariant.
- Assertions and debug checks in the source: These are properties the author already identified — they may suggest other invariants.
- Usage patterns: If callers always assume a result is non-empty, assert that.
- Existing tests: Unit tests often encode specific instances of general properties.
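The sources of evidence above translate directly into assertions. In this plain-Python sketch, merge is a hypothetical function under test (implemented with the standard library's heapq.merge), and each check is labeled with the evidence that would justify it.

```python
import heapq


def merge(a: list, b: list) -> list:
    """Stand-in for the function under test: merge two sorted lists."""
    return list(heapq.merge(a, b))


a, b = [1, 4, 4], [2, 4, 9]
out = merge(a, b)

# From the signature: two lists in, one list out, so lengths may add up.
length_holds = len(out) == len(a) + len(b)

# From a docstring such as "Returns a sorted list".
sorted_holds = out == sorted(out)

# From usage: callers expect nothing to be lost or invented.
same_items = sorted(out) == sorted(a + b)
```

In a real test the fixed lists a and b would be drawn values, sorted before the call if the contract requires sorted input.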
Err on the side of writing more properties rather than fewer. If one fails, investigate whether the failure is a bug or legitimate behavior.
Beware of properties that seem universal but aren't. Read the docs carefully before asserting a property. Examples from real testing:
- Grapheme-based string reverse is NOT an involution (reverse(reverse("\n\r")) != "\n\r" because \r\n is one grapheme cluster while \n\r is two).
- A method called difference might mean symmetric difference (A △ B), not set difference (A \ B). Check the docs.
- A function documented as "returns the largest key <= k" means <=, not <.
When a property fails, investigate whether it's a real bug or a genuine edge case in the domain. A weaker property often still holds.
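The grapheme-reverse example can be reproduced with a deliberately simplified segmenter. The sketch below is plain Python and only assumes one rule: CR followed by LF forms a single cluster. Real grapheme segmentation has many more rules, and clusters and reverse are hypothetical names. It also shows a weaker property that still holds.

```python
def clusters(s: str) -> list:
    """Simplified segmentation: CR+LF is one cluster, anything else is one code point."""
    out, i = [], 0
    while i < len(s):
        step = 2 if s[i:i + 2] == "\r\n" else 1
        out.append(s[i:i + step])
        i += step
    return out


def reverse(s: str) -> str:
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(clusters(s)))


s = "\n\r"              # LF then CR: two separate clusters
once = reverse(s)       # CR then LF: now a single cluster
twice = reverse(once)   # unchanged, because one cluster has nothing to swap

involution_holds = twice == s
# Weaker property: reversing never adds or removes code points.
same_code_points = sorted(twice) == sorted(s)
```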
Generator Discipline
The most common mistake when writing property-based tests is over-constraining generators. Broad generators find more bugs because they explore inputs the developer didn't anticipate. Constrained generators give a false sense of safety.
Start With No Bounds
If the function accepts any integer, generate any integer:
n = tc.draw(integers()) # full range of the type, no min/max
Preemptively adding bounds like .min(0).max(100) means you'll never discover that the function overflows on large values, mishandles negatives, or breaks at the type's boundaries. Those are exactly the bugs PBT is designed to find.
Edge Cases Are the Point
Don't narrow ranges to "avoid edge cases." If a function claims to work on all integers, test it on all integers — including MIN, MAX, 0, -1, and 1. If it breaks, that's valuable information.
Don't Require Non-Empty by Default
Unless the function's contract explicitly requires non-empty input, test with empty collections too. If a function panics on an empty collection, that might be a bug worth knowing about.
When a Test Fails on Extreme Values
Assume it's a real bug unless you have strong evidence otherwise. If in doubt, ask the user.
- If the function's documentation says it handles all integers but it overflows on MAX, that's a bug in the code, not in your test.
- Only add bounds after investigating and confirming the input is outside the function's documented domain.
When to Add Constraints
Add generator bounds only when:
- The function's contract explicitly excludes some inputs. For example, a square root function documents that input must be >= 0.
- You need to avoid undefined behavior. For example, division by zero.
- A test failure has been investigated and confirmed to be outside the function's domain.
Avoid Rejection Sampling Where Possible
When a constraint involves relationships between multiple generated values, you might use tc.assume():
a = tc.draw(integers())
b = tc.draw(integers())
tc.assume(a != b) # this is fine for simple constraints
But it's better to construct valid inputs directly when you can:
# Instead of tc.assume(a <= b), generate in order:
a = tc.draw(integers())
b = tc.draw(integers())
if a > b:
    a, b = b, a
This is particularly important when the rejection rate would be high. For example, integers().map(n -> n * 2) is much better than integers().filter(n -> n % 2 == 0) — the latter throws away ~50% of test cases.
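The difference between rejecting and constructing can be measured without any testing library. In this plain-Python sketch, a fixed range of integers stands in for drawn values, and ordered_pair is a hypothetical helper showing the swap technique as a reusable function.

```python
candidates = range(-1000, 1000)  # stand-in for 2000 drawn integers

# Filtering: keep only values that already satisfy the constraint.
kept_by_filter = [n for n in candidates if n % 2 == 0]

# Mapping: transform every value into one that satisfies the constraint.
kept_by_map = [n * 2 for n in candidates]


def ordered_pair(a: int, b: int) -> tuple:
    """Build a pair with a <= b by swapping instead of rejecting."""
    return (a, b) if a <= b else (b, a)


filter_rate = len(kept_by_filter) / len(candidates)  # fraction of inputs used
map_rate = len(kept_by_map) / len(candidates)
```

Every mapped value is usable, while the filter discards half of its inputs; stricter predicates discard proportionally more.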
Getting Large Collections
Hegel's default collection size is small. If you need large collections (e.g., to exercise deep tree paths or multi-level node structures), draw the size separately:
# can generate large collections, and hegel can shrink n to find the minimal size
n = tc.draw(integers(min=0, max=300))
keys = tc.draw(lists(integers(), min_size=n)) # no max_size — let hegel go bigger
# BAD — hegel's default size distribution rarely produces 100+ elements
keys = tc.draw(lists(integers()))
Use Unique Element Generation for Key Generation
When testing maps/sets that need unique keys, use the unique option on collection generators. This avoids confusion about which value wins for duplicate keys. See the language-specific reference for syntax.
Handling Randomness in Code Under Test
When the code under test requires an RNG, do not create a seeded RNG with a hegel-generated seed. Hegel can only shrink the seed integer, not the actual random decisions the RNG makes — so when a test fails, you get a meaningless minimal seed rather than a meaningful minimal sequence of random choices.
Instead, use hegel's random generator, which gives you an RNG that routes random decisions through hegel's shrinking engine. See the language-specific reference for the exact API.
Two modes: artificial vs true randomness
- Default (artificial randomness): Every random decision goes through hegel, enabling fine-grained shrinking of individual random values. Best for most code.
- True randomness mode: Generates a single seed via hegel, then creates a real RNG from it. Hegel can only shrink the seed, not individual random decisions. Use this when the code under test does rejection sampling or otherwise depends on the RNG producing statistically random-looking output — artificial randomness can cause rejection loops to hang.
How to choose: Start with the default. If tests hang or time out because the code does rejection sampling internally, switch to true randomness mode.
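A toy model can help build intuition for why rejection loops and individually controlled random decisions interact badly. The sketch below is plain Python; ChoiceBackedRandom and roll_until_six are hypothetical, and hegel's actual mechanism may differ. The only assumption is that when each decision is controlled separately, all of them can collapse to the same minimal value.

```python
class ChoiceBackedRandom:
    """Hypothetical RNG whose every decision is read from an explicit list.

    This is a model of the idea only, not hegel's implementation.
    """

    def __init__(self, choices):
        self._choices = list(choices)
        self._pos = 0

    def randint(self, lo: int, hi: int) -> int:
        # Past the end of the list, fall back to the smallest possible value.
        raw = self._choices[self._pos] if self._pos < len(self._choices) else 0
        self._pos += 1
        return lo + raw % (hi - lo + 1)


def roll_until_six(rng, max_tries=100):
    """Rejection loop: keep rolling until a 6 appears (capped for this sketch)."""
    for tries in range(1, max_tries + 1):
        if rng.randint(1, 6) == 6:
            return tries
    return None  # without the cap, this loop would never end


found = roll_until_six(ChoiceBackedRandom([2, 0, 5]))  # rolls 3, 1, 6
stuck = roll_until_six(ChoiceBackedRandom([]))         # rolls 1 forever
```

Each entry in the choice list can be simplified on its own, which is what makes fine-grained shrinking possible, but a fully simplified list never satisfies the loop's exit condition.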
Refactoring concrete RNG types
If the code under test takes a concrete RNG type rather than a trait/interface, consider whether it should be refactored to accept a generic RNG. This is both better API design and makes the code testable with hegel's random generator. Suggest this refactoring to the user.
Common Mistakes
- Over-constraining generators: Adding bounds "just in case" means the test will never find bugs at boundary values or with unexpected inputs. The whole value of PBT is exploring the input space the developer didn't think to test by hand. See Generator Discipline above.
- Testing trivial properties: assert x == x or assert len(vec) >= 0 test nothing useful. Every property should be falsifiable by a buggy implementation.
- Using the implementation as the oracle: If your test calls the same function to compute the expected result, it can never fail. Use an independent reference implementation, a simpler algorithm, or a structural property.
- High rejection rates: If .filter() or tc.assume() rejects most inputs, hegel will give up. Restructure generators to produce valid inputs directly (use .map() or dependent draws).
- Creating a separate test file for hegel tests: Property-based tests belong alongside the existing tests for the same code. Add them to existing test files.
- Using manually seeded RNGs: Use hegel's random generator so hegel controls the random decisions and can shrink them individually. See "Handling Randomness" above.
- Overflowing in test code: When computing values from generated data (e.g., map.insert(k, k * 10)), your test code itself can overflow before the library has a chance to be buggy. Use wrapping arithmetic or draw a smaller type and widen it to prevent overflow in the test. Distinguish "this constraint protects the library's contract" (keep it) from "this constraint prevents my test from overflowing" (use wrapping arithmetic instead).
- Restricting collection size for performance: If a test is slow with large collections, lower the test case count rather than restricting the input space. A slow test that finds bugs beats a fast test that can't. Many tree/trie bugs only manifest at 50-200+ elements.
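The two remedies for overflowing in test code can be checked numerically. Python integers are unbounded, so this plain-Python sketch models 64-bit limits by hand; fits_i64 and wrapping_mul_i64 are hypothetical helpers, and k * 10 is the derived value from the example in the list above.

```python
I32_MIN, I32_MAX = -2**31, 2**31 - 1
I64_MIN, I64_MAX = -2**63, 2**63 - 1


def fits_i64(x: int) -> bool:
    """True when x is representable as a 64-bit signed integer."""
    return I64_MIN <= x <= I64_MAX


def wrapping_mul_i64(a: int, b: int) -> int:
    """Multiply as a wrapping 64-bit signed integer would."""
    return (a * b - I64_MIN) % 2**64 + I64_MIN


# Keys drawn as 64-bit values: k * 10 leaves the range at the extremes.
overflows_from_i64 = not fits_i64(I64_MAX * 10)

# Remedy 1: draw the key as a 32-bit value and widen; k * 10 always fits.
widened_ok = fits_i64(I32_MIN * 10) and fits_i64(I32_MAX * 10)

# Remedy 2: keep 64-bit keys and derive the value with wrapping arithmetic.
wrapped = wrapping_mul_i64(I64_MAX, 10)
```

Both remedies keep the full key range for the library under test while making the test's own arithmetic safe.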