systematic-debugging
Systematic Debugging
No code fix without an evidenced root cause for the specific behavior being changed.
Context
Systematic debugging is the implementation-side discipline for turning a bug report, failing test, or bad runtime behavior into a defensible fix. It exists to stop guess-first patching, repeated failed fixes, and "works on my machine" improvisation.
This skill belongs in 04-implementation because its primary output is a code-level fix path: isolate the failing boundary, identify the real cause, decide whether the problem is local or structural, and then hand off to tdd and feature-development for the smallest safe change.
It does not replace incident-response. If the issue is still live in production and user impact is active, contain it first through incident-response. Once the system is in a safe mode, use this skill to establish the root cause before writing the code fix.
Inputs
- source-code -- The relevant code paths, configuration boundaries, and recent changes near the failure.
- test-suite -- Existing failing tests, regression coverage, or the closest executable safety net.
- historical-defect-context -- Optional. Prior incidents, defects, or regressions that may match the current symptom.
- fix-lineage-brief -- Optional. Prior fixes, reverts, or workarounds that may narrow the search space.
Process
Step 1: Establish the Current Failure Boundary
Write down the concrete failure first:
- what is failing now
- where it fails
- how to reproduce it
- whether the issue is local, integration-boundary, or release-boundary sensitive
If the issue is a live production incident and containment has not happened yet, stop and route through incident-response before continuing. Do not confuse containment with root-cause work.
Step 2: Pull Historical Context Before Reinventing Theory
If the symptom might match a known regression, incident, or revert pattern, invoke bug-history-retrieval before guessing. Use historical matches to focus the search space, not to skip the investigation.
Search for:
- the same exception or error signature
- the same component boundary
- the same release or deploy window
- prior workarounds that were later reverted
Step 3: Reproduce and Gather Evidence
Reproduce the failure with the smallest reliable loop available:
- a failing automated test when possible
- a deterministic manual repro when automation does not exist yet
- instrumentation at component boundaries when the failure crosses services, queues, jobs, or async hops
Do not move to fixes while the reproduction is unstable or the evidence is still vague.
Step 4: Isolate Root Cause, Not Just Symptom
Identify the narrowest cause that explains the observed failure. Check:
- recent code or config changes
- broken assumptions at contract or compatibility boundaries
- missing regression coverage
- environment or rollout differences
- incorrect workarounds copied from earlier incidents
If the evidence shows a planning, requirements, or architecture mismatch rather than a local code defect, do not keep patching. Produce a course-correction-note and route upstream.
Use this escalation rule:
- first failed fix attempt: gather more evidence
- second failed fix attempt: challenge the current hypothesis explicitly
- third failed fix attempt: assume a structural mismatch until disproven and prepare
course-correction-note
Make that escalation visible, not implicit:
| Failed fix count | Required response |
|---|---|
| first | gather better evidence before changing more code |
| second | challenge the current hypothesis in writing and identify what evidence would falsify it |
| third or more | stop local patching, assume a structural mismatch until disproven, and prepare course-correction-note |
Step 5: Choose the Smallest Safe Fix Path
Once root cause is clear, define the smallest safe change:
- what exact behavior changes now
- what must remain unchanged
- what brownfield seam, compatibility rule, or release boundary must stay protected
- whether the immediate repair is a permanent fix or an explicit workaround
Then hand off to tdd to write the reproducing or regression test before implementation code changes expand.
Step 6: Record the Debugging Result
Produce a bug-fix-report that captures:
- reproduction steps or failing test
- root cause
- evidence used to confirm it
- chosen fix scope
- regression or characterization tests required
- any follow-up debt if a workaround shipped instead of the full repair
Outputs
- bug-fix-report -- Root cause, supporting evidence, fix boundary, regression protection, and follow-up notes for the chosen repair path.
- course-correction-note -- Produced only when repeated failed fixes or hard evidence reveal that the problem belongs upstream in specification, architecture, or planning rather than in local code.
Quality Gate
- The current failure is reproducible or otherwise evidenced with concrete signals
- Root cause is stated as a falsifiable explanation, not a symptom label
- Historical matches were checked when the symptom plausibly had lineage
- The chosen fix path is the smallest safe change for the verified root cause
- Regression or characterization protection is defined before broader implementation proceeds
- Structural mismatches are escalated through
course-correction-noteinstead of patched around
Rationalization Prevention
When one of these thoughts appears, treat it as a stop signal:
| Excuse | Required response |
|---|---|
| "I'll try the obvious fix first and investigate if it fails" | Stop. Return to reproduction and evidence gathering before changing code. |
| "I saw this bug before, so I know the cause" | Use history to narrow the search, then prove the current root cause anyway. |
| "Rollback fixed it, so we know what happened" | Containment is not explanation; identify why the bad behavior existed. |
| "The repro is flaky, but I can still patch around it" | Stabilize the failure boundary or instrument it before writing the fix. |
| "This is probably just environment weirdness" | Name the environment difference and gather evidence that it explains the symptom. |
| "I'll make the broad fix now and tighten it later" | Choose the smallest safe change for the evidenced root cause. |
| "This is the third try, but I think this version will work" | Stop local patching and prepare course-correction-note. |
| "The workaround is enough; we can find root cause later" | Record it explicitly as a workaround only after the current cause is still evidenced. |
| "The failing path is too edge-casey to justify tests" | Edge cases still need regression or characterization coverage. |
| "Another agent already investigated this" | Read the evidence yourself before accepting the conclusion. |
Anti-Patterns
- Fix-first thrashing -- changing code before the failure is reproducible or the cause is understood.
- Containment mistaken for root cause -- assuming rollback or flag disable explains why the bug exists.
- Historical anchoring -- reusing an old fix because the ticket title looks similar.
- Architecture patch cosplay -- applying a local workaround after multiple failed fixes even though the evidence points upstream.
- Regression amnesia -- repairing today's symptom without defining the test that would catch it next time.
Reference Material
For recurring debugging failure modes that create false confidence or repeated misroutes, see Gotchas.
Related Skills
- bug-history-retrieval -- retrieves canonical defect lineage before a new fix theory hardens
- incident-response -- contains live production impact before code-level debugging begins
- tdd -- turns the verified bug boundary into reproducing and regression tests
- feature-development -- implements the smallest safe repair once the test and scope boundary are clear
- verification-before-completion -- verifies the claimed fix with fresh evidence before completion claims
Distribution
- Public install surface:
skills/.curated - Canonical authoring source:
skills/04-implementation/systematic-debugging/SKILL.md - This package is exported for
npx skills add/updatecompatibility. - Packaging stability:
beta - Capability readiness:
beta
More from yknothing/prodcraft
system-design
Use when reviewed requirements or specifications are ready and the team must decide high-level architecture, component boundaries, integration seams, or brownfield coexistence strategy before API design, technology selection, or task planning.
6ci-cd
Use when a reviewed implementation slice needs an automated build, test, and deployment pipeline, especially when brownfield rollback, release-boundary checks, contract/integration gates, and staged delivery must be explicit before shipping.
6intake
The mandatory gateway for all new engineering work. Triage and route new products, apps, features, migrations, tech-debt, or any 'not sure where to start' request to the correct lifecycle path. Use before starting design or implementation. Do not use for ongoing tasks, specific debugging, or PR reviews.
6feature-development
Use when a reviewed task slice has tests or acceptance targets and the team must turn it into a small, mergeable implementation increment without expanding scope, breaking contracts, or hiding release-boundary risk.
6monitoring-observability
Use when a live service or newly delivered release needs actionable telemetry, dashboards, and alerts that expose real user-impactful boundaries, especially when brownfield coexistence rules, unsupported-flow safety, rollback health, or queue/backfill behavior must be visible before incidents escalate.
6incident-response
Use when a live production issue needs coordinated containment, severity triage, stakeholder communication, and evidence capture, especially when a recent release, brownfield coexistence rules, rollback decisions, or unresolved contract boundaries must be handled before root-cause work.
6