fault-diagnosis

Installation

SKILL.md

Fault Diagnosis

Overview

Guessing at fixes wastes time and introduces new defects. Quick patches mask underlying problems.

Core principle: ALWAYS identify root cause before attempting any fix. Treating symptoms is failure.

No exceptions. No workarounds. No shortcuts.

The Prime Directive

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you have not completed Phase 1, you are not authorized to propose fixes.

When to Use

Apply to ANY technical issue:

Test failures
Production bugs
Unexpected behavior
Performance degradation
Build failures
Integration breakdowns

Especially important when:

Under time pressure (urgency makes guessing tempting)
"Just one quick fix" seems obvious
You have already attempted multiple fixes
A previous fix did not resolve the issue
You do not fully understand the problem

Do not skip when:

The issue appears simple (simple bugs have root causes too)
You are in a hurry (systematic investigation is faster than flailing)
Someone wants it resolved NOW (methodical work is faster than thrashing)

The Four Phases

You MUST complete each phase before advancing to the next.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

Read Error Messages Thoroughly
- Do not skip past errors or warnings
- They frequently contain the exact answer
- Read stack traces completely
- Note line numbers, file paths, error codes
Reproduce Reliably
- Can you trigger it consistently?
- What are the exact reproduction steps?
- Does it happen every time?
- If not reproducible, gather more data -- do not guess
Examine Recent Changes
- What changed that could cause this?
- Git diff, recent commits
- New dependencies, configuration changes
- Environmental differences

Gather Evidence in Multi-Component Systems

WHEN the system has multiple components (CI -> build -> signing, API -> service -> database):

BEFORE proposing fixes, add diagnostic instrumentation:

For EACH component boundary:
  - Log what data enters the component
  - Log what data exits the component
  - Verify environment/config propagation
  - Check state at each layer

Run once to collect evidence showing WHERE it breaks
THEN analyze evidence to identify the failing component
THEN investigate that specific component

Example (multi-layer system):

# Layer 1: Orchestrator
echo "=== Orchestrator state: ==="
echo "TOKEN: ${TOKEN:+SET}${TOKEN:-UNSET}"

# Layer 2: Build script
echo "=== Build environment: ==="
env | grep TOKEN || echo "TOKEN not in environment"

# Layer 3: Signing module
echo "=== Certificate state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual operation
codesign --sign "$IDENTITY" --verbose=4 "$ARTIFACT"

This reveals: Which layer fails (secrets -> orchestrator OK, orchestrator -> build FAIL)

Trace Data Flow

WHEN the error is deep in the call stack:

See root-cause-tracing.md in this directory for the complete backward tracing method.

Short version:
- Where does the bad value originate?
- What called this function with the bad value?
- Keep tracing upward until you find the source
- Fix at the source, not at the symptom

Phase 2: Pattern Analysis

Find the pattern before fixing:

Locate Working Examples
- Find similar working code in the same codebase
- What works that resembles what is broken?
Compare Against References
- If implementing a pattern, read the reference implementation COMPLETELY
- Do not skim -- read every line
- Understand the pattern fully before applying
Identify Differences
- What differs between working and broken?
- List every difference, no matter how small
- Do not assume "that cannot matter"
Understand Dependencies
- What other components does this require?
- What settings, configuration, environment?
- What assumptions does it make?

Phase 3: Hypothesis and Testing

Scientific method:

Form a Single Hypothesis
- State clearly: "I believe X is the root cause because Y"
- Write it down
- Be specific, not vague
Test Minimally
- Make the SMALLEST possible change to test the hypothesis
- One variable at a time
- Do not fix multiple things simultaneously
Verify Before Continuing
- Did it work? Yes -> Phase 4
- Did not work? Form a NEW hypothesis
- DO NOT pile additional fixes on top
When You Do Not Know
- Say "I do not understand X"
- Do not pretend to know
- Ask for help
- Research further

Phase 4: Implementation

Fix the root cause, not the symptom:

Create a Failing Test Case
- Simplest possible reproduction
- Automated test if possible
- One-off test script if no framework available
- MUST exist before fixing
- Use the godmode:test-first skill for writing proper failing tests
Implement a Single Fix
- Address the root cause identified
- ONE change at a time
- No "while I'm here" improvements
- No bundled refactoring
Verify the Fix
- Test passes now?
- No other tests broken?
- Issue actually resolved?
If the Fix Does Not Work
- STOP
- Count: How many fixes have you attempted?
- If < 3: Return to Phase 1, re-analyze with new information
- If >= 3: STOP and question the architecture (step 5 below)
- DO NOT attempt fix #4 without architectural discussion
If 3+ Fixes Failed: Question Architecture

Pattern indicating an architectural problem:
- Each fix reveals new shared state/coupling/problems in different locations
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
STOP and question fundamentals:
- Is this pattern fundamentally sound?
- Are we persisting through sheer inertia?
- Should we refactor the architecture vs. continue fixing symptoms?
Discuss with your human partner before attempting more fixes

This is NOT a failed hypothesis -- this is a flawed architecture.

Guardrails - STOP and Follow Process

If you catch yourself thinking:

"Quick fix for now, investigate later"
"Just try changing X and see what happens"
"Apply multiple changes, run tests"
"Skip the test, I'll verify manually"
"It's probably X, let me fix that"
"I don't fully understand but this might work"
"Pattern says X but I'll adapt differently"
"Here are the main problems: [lists fixes without investigation]"
Proposing solutions before tracing data flow
"One more fix attempt" (when already tried 2+)
Each fix reveals new problems in different places

ALL of these mean: STOP. Return to Phase 1.

If 3+ fixes failed: Question the architecture (see Phase 4, step 5)

Human Partner Signals You Are Off Track

Watch for these redirections:

"Is that not happening?" - You assumed without verifying
"Will it show us...?" - You should have added evidence gathering
"Stop guessing" - You are proposing fixes without understanding
"Think deeper" - Question fundamentals, not just symptoms
"We're stuck?" (frustrated) - Your approach is not working

When you see these: STOP. Return to Phase 1.

Cognitive Traps

Rationalization	What Is Actually True
"Issue is simple, process not needed"	Simple issues have root causes too. The process is fast for simple bugs.
"Emergency, no time for process"	Systematic diagnosis is FASTER than guess-and-check flailing.
"Just try this first, then investigate"	The first fix sets the pattern. Do it right from the start.
"I'll write the test after confirming the fix works"	Untested fixes do not hold. Test first proves it.
"Multiple fixes at once saves time"	Cannot isolate what worked. Creates new bugs.
"Reference too long, I'll adapt the pattern"	Partial understanding guarantees bugs. Read it completely.
"I see the problem, let me fix it"	Seeing symptoms is not the same as understanding root cause.
"One more fix attempt" (after 2+ failures)	3+ failures = architectural problem. Question the pattern, do not fix again.

Quick Reference

Phase	Key Activities	Success Criteria
1. Root Cause	Read errors, reproduce, check changes, gather evidence	Understand WHAT and WHY
2. Pattern	Find working examples, compare	Identify differences
3. Hypothesis	Form theory, test minimally	Confirmed or new hypothesis
4. Implementation	Create test, fix, verify	Bug resolved, tests pass

When Investigation Reveals No Root Cause

If systematic investigation reveals the issue is truly environmental, timing-dependent, or external:

You have completed the process
Document what you investigated
Implement appropriate handling (retry, timeout, error message)
Add monitoring/logging for future investigation

But: 95% of "no root cause" cases are incomplete investigation.

Supporting Methods

These methods are part of fault diagnosis and available in this directory:

root-cause-tracing.md - Trace bugs backward through the call stack to find the original trigger
defense-in-depth.md - Add validation at multiple layers after finding root cause
condition-based-waiting.md - Replace arbitrary timeouts with condition polling

Related skills:

godmode:test-first - For creating failing test case (Phase 4, Step 1)
godmode:completion-gate - Verify fix worked before declaring success

Real-World Impact

From diagnosis sessions:

Systematic approach: 15-30 minutes to resolution
Random fix approach: 2-3 hours of flailing
First-attempt fix rate: 95% vs 40%
New bugs introduced: Near zero vs common

Related skills

More from noobygains/godmode

Installs

Repository

noobygains/godmode

GitHub Stars

First Seen

Apr 5, 2026

Security Audits

Gen Agent Trust HubPass

SocketPass

SnykPass