Debugging Skill

Provides comprehensive debugging capabilities with integrated extended thinking for complex scenarios.

When to Use This Skill

Activate this skill when working with:

  • Error troubleshooting
  • Log analysis
  • Performance debugging
  • Distributed system debugging
  • Memory and resource issues
  • Complex, multi-layered bugs requiring deep reasoning

Extended Thinking for Complex Debugging

When to Enable Extended Thinking

Use extended thinking (Claude's deeper reasoning mode) for debugging when:

  1. Root Cause Unknown: Multiple possible causes, unclear failure patterns
  2. Intermittent Issues: Race conditions, timing issues, non-deterministic failures
  3. Multi-System Failures: Distributed system bugs spanning multiple services
  4. Performance Mysteries: Unexpected slowdowns without obvious bottlenecks
  5. Complex State Issues: Bugs involving intricate state transitions or side effects
  6. Security Vulnerabilities: Subtle security issues requiring careful analysis

How to Activate Extended Thinking

# In your debugging prompt
Claude, please use extended thinking to help debug this issue:

[Describe the problem with symptoms, context, and what you've tried]

Extended thinking will provide:

  • Systematic hypothesis generation
  • Multi-path investigation strategies
  • Deeper pattern recognition
  • Cross-domain insights (e.g., network + application + infrastructure)

Hypothesis-Driven Debugging Framework

Use this structured approach for complex bugs:

1. Observation Phase

What happened?
- Error message/stack trace
- Frequency (always/intermittent)
- When it started
- Environmental context
- Recent changes

2. Hypothesis Generation

Generate 3-5 plausible hypotheses:

H1: [Most likely cause based on symptoms]
   Evidence for: [...]
   Evidence against: [...]
   Test: [How to validate/invalidate]

H2: [Alternative explanation]
   Evidence for: [...]
   Evidence against: [...]
   Test: [How to validate/invalidate]

H3: [Edge case or rare scenario]
   Evidence for: [...]
   Evidence against: [...]
   Test: [How to validate/invalidate]

3. Systematic Testing

Priority order (high to low confidence):
1. Test H1 → Result: [Pass/Fail/Inconclusive]
2. Test H2 → Result: [Pass/Fail/Inconclusive]
3. Test H3 → Result: [Pass/Fail/Inconclusive]

New evidence discovered:
- [Finding 1]
- [Finding 2]

Revised hypotheses if needed:
- [...]

4. Root Cause Identification

Confirmed root cause: [...]
Contributing factors: [...]
Why it wasn't caught earlier: [...]

5. Fix + Validation

Fix implemented: [...]
Tests added: [...]
Validation: [...]
Prevention: [...]
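The five phases above can be tracked in code as well as prose. A minimal sketch of a hypothesis record plus a triage helper, using only the standard library (the `Hypothesis` class and its field names are illustrative, not part of any framework):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One candidate explanation, carried through observe/test/analyze."""
    name: str
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)
    test: str = ""
    result: str = "Inconclusive"  # Pass / Fail / Inconclusive

def triage(hypotheses):
    """Order hypotheses for testing: strongest net evidence first."""
    return sorted(
        hypotheses,
        key=lambda h: len(h.evidence_for) - len(h.evidence_against),
        reverse=True,
    )

# Example entries (invented for illustration)
h1 = Hypothesis("Cache race condition",
                evidence_for=["intermittent", "same user IDs affected"],
                test="add logging around cache reads/writes")
h2 = Hypothesis("DB pool exhaustion",
                evidence_against=["pool metrics flat during failures"],
                test="load test while watching pool size")
ordered = triage([h1, h2])  # h1 first: net evidence +2 vs -1
```

Keeping the ranking explicit makes it cheap to re-sort as new evidence arrives during the testing phase.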

Structured Debugging Templates

Template 1: MECE Bug Analysis (Mutually Exclusive, Collectively Exhaustive)

## Bug: [Title]

### Problem Statement
- **What**: [Precise description]
- **Where**: [System/component]
- **When**: [Conditions/triggers]
- **Impact**: [Severity/scope]

### MECE Hypothesis Tree

**Layer 1: System Boundaries**
- [ ] Frontend issue
- [ ] Backend API issue
- [ ] Database issue
- [ ] Infrastructure/network issue
- [ ] External dependency issue

**Layer 2: Component-Specific** (based on Layer 1 finding)
- [ ] [Sub-component A]
- [ ] [Sub-component B]
- [ ] [Sub-component C]

**Layer 3: Code-Level** (based on Layer 2 finding)
- [ ] Logic error
- [ ] State management
- [ ] Resource handling
- [ ] Configuration

### Investigation Log
| Time | Action | Result | Next Step |
|------|--------|--------|-----------|
| [HH:MM] | [What you tested] | [Finding] | [Decision] |

### Root Cause
[Final determination with evidence]

### Fix
[Solution with rationale]
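The MECE tree above can also be narrowed mechanically: at each layer, keep only the branch the evidence confirms, then descend into it. A hedged sketch with the tree as nested dicts (all component names are invented for illustration):

```python
# MECE tree: each layer maps a candidate to its sub-candidates (None at leaves)
tree = {
    "Backend API issue": {
        "Auth middleware": {"Logic error": None, "Configuration": None},
        "Cache layer": {"State management": None, "Resource handling": None},
    },
    "Database issue": None,
}

def narrow(tree, confirm):
    """Walk the tree, descending only into branches `confirm` validates."""
    path = []
    node = tree
    while isinstance(node, dict):
        confirmed = [k for k in node if confirm(k)]
        if len(confirmed) != 1:
            break  # ambiguous or exhausted: stop and re-examine the evidence
        path.append(confirmed[0])
        node = node[confirmed[0]]
    return path

# Example: evidence implicates the cache layer's state handling
suspects = {"Backend API issue", "Cache layer", "State management"}
path = narrow(tree, lambda k: k in suspects)
# path == ["Backend API issue", "Cache layer", "State management"]
```

If more than one branch survives a layer, that layer was not mutually exclusive enough; split it further before descending.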

Template 2: 5 Whys Analysis

## Issue: [Brief description]

**Symptom**: [Observable problem]

**Why 1**: Why did this happen?
→ [Answer]

**Why 2**: Why did [answer from Why 1] occur?
→ [Answer]

**Why 3**: Why did [answer from Why 2] occur?
→ [Answer]

**Why 4**: Why did [answer from Why 3] occur?
→ [Answer]

**Why 5**: Why did [answer from Why 4] occur?
→ [Root cause]

**Fix**: [Addresses root cause]
**Prevention**: [Process/check to prevent recurrence]

Template 3: Timeline Reconstruction

## Incident Timeline: [Event]

**Goal**: Reconstruct exact sequence leading to failure

| Time | Event | System State | Evidence |
|------|-------|--------------|----------|
| T-5min | [Normal operation] | [State] | [Logs] |
| T-2min | [Trigger event] | [State change] | [Logs/metrics] |
| T-30s | [Cascade starts] | [Degraded] | [Alerts] |
| T-0 | [Failure] | [Failed state] | [Error logs] |
| T+5min | [Recovery action] | [Recovering] | [Actions taken] |

**Critical Path**: [Sequence of events that led to failure]
**Alternative Scenarios**: [What could have prevented it at each step]
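Reconstructing a table like this usually starts by merging timestamped entries from several logs into one ordered sequence. A minimal sketch assuming ISO-8601 timestamps (the log entries shown are invented):

```python
from datetime import datetime

def merge_timeline(*sources):
    """Merge (timestamp, system, event) tuples from several logs, ordered by time."""
    entries = [e for src in sources for e in src]
    return sorted(entries, key=lambda e: datetime.fromisoformat(e[0]))

api_log = [("2024-01-15T10:04:30", "api", "latency spike begins")]
db_log = [("2024-01-15T10:02:00", "db", "failover triggered"),
          ("2024-01-15T10:05:00", "db", "replica promoted")]
alerts = [("2024-01-15T10:04:55", "alerting", "error-rate page fired")]

for ts, system, event in merge_timeline(api_log, db_log, alerts):
    print(f"{ts}  [{system:8}] {event}")
# The db failover at 10:02 precedes the 10:04 API spike: a candidate trigger event
```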

Python Debugging Patterns

Hypothesis-Driven Python Debugging Example

```python
"""
Bug: API endpoint returns 500 error intermittently
Symptoms: 1 in 10 requests fail, always with the same user IDs
Hypothesis: Race condition in user data caching
"""
import logging

logging.basicConfig(level=logging.DEBUG)

# H1: Cache key collision between users
# Test: Add detailed logging around cache operations

def get_user(user_id):
    cache_key = f"user:{user_id}"
    logging.debug(f"Fetching cache key: {cache_key} for user {user_id}")

    cached = cache.get(cache_key)
    if cached:
        logging.debug(f"Cache hit: {cache_key} -> {cached}")
        return cached

    user = db.query(User).filter_by(id=user_id).first()
    logging.debug(f"DB fetch for user {user_id}: {user}")

    cache.set(cache_key, user, timeout=300)
    logging.debug(f"Cache set: {cache_key} -> {user}")

    return user

# Result: Discovered cache_key had a different format in different code paths
# Root cause: String formatting inconsistency (f"user:{id}" vs f"user_{id}")
```

Advanced Debugging with Context Managers

```python
import time
import logging
from contextlib import contextmanager

@contextmanager
def debug_timer(operation_name):
    """Time operations and log if slow."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        if duration > 1.0:  # Slow operation threshold
            logging.warning(
                f"{operation_name} took {duration:.2f}s",
                extra={'operation': operation_name, 'duration': duration}
            )

# Usage
with debug_timer("database_query"):
    results = db.query(User).filter(...).all()

@contextmanager
def hypothesis_test(hypothesis_name, expected_outcome):
    """Test and validate debugging hypotheses."""
    print(f"\n=== Testing: {hypothesis_name} ===")
    print(f"Expected: {expected_outcome}")
    start_state = capture_state()
    try:
        yield
    finally:
        end_state = capture_state()
        outcome = compare_states(start_state, end_state)
        print(f"Actual: {outcome}")
        print(f"Hypothesis {'CONFIRMED' if outcome == expected_outcome else 'REJECTED'}")

# Usage
with hypothesis_test(
    "H1: Database connection pool exhaustion",
    expected_outcome="pool_size increases during load"
):
    # Run load test
    for i in range(100):
        api_call()
```

pdb Debugger with Advanced Techniques

```python
# Basic breakpoint
import pdb; pdb.set_trace()

# Python 3.7+
breakpoint()

# Conditional breakpoint
if user_id == 12345:
    breakpoint()

# Post-mortem debugging (debug after crash)
import pdb
try:
    risky_function()
except Exception:
    pdb.post_mortem()

# Common pdb commands:
#   n(ext)     - Execute next line
#   s(tep)     - Step into function
#   c(ontinue) - Continue execution
#   p expr     - Print expression
#   pp expr    - Pretty-print expression
#   l(ist)     - Show source code
#   w(here)    - Show stack trace
#   u(p)       - Move up stack frame
#   d(own)     - Move down stack frame
#   b(reak)    - Set breakpoint
#   cl(ear)    - Clear breakpoint
#   q(uit)     - Quit debugger

# Advanced: programmatic debugging
import pdb
pdb.run('my_function()', globals(), locals())
```

Logging

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('debug.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

logger.debug("Debug message")
logger.info("Info message")
logger.warning("Warning message")
logger.error("Error message", exc_info=True)
```

Exception Handling

```python
import traceback

try:
    result = risky_operation()
except Exception as e:
    # Log full traceback
    logger.error(f"Operation failed: {e}")
    logger.error(traceback.format_exc())

    # Or get the traceback as a string
    tb = traceback.format_exception(type(e), e, e.__traceback__)
    error_details = ''.join(tb)
```

JavaScript/Node.js Debugging

Hypothesis-Driven JavaScript Debugging Example

```javascript
/**
 * Bug: Memory leak in websocket connections
 * Symptoms: Memory grows over time, eventually crashes
 * Hypothesis: Event listeners not cleaned up on disconnect
 */

// H1: Event listeners accumulating
// Test: Track listener counts
class WebSocketManager {
  constructor() {
    this.connections = new Map();
    this.debugListenerCounts = true;
  }

  addConnection(userId, socket) {
    console.debug(`[H1 Test] Adding connection for user ${userId}`);

    if (this.debugListenerCounts) {
      console.debug(`[H1] Listener count before: ${socket.listenerCount('message')}`);
    }

    socket.on('message', (data) => this.handleMessage(userId, data));
    socket.on('close', () => this.removeConnection(userId));

    if (this.debugListenerCounts) {
      console.debug(`[H1] Listener count after: ${socket.listenerCount('message')}`);
    }

    this.connections.set(userId, socket);
  }

  removeConnection(userId) {
    console.debug(`[H1 Test] Removing connection for user ${userId}`);

    const socket = this.connections.get(userId);
    if (socket) {
      const messageListenerCount = socket.listenerCount('message');
      console.debug(`[H1] Listeners still attached: ${messageListenerCount}`);

      // Result: Found 3+ listeners on same event!
      // Root cause: Not removing listeners on reconnect
      socket.removeAllListeners();
      this.connections.delete(userId);
    }
  }
}
```

Advanced Console Debugging

```javascript
// Basic logging
console.log('Basic log');
console.error('Error message');
console.warn('Warning');

// Object inspection with depth
console.dir(object, { depth: null, colors: true });
console.table(array);

// Performance timing
console.time('operation');
// ... code ...
console.timeEnd('operation');

// Memory usage (Chrome only)
console.memory;

// Stack trace
console.trace('Trace point');

// Grouping for organized logs
console.group('User Authentication Flow');
console.log('Step 1: Validate credentials');
console.log('Step 2: Generate token');
console.groupEnd();

// Conditional logging
const debug = (label, data) => {
  if (process.env.DEBUG) {
    console.log(`[DEBUG] ${label}:`, JSON.stringify(data, null, 2));
  }
};

// Hypothesis testing helper
function testHypothesis(name, test, expected) {
  console.group(`Testing: ${name}`);
  console.log(`Expected: ${expected}`);
  const actual = test();
  console.log(`Actual: ${actual}`);
  console.log(`Result: ${actual === expected ? 'PASS' : 'FAIL'}`);
  console.groupEnd();
  return actual === expected;
}

// Usage
testHypothesis(
  'H1: Cache returns stale data',
  () => cache.get('key').timestamp,
  Date.now()
);
```

Debugging Async/Promise Issues

```javascript
// Track promise chains
const debugPromise = (label, promise) => {
  console.log(`[${label}] Started`);
  return promise
    .then(result => {
      console.log(`[${label}] Resolved:`, result);
      return result;
    })
    .catch(error => {
      console.error(`[${label}] Rejected:`, error);
      throw error;
    });
};

// Usage
await debugPromise('DB Query', db.users.findOne({ id: 123 }));

// Debugging race conditions
async function debugRaceCondition() {
  const operations = [
    { name: 'Op1', fn: async () => { await delay(100); return 'A'; } },
    { name: 'Op2', fn: async () => { await delay(50); return 'B'; } },
    { name: 'Op3', fn: async () => { await delay(150); return 'C'; } }
  ];

  const results = await Promise.allSettled(
    operations.map(async op => {
      const start = Date.now();
      const result = await op.fn();
      const duration = Date.now() - start;
      console.log(`${op.name} completed in ${duration}ms: ${result}`);
      return { op: op.name, result, duration };
    })
  );

  console.table(results.map(r => r.value));
}

// Debugging memory leaks with weak references
class DebugMemoryLeaks {
  constructor() {
    this.weakMap = new WeakMap();
    this.strongRefs = new Map();
  }

  trackObject(id, obj) {
    // Weak reference - will be GC'd if no other references
    this.weakMap.set(obj, { id, created: Date.now() });

    // Strong reference - prevents GC (potential leak source)
    this.strongRefs.set(id, obj);

    console.log(`Tracking ${id}: Strong refs=${this.strongRefs.size}`);
  }

  release(id) {
    this.strongRefs.delete(id);
    console.log(`Released ${id}: Strong refs=${this.strongRefs.size}`);
  }

  checkLeaks() {
    console.log(`Potential leaks: ${this.strongRefs.size} strong references`);
    return Array.from(this.strongRefs.keys());
  }
}
```

Node.js Inspector

```bash
# Start with inspector
node --inspect app.js
node --inspect-brk app.js  # Break on first line

# Debug with Chrome DevTools: open chrome://inspect
```

VS Code Debug Configuration

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Debug Agent",
      "program": "${workspaceFolder}/src/index.js",
      "env": { "NODE_ENV": "development" }
    }
  ]
}
```

Container Debugging

Docker

```bash
# View logs
docker logs --tail=100 -f <container>

# Execute shell
docker exec -it <container> /bin/sh

# Inspect container
docker inspect <container>

# Resource usage
docker stats

# Debug a running container's network from a tooling image
docker run -it --rm \
  --network=container:<container> \
  nicolaka/netshoot
```

Kubernetes

```bash
# Pod logs
kubectl logs <pod> -n agents -f
kubectl logs <pod> -n agents --previous  # Previous crash

# Execute in pod
kubectl exec -it <pod> -n agents -- /bin/sh

# Debug with an ephemeral container
kubectl debug <pod> -n agents -it --image=busybox

# Port forward for local debugging
kubectl port-forward <pod> 8080:8080 -n agents

# Events
kubectl get events -n agents --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n agents
```

Log Analysis

Pattern Matching

```bash
# Search logs for errors (-E enables the alternation pattern)
grep -iE "error|exception|failed" app.log

# Count occurrences
grep -c "ERROR" app.log

# Context around matches
grep -B 5 -A 5 "OutOfMemory" app.log

# Filter by time range
awk '/2024-01-15 10:00/,/2024-01-15 11:00/' app.log
```

JSON Logs

```bash
# Parse JSON logs with jq
cat app.log | jq 'select(.level == "error")'
cat app.log | jq 'select(.timestamp > "2024-01-15T10:00:00")'

# Extract specific fields
cat app.log | jq -r '[.timestamp, .level, .message] | @tsv'
```

Performance Debugging

Python Profiling

```python
# cProfile
import cProfile
cProfile.run('main()', 'output.prof')

# Line profiler (requires the line_profiler package; run via kernprof)
@profile
def slow_function():
    pass

# Memory profiler
from memory_profiler import profile

@profile
def memory_heavy():
    pass
```
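`pstats` from the standard library ranks the profile results, whether they come from the `output.prof` dump or an in-memory `Profile` object. A small self-contained sketch using the in-memory form (`busy` is a stand-in for real workload):

```python
import cProfile
import io
import pstats

def busy():
    # Toy workload so the profiler has something to measure
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Rank functions by cumulative time and print the top 5
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

Sorting by `'cumulative'` surfaces expensive call trees; `'tottime'` instead highlights functions that are slow in their own body.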

Network Debugging

```bash
# Check connectivity
ping <host>
telnet <host> <port>
nc -zv <host> <port>

# DNS resolution
nslookup <domain>
dig <domain>

# HTTP debugging
curl -v http://localhost:8080/health
curl -X POST -d '{"test": true}' \
  -H "Content-Type: application/json" \
  http://localhost:8080/api
```

Common Debug Checklist

  1. Check Logs: Application, system, container logs
  2. Verify Configuration: Environment variables, config files
  3. Test Connectivity: Network, database, external services
  4. Check Resources: CPU, memory, disk space
  5. Review Recent Changes: Git log, deployment history
  6. Reproduce Locally: Same environment, same data
  7. Binary Search: Isolate the problem scope
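Step 7's binary search is the same idea `git bisect` automates: repeatedly test the midpoint of the suspect range. A minimal sketch over an ordered list of changes, where `is_bad(i)` stands in for "check out change i and run the failing test" (all names are illustrative):

```python
def first_bad(changes, is_bad):
    """Binary search for the first change where the bug appears.

    Assumes the bug is absent at the start of the range, present at
    the end, and never disappears once introduced (monotonic).
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # bug already present: look earlier
        else:
            lo = mid + 1    # still good: look later
    return changes[lo]

# Example: a regression that landed at (hypothetical) commit "c6"
commits = [f"c{i}" for i in range(10)]
culprit = first_bad(commits, lambda i: i >= 6)
# culprit == "c6", found in about log2(10) ≈ 4 test runs instead of 10
```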

Debugging Decision Tree

Use this decision tree to determine the right debugging approach:

START: What kind of bug?
├─ Known error message/stack trace
│  └─ Use: Direct log analysis + Stack trace walkthrough
├─ Intermittent/Race condition
│  └─ Use: Extended thinking + Timeline reconstruction + Hypothesis-driven
├─ Performance degradation
│  └─ Use: Profiling + Hypothesis-driven + MECE analysis
├─ Distributed system failure
│  └─ Use: Extended thinking + Timeline reconstruction + Multi-system tracing
├─ Complex state bug
│  └─ Use: Extended thinking + Hypothesis-driven + pdb/debugger
├─ Memory leak
│  └─ Use: Memory profiling + Hypothesis-driven + Weak reference analysis
└─ Unknown root cause
   └─ Use: Extended thinking + MECE analysis + 5 Whys

Best Practices for Complex Debugging

1. Document Your Investigation

Always maintain a debugging log:

## Bug Investigation: [Title]
**Start Time**: 2024-01-15 10:00
**Investigator**: [Name]

### Timeline
- 10:00 - Started investigation, checked logs
- 10:15 - Found error pattern in auth service
- 10:30 - Hypothesis: Cache expiration race condition
- 10:45 - Added debug logging, confirmed hypothesis
- 11:00 - Implemented fix, testing

### Hypotheses Tested
- [x] H1: Cache race condition (CONFIRMED)
- [ ] H2: Database connection pool (REJECTED)
- [ ] H3: Network timeout (NOT TESTED)

### Root Cause
[Final determination]

### Fix Applied
[Solution details]

### Prevention
[How to prevent recurrence]

2. Use the Scientific Method

  1. Observe: Gather symptoms, error messages, logs
  2. Hypothesize: Generate 3-5 plausible explanations
  3. Predict: What would you see if the hypothesis were true?
  4. Test: Design experiments to validate/invalidate
  5. Analyze: Compare predictions vs actual results
  6. Conclude: Confirm root cause with evidence

3. Leverage Extended Thinking

When to activate extended thinking:

  • Complexity threshold: More than 3 interacting systems
  • Uncertainty high: Multiple equally plausible causes
  • Stakes high: Production outage, security issue, data loss
  • Pattern unclear: No obvious error messages or logs
  • Time-sensitive: Need systematic approach under pressure

4. Avoid Common Pitfalls

AVOID:
- ❌ Changing multiple things at once (can't isolate cause)
- ❌ Assuming first hypothesis is correct (confirmation bias)
- ❌ Debugging without logs/evidence (guessing)
- ❌ Not documenting what you tried (repeating failed attempts)
- ❌ Skipping reproduction step (fix might not work)

DO:
- ✅ Change one variable at a time
- ✅ Test multiple hypotheses systematically
- ✅ Add instrumentation before debugging
- ✅ Keep investigation log
- ✅ Write regression test after fix

5. Debugging Instrumentation Patterns

```python
# Python: Comprehensive debugging decorator
import functools
import time
import logging

logger = logging.getLogger(__name__)

def debug_trace(func):
    """Decorator to trace function execution with timing and state."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        func_name = func.__qualname__
        logger.debug(f"→ Entering {func_name}")
        logger.debug(f"  Args: {args}")
        logger.debug(f"  Kwargs: {kwargs}")

        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            duration = time.perf_counter() - start
            logger.debug(f"← Exiting {func_name} ({duration:.3f}s)")
            logger.debug(f"  Result: {result}")
            return result
        except Exception as e:
            duration = time.perf_counter() - start
            logger.error(f"✗ Exception in {func_name} ({duration:.3f}s): {e}")
            raise

    return wrapper

# Usage
@debug_trace
def complex_operation(user_id, data):
    # Your code here
    pass
```
```javascript
// JavaScript: Comprehensive debugging wrapper
function debugTrace(label) {
  return function(target, propertyKey, descriptor) {
    const originalMethod = descriptor.value;

    descriptor.value = async function(...args) {
      console.log(`→ Entering ${label || propertyKey}`);
      console.log(`  Args:`, args);

      const start = performance.now();
      try {
        const result = await originalMethod.apply(this, args);
        const duration = performance.now() - start;
        console.log(`← Exiting ${label || propertyKey} (${duration.toFixed(2)}ms)`);
        console.log(`  Result:`, result);
        return result;
      } catch (error) {
        const duration = performance.now() - start;
        console.error(`✗ Exception in ${label || propertyKey} (${duration.toFixed(2)}ms):`, error);
        throw error;
      }
    };

    return descriptor;
  };
}

// Usage (decorator syntax requires a transpiler such as TypeScript or Babel)
class UserService {
  @debugTrace('UserService.getUser')
  async getUser(userId) {
    // Your code here
  }
}
```

Cross-References and Related Skills

Related Skills

This debugging skill integrates with:

  1. extended-thinking (.claude/skills/extended-thinking/SKILL.md)

    • Use for: Complex bugs with unknown root causes
    • Activation: Add "use extended thinking" to your debugging prompt
    • Benefit: Deeper pattern recognition, systematic hypothesis generation
  2. complex-reasoning (.claude/skills/complex-reasoning/SKILL.md)

    • Use for: Multi-step debugging requiring logical chains
    • Patterns: Chain-of-thought, tree-of-thought for bug investigation
    • Benefit: Structured reasoning through complex bug scenarios
  3. deep-analysis (.claude/skills/deep-analysis/SKILL.md)

    • Use for: Post-mortem analysis, root cause investigation
    • Patterns: Comprehensive code review, architectural analysis
    • Benefit: Identifies systemic issues beyond surface bugs
  4. testing (.claude/skills/testing/SKILL.md)

    • Use for: Writing regression tests after bug fix
    • Integration: Bug → Debug → Fix → Test → Validate
    • Benefit: Ensures bug doesn't recur
  5. kubernetes (.claude/skills/kubernetes/SKILL.md)

    • Use for: Distributed system debugging in K8s
    • Tools: kubectl logs, exec, debug, events
    • Integration: Container debugging patterns

When to Combine Skills

| Scenario | Skills to Combine | Reasoning |
|----------|-------------------|-----------|
| Production outage | debugging + extended-thinking + kubernetes | Complex distributed system requires deep reasoning |
| Intermittent test failure | debugging + testing + complex-reasoning | Need systematic hypothesis testing |
| Performance regression | debugging + deep-analysis | Root cause may be architectural |
| Security vulnerability | debugging + extended-thinking + deep-analysis | Requires careful, thorough analysis |
| Memory leak | debugging + complex-reasoning | Multi-step investigation needed |

Integration Examples

Example 1: Complex Production Bug

# Prompt combining skills
Claude, I have a complex production bug affecting multiple services.
Please use extended thinking and the debugging skill to help investigate.

Symptoms:
- API requests timeout intermittently (1 in 50 requests)
- Only affects authenticated users
- Started after recent deployment
- No obvious errors in logs

Please use:
1. MECE analysis to categorize possible causes
2. Hypothesis-driven debugging framework
3. Timeline reconstruction of recent changes

Example 2: Memory Leak Investigation

# Prompt combining skills
Claude, use complex reasoning and debugging skills to investigate a memory leak.

Context:
- Node.js service memory grows from 200MB to 2GB over 6 hours
- No errors logged
- Happens only in production, not staging

Apply:
1. Hypothesis-driven framework (generate 5 hypotheses)
2. Memory leak detection patterns (weak references)
3. Extended thinking for pattern recognition across codebase

Quick Reference Card

Debugging Workflow Summary

1. OBSERVE
   - Collect error messages, logs, metrics
   - Identify patterns (frequency, conditions, scope)
   - Document symptoms

2. HYPOTHESIZE (use extended thinking if complex)
   - Generate 3-5 plausible hypotheses
   - Rank by likelihood
   - Design tests for each

3. TEST
   - Change one variable at a time
   - Add instrumentation (logging, tracing)
   - Collect evidence

4. ANALYZE
   - Compare predictions vs results
   - Eliminate invalidated hypotheses
   - Refine remaining hypotheses

5. FIX
   - Implement solution
   - Add regression test
   - Document root cause

6. VALIDATE
   - Verify fix in affected environment
   - Monitor metrics
   - Update documentation

Tool Selection Guide

| Problem Type | Primary Tool | Secondary Tools |
|--------------|--------------|-----------------|
| Logic error | pdb/debugger | Logging, unit tests |
| Performance | Profiler | Hypothesis testing, metrics |
| Memory leak | Memory profiler | Weak references, heap dumps |
| Async/timing | Timeline reconstruction | Extended thinking, logging |
| Distributed | Tracing (logs) | Kubernetes tools, MECE analysis |
| Unknown cause | Extended thinking | MECE, 5 Whys, hypothesis-driven |

Skill version: 2.0 (Enhanced with extended thinking integration)
Last updated: 2024-01-15
Maintained by: Golden Armada AI Agent Fleet
