
Multi-AI Debugging

Overview

multi-ai-debugging provides systematic debugging workflows that use multiple AI models as specialized agents. It is based on 2024-2025 best practices for AI-assisted debugging with multi-agent architectures.

Purpose: Systematic root cause analysis and fix generation using AI ensemble

Pattern: Task-based (6 independent debugging operations)

Key Principles (validated by tri-AI research):

  1. Multi-Agent Council - Specialized agents debate root causes before consensus
  2. Evaluator/Critic Loops - Fix agent + critic agent verify solutions
  3. Trace-Aware Analysis - Full execution context, not just error messages
  4. Semantic Log Analysis - LLM understanding beyond regex matching
  5. Cross-Stack Correlation - Connect frontend, backend, infra issues
  6. Auto-Remediation - Self-healing patterns where safe

Quality Targets:

  • Root Cause Identification: >80% accuracy
  • Time to Diagnosis: <30 minutes for common issues
  • Fix Generation Success: >60% for known patterns
  • False Positive Rate: <20% on error classification

When to Use

Use multi-ai-debugging when:

  • Debugging production incidents
  • Analyzing error logs and stack traces
  • Performing root cause analysis (RCA)
  • Troubleshooting complex multi-service systems
  • Investigating performance degradation
  • Understanding cascading failures
  • Writing post-mortem reports

When NOT to Use:

  • Simple syntax errors (IDE handles these)
  • Clear compile-time errors
  • Well-documented known issues

Prerequisites

Required

  • Error information (logs, stack trace, error message)
  • Access to relevant code

Recommended

  • Execution traces (if available)
  • System metrics/observability data
  • Recent change history (git log)
  • Gemini CLI for web research
  • Codex CLI for deep code analysis

Integration

  • OpenTelemetry traces (ideal)
  • Log aggregation (CloudWatch, Datadog, etc.)
  • APM tools (optional)

Operations

Operation 1: Quick Error Diagnosis

Time: 2-5 minutes
Automation: 70%
Purpose: Fast initial diagnosis for common errors

Process:

  1. Analyze Error:
Diagnose this error:

Error: [PASTE ERROR MESSAGE]
Stack trace: [PASTE STACK TRACE]

Provide:
1. What type of error is this?
2. Most likely root cause (1-2 sentences)
3. Immediate fix suggestion
4. Prevention recommendation
  2. Verify with Gemini (for web research):
gemini -p "Search for solutions to this error:
[ERROR MESSAGE]

Find:
- Common causes
- Stack Overflow solutions
- GitHub issues with fixes"
  3. Output: Quick diagnosis with fix suggestion (both steps can be scripted; see the sketch below)
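A minimal sketch of scripting this operation end to end, assuming the `gemini` CLI from the prerequisites is on PATH and accepts `-p <prompt>` as shown in step 2; `diagnose_error` is an illustrative helper, not part of the skill:

```python
# Hypothetical helper: builds the step-1 diagnosis prompt and shells out
# to the gemini CLI (step 2) for a web-researched second opinion.
import subprocess

DIAGNOSIS_PROMPT = """Diagnose this error:

Error: {error}
Stack trace: {stack_trace}

Provide:
1. What type of error is this?
2. Most likely root cause (1-2 sentences)
3. Immediate fix suggestion
4. Prevention recommendation"""

def diagnose_error(error: str, stack_trace: str) -> str:
    prompt = DIAGNOSIS_PROMPT.format(error=error, stack_trace=stack_trace)
    result = subprocess.run(
        ["gemini", "-p", prompt],  # mirrors: gemini -p "..."
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```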

Operation 2: Root Cause Analysis (RCA)

Time: 15-45 minutes
Automation: 50%
Purpose: Deep root cause analysis for complex issues

Process:

Step 1: Gather Context (Context Agent)

# Recent changes
git log --oneline -20
git diff HEAD~5..HEAD --stat

# Related logs
grep -r "ERROR\|Exception\|WARN" logs/ | tail -100

# System state
# Check for relevant metrics, traces, etc.
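
These commands can be wrapped in a small helper so the Analysis Agent receives one structured context object. A sketch; the commands and grep pattern mirror the shell block above, everything else is illustrative:

```python
# Hypothetical Context Agent helper: runs the shell commands above and
# bundles their output for the hypothesis-generation prompt.
import subprocess

def _run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def gather_context(log_dir: str = "logs/") -> dict:
    errors = _run(["grep", "-rE", "ERROR|Exception|WARN", log_dir])
    return {
        "recent_commits": _run(["git", "log", "--oneline", "-20"]),
        "diff_stat": _run(["git", "diff", "HEAD~5..HEAD", "--stat"]),
        # keep only the last 100 matches, like `| tail -100`
        "error_logs": "\n".join(errors.splitlines()[-100:]),
    }
```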

Step 2: Hypothesis Generation (Analysis Agent)

Perform root cause analysis:

**Error/Symptom**: [DESCRIPTION]

**Context**:
- Recent changes: [GIT LOG]
- Logs: [RELEVANT LOG ENTRIES]
- System state: [METRICS/OBSERVATIONS]
- When started: [TIMESTAMP]
- Affected scope: [USERS/SERVICES]

**Tasks**:
1. List 3-5 probable root causes ranked by likelihood
2. For each hypothesis:
   - Evidence supporting it
   - Evidence against it
   - Confidence level (High/Medium/Low)
3. Recommend investigation steps for top hypothesis

Step 3: Cross-Validate (Verification Agent)

Verify this root cause hypothesis:

Hypothesis: [TOP HYPOTHESIS]
Evidence: [SUPPORTING EVIDENCE]

Tasks:
1. What would we expect to see if this is correct?
2. What would disprove this hypothesis?
3. Design a reproduction test
4. Confidence assessment (0-100)

Step 4: Generate RCA Report

## Root Cause Analysis Report

**Incident**: [DESCRIPTION]
**Date**: [DATE]
**Duration**: [DURATION]
**Impact**: [USERS/SERVICES AFFECTED]

### Timeline
- HH:MM - First error observed
- HH:MM - Investigation began
- HH:MM - Root cause identified
- HH:MM - Fix deployed

### Root Cause
[DETAILED EXPLANATION]

### Contributing Factors
1. [FACTOR 1]
2. [FACTOR 2]

### Resolution
[FIX APPLIED]

### Prevention
1. [ACTION ITEM 1]
2. [ACTION ITEM 2]

Operation 3: Log Analysis & Classification

Time: 5-15 minutes
Automation: 80%
Purpose: Analyze and classify error logs

Process:

  1. Cluster Log Patterns:
Analyze these log entries:

[PASTE 50-100 LOG LINES]

Tasks:
1. Identify unique error patterns (cluster similar errors)
2. Classify each pattern:
   - Type: (Bug/Config/Network/Resource/Security/User Error)
   - Severity: (Critical/High/Medium/Low)
   - Impact: (Data Loss/Service Down/Degraded/Minor)
3. Count occurrences per pattern
4. Identify the root pattern (original error vs cascading)
5. Recommend priority order for investigation
  2. Semantic Analysis:
Perform semantic analysis on these logs:

[LOG ENTRIES]

Looking for:
- Anomalies in timing/sequence
- Correlation between events
- Hidden dependencies
- Patterns a human might miss
  3. Output: Classified and prioritized error report (a local pre-clustering sketch follows below)
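As a cheap first pass before the semantic prompt, log lines can be pre-clustered locally by normalizing volatile tokens. A sketch (the regexes are an assumption; production log-template mining usually uses a dedicated parser such as Drain):

```python
# Collapse similar log lines into patterns by masking numbers and hex
# ids, then rank patterns by occurrence count (step 1, task 3 above).
import re
from collections import Counter

def normalize(line: str) -> str:
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # pointers, request ids
    return re.sub(r"\d+", "<NUM>", line).strip()     # ports, counts, offsets

def cluster_log_patterns(lines: list[str]) -> list[tuple[str, int]]:
    counts = Counter(normalize(l) for l in lines if l.strip())
    return counts.most_common()
```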

Operation 4: Multi-Agent Council Debugging

Time: 20-60 minutes
Automation: 40%
Purpose: Complex issues requiring multiple perspectives

Process:

Launch Parallel Agents:

Launch 4 debugging agents for this issue:

Issue: [DESCRIPTION]
Code: [RELEVANT CODE]
Logs: [RELEVANT LOGS]

Agent 1 (Code Reviewer):
"Analyze the code for bugs. Focus on:
- Logic errors
- Edge cases
- Race conditions
- Resource leaks"

Agent 2 (Log Analyzer):
"Analyze the logs for clues. Focus on:
- Error sequences
- Timing patterns
- State changes
- External dependencies"

Agent 3 (System Analyst):
"Analyze system context. Focus on:
- Resource constraints
- Configuration issues
- Dependency problems
- Infrastructure state"

Agent 4 (Historical Analyst):
"Analyze history. Focus on:
- Recent changes that could cause this
- Similar past incidents
- Regression indicators
- Pattern matching to known issues"

Council Deliberation:

Synthesize findings from all debugging agents:

Agent 1 (Code): [FINDINGS]
Agent 2 (Logs): [FINDINGS]
Agent 3 (System): [FINDINGS]
Agent 4 (History): [FINDINGS]

Tasks:
1. Find consensus root cause (where 2+ agents agree)
2. Resolve conflicting hypotheses
3. Combine evidence for strongest theory
4. Rate overall confidence (0-100)
5. Propose fix with verification steps
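
Orchestration-wise, the four agents can run concurrently and their findings feed the deliberation prompt. A sketch, with `ask_agent` left as a stub for whatever model call each role uses:

```python
# Run the council agents in parallel and collect findings keyed by role.
from concurrent.futures import ThreadPoolExecutor

AGENT_PROMPTS = {
    "code_reviewer": "Analyze the code for bugs: logic errors, edge cases, races, leaks.",
    "log_analyzer": "Analyze the logs: error sequences, timing, state changes, dependencies.",
    "system_analyst": "Analyze system context: resources, config, dependencies, infra.",
    "historical_analyst": "Analyze history: recent changes, past incidents, regressions.",
}

def ask_agent(role: str, prompt: str, issue: str) -> str:
    raise NotImplementedError("wire up the model call for this role")

def run_council(issue: str) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=len(AGENT_PROMPTS)) as pool:
        futures = {
            role: pool.submit(ask_agent, role, prompt, issue)
            for role, prompt in AGENT_PROMPTS.items()
        }
        return {role: f.result() for role, f in futures.items()}
```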

Operation 5: Auto-Fix Generation

Time: 10-30 minutes
Automation: 60%
Purpose: Generate and verify fixes

Process:

Step 1: Generate Fix (Fixer Agent)

Generate a fix for this issue:

Issue: [ROOT CAUSE]
Code: [AFFECTED CODE]

Requirements:
1. Minimal change (fix only the issue)
2. Include error handling
3. Add comments explaining the fix
4. Suggest test cases to verify

Output format:
- File: [path]
- Before: [original code]
- After: [fixed code]
- Explanation: [why this fixes it]

Step 2: Critique Fix (Critic Agent)

Critique this proposed fix:

Issue: [ORIGINAL ISSUE]
Proposed Fix: [FIX CODE]

Evaluate:
1. Does it actually fix the root cause?
2. Could it introduce new bugs?
3. Edge cases not handled?
4. Performance implications?
5. Security implications?

Verdict: APPROVE / NEEDS_REVISION / REJECT

Step 3: Generate Regression Test

Generate a regression test for this fix:

Original Bug: [DESCRIPTION]
Fix Applied: [FIX CODE]

Create test that:
1. Would have caught the original bug
2. Verifies the fix works
3. Tests edge cases
4. Can run in CI/CD
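
The evaluator/critic loop from the key principles ties these three steps together: regenerate until the critic approves or a retry cap is hit. A sketch, with both model calls stubbed:

```python
# Fixer/critic loop: Step 1 generates, Step 2 critiques, and NEEDS_REVISION
# feedback is fed back into the next generation attempt.
def generate_fix(issue: str, code: str, feedback: str = "") -> str:
    raise NotImplementedError("Step 1 prompt goes here")

def critique_fix(issue: str, fix: str) -> str:
    raise NotImplementedError("Step 2 prompt goes here")

def fix_with_critic(issue: str, code: str, max_rounds: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_rounds):
        fix = generate_fix(issue, code, feedback)
        verdict = critique_fix(issue, fix)
        if "APPROVE" in verdict:
            return fix            # ready for the regression test in Step 3
        if "REJECT" in verdict:
            return None           # wrong approach; escalate to a human
        feedback = verdict        # NEEDS_REVISION: loop with the critique
    return None
```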

Operation 6: Self-Healing Patterns

Time: Variable
Automation: 90%
Purpose: Automated detection and remediation

Process:

Define Remediation Playbooks:

# Example: Auto-remediation patterns
PLAYBOOKS = {
    "disk_space_low": {
        "detection": "disk_usage > 90%",
        "actions": [
            "compress_old_logs",
            "clear_temp_files",
            "alert_if_still_high"
        ]
    },
    "memory_leak_detected": {
        "detection": "memory_growth > 10%/hour",
        "actions": [
            "capture_heap_dump",
            "graceful_restart",
            "alert_team"
        ]
    },
    "error_rate_spike": {
        "detection": "error_rate > 5%",
        "actions": [
            "check_recent_deploys",
            "consider_rollback",
            "alert_on_call"
        ]
    }
}
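
A dispatcher can then walk a triggered playbook's action list in order. A minimal sketch; the `ACTIONS` registry and its lambdas are illustrative stand-ins for real remediation handlers:

```python
# Hypothetical action registry mapping playbook action names to callables.
ACTIONS = {
    "compress_old_logs": lambda: print("compressing old logs"),
    "clear_temp_files": lambda: print("clearing temp files"),
    "alert_if_still_high": lambda: print("alert: disk usage still high"),
}

def remediate(playbook_name: str) -> None:
    """Run a triggered playbook's actions in the order listed."""
    for action in PLAYBOOKS[playbook_name]["actions"]:
        handler = ACTIONS.get(action)
        if handler is None:
            print(f"no handler for {action!r}; paging a human")
            continue
        handler()
```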

Configure Circuit Breakers:

# Intelligent circuit breaking
class AICircuitBreaker:
    def should_open(self, metrics):
        """AI predicts cascading failure risk."""
        prompt = f"""
        Given these metrics:
        - Error rate: {metrics['error_rate']}
        - Latency p99: {metrics['latency_p99']}
        - Dependencies health: {metrics['deps']}

        Should we open the circuit breaker?
        Risk of cascade: (Low/Medium/High)
        Recommendation: (OPEN/CLOSED/HALF_OPEN)
        """
        # analyze() is a placeholder for an LLM call that returns the
        # parsed recommendation; wire it to your model of choice.
        return analyze(prompt)

Multi-AI Coordination

Agent Assignment Strategy

| Task | Primary | Verification | Strength |
|------|---------|--------------|----------|
| Log analysis | Gemini | Claude | Fast, large context |
| Code analysis | Claude | Codex | Deep understanding |
| Root cause | Claude | Gemini | Reasoning + search |
| Fix generation | Claude | Codex | Code + review |
| Research | Gemini | Claude | Web search |
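
The table translates directly into a routing map for orchestration code. A sketch mirroring the rows above:

```python
# (primary, verifier) model per debugging task, straight from the table.
MODEL_ROUTING: dict[str, tuple[str, str]] = {
    "log_analysis": ("gemini", "claude"),
    "code_analysis": ("claude", "codex"),
    "root_cause": ("claude", "gemini"),
    "fix_generation": ("claude", "codex"),
    "research": ("gemini", "claude"),
}

def route(task: str) -> tuple[str, str]:
    return MODEL_ROUTING[task]
```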

Coordination Commands

Gemini for Log Search:

gemini -p "Analyze these logs and identify anomalies:
[LOGS]"

Claude for Root Cause:

Given this debugging context, what's the root cause?
[CONTEXT]

Codex for Fix Validation:

codex "Review this fix for correctness and edge cases:
[FIX]"

Decision Trees

Error Type Classification

Error Type Decision Tree:

1. Is there a stack trace?
   ├── Yes → Go to Code Error Analysis
   └── No → Go to System Error Analysis

2. Code Error Analysis:
   ├── NullPointer/TypeError → Missing null check
   ├── IndexOutOfBounds → Boundary condition
   ├── Timeout → Resource/network issue
   ├── Permission denied → Auth/authz issue
   └── Unknown → Multi-agent analysis

3. System Error Analysis:
   ├── Connection refused → Service down
   ├── Disk full → Resource exhaustion
   ├── Out of memory → Memory leak/sizing
   ├── CPU spike → Performance issue
   └── Unknown → Multi-agent analysis
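
The code-error branch reduces to a lookup from exception type to likely cause. A sketch using Python exception names (the tree mixes Java and Python names; map per language):

```python
# Map exception class names to the likely causes in branch 2 above.
CODE_ERROR_CAUSES = {
    "TypeError": "missing null/None check",
    "IndexError": "boundary condition",
    "TimeoutError": "resource or network issue",
    "PermissionError": "auth/authz issue",
}

def classify_exception(exc: BaseException) -> str:
    name = type(exc).__name__
    return CODE_ERROR_CAUSES.get(name, "unknown: escalate to multi-agent analysis")
```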

Severity Assessment

Severity Decision:

CRITICAL (P1):
- Data loss occurring
- Security breach active
- Service completely down
- Revenue impact immediate

HIGH (P2):
- Service degraded significantly
- Errors affecting >10% users
- Potential data integrity issues

MEDIUM (P3):
- Errors affecting <10% users
- Workaround available
- Non-critical feature broken

LOW (P4):
- Cosmetic issues
- Edge case errors
- No user impact
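
The rubric can run as a guardrail before paging anyone. A sketch; the boolean and percentage inputs are illustrative simplifications of the criteria above:

```python
def assess_severity(data_loss: bool, service_down: bool,
                    pct_users_affected: float) -> str:
    """Map incident facts to the P1-P4 rubric above."""
    if data_loss or service_down:
        return "P1-CRITICAL"
    if pct_users_affected > 10:
        return "P2-HIGH"
    if pct_users_affected > 0:
        return "P3-MEDIUM"
    return "P4-LOW"
```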

Integration with Observability

OpenTelemetry Pattern

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("debug_session")
def debug_with_tracing(error):
    """Debug with full trace context."""
    span = trace.get_current_span()

    # Add debugging context to trace
    span.set_attribute("error.type", type(error).__name__)
    span.set_attribute("error.message", str(error))

    # Gather trace context for AI. get_parent_spans(), get_correlated_logs(),
    # and multi_agent_debug() are project-specific helpers, not OpenTelemetry APIs.
    context = {
        "trace_id": format(span.get_span_context().trace_id, "032x"),  # 32-char hex form
        "parent_spans": get_parent_spans(),
        "related_logs": get_correlated_logs()
    }

    return multi_agent_debug(context)

Log Correlation

def correlate_logs_for_debugging(trace_id, time_window="5m"):
    """Gather all logs related to an incident."""
    # query_logs() and analyze_logs_with_ai() are placeholders for your
    # log backend's query client and the AI analysis call, respectively.
    logs = query_logs(f"""
        trace_id:{trace_id} OR
        (level:ERROR AND timestamp:[now-{time_window} TO now])
    """)

    # Feed to AI for analysis
    return analyze_logs_with_ai(logs)

Example Debugging Session

User: My API is returning 500 errors intermittently

Claude: Let me run multi-agent debugging to diagnose this.

[Launches Context, Analysis, and Verification agents]

## Debugging Report: Intermittent 500 Errors

### Agent Findings

**Log Analyzer**:
- Pattern: Errors spike every 5 minutes
- Correlation: Occurs during background job runs
- Log cluster: "Connection pool exhausted"

**Code Reviewer**:
- Found: Database connections not returned to pool in /api/users.py:145
- Missing: try/finally block for connection cleanup

**System Analyst**:
- Connection pool max: 20
- Concurrent requests during spike: 25+
- Background job opens 10 connections

**Historical**:
- Started after commit abc123 (2 days ago)
- Commit added new background job without connection limit

### Consensus Root Cause (Confidence: 92%)
Connection pool exhaustion due to:
1. New background job consuming connections
2. Missing connection cleanup in API handler

### Proposed Fix
```python
# /api/users.py:145
# Before
def get_user(id):
    conn = pool.get_connection()
    result = conn.execute("SELECT * FROM users WHERE id = ?", id)
    return result  # Connection never returned!

# After
def get_user(id):
    conn = pool.get_connection()
    try:
        result = conn.execute("SELECT * FROM users WHERE id = ?", id)
        return result
    finally:
        pool.return_connection(conn)  # Always return
```

### Verification Steps

  1. Add connection tracking metric
  2. Run load test
  3. Monitor pool usage during background job

### Prevention

  1. Add connection timeout
  2. Implement connection leak detection
  3. Review all pool usage

---

## Related Skills

- **multi-ai-code-review**: Review fixes before deployment
- **multi-ai-verification**: Verify fix effectiveness
- **multi-ai-testing**: Generate regression tests
- **ecs-troubleshooting**: Container-specific debugging
- **railway-troubleshooting**: Railway platform debugging

---

## References

- `references/log-analysis-patterns.md` - Log analysis techniques
- `references/self-healing-playbooks.md` - Auto-remediation patterns