# Multi-AI Debugging

## Overview

multi-ai-debugging provides systematic debugging workflows that use multiple AI models as specialized agents. It is based on 2024-2025 best practices for AI-assisted debugging with multi-agent architectures.

**Purpose**: Systematic root cause analysis and fix generation using an AI ensemble

**Pattern**: Task-based (6 independent debugging operations)

**Key Principles** (validated by tri-AI research):
- Multi-Agent Council - Specialized agents debate root causes before consensus
- Evaluator/Critic Loops - Fix agent + critic agent verify solutions
- Trace-Aware Analysis - Full execution context, not just error messages
- Semantic Log Analysis - LLM understanding beyond regex matching
- Cross-Stack Correlation - Connect frontend, backend, and infrastructure issues
- Auto-Remediation - Self-healing patterns where safe
**Quality Targets**:
- Root Cause Identification: >80% accuracy
- Time to Diagnosis: <30 minutes for common issues
- Fix Generation Success: >60% for known patterns
- False Positive Rate: <20% on error classification
## When to Use

Use multi-ai-debugging when:
- Debugging production incidents
- Analyzing error logs and stack traces
- Performing root cause analysis (RCA)
- Troubleshooting complex multi-service systems
- Investigating performance degradation
- Understanding cascading failures
- Writing post-mortem reports
**When NOT to Use**:
- Simple syntax errors (IDE handles these)
- Clear compile-time errors
- Well-documented known issues
## Prerequisites

### Required
- Error information (logs, stack trace, error message)
- Access to relevant code
### Recommended
- Execution traces (if available)
- System metrics/observability data
- Recent change history (git log)
- Gemini CLI for web research
- Codex CLI for deep code analysis
### Integration
- OpenTelemetry traces (ideal)
- Log aggregation (CloudWatch, Datadog, etc.)
- APM tools (optional)
## Operations

### Operation 1: Quick Error Diagnosis

**Time**: 2-5 minutes | **Automation**: 70% | **Purpose**: Fast initial diagnosis for common errors
Process:
- Analyze Error:
Diagnose this error:
Error: [PASTE ERROR MESSAGE]
Stack trace: [PASTE STACK TRACE]
Provide:
1. What type of error is this?
2. Most likely root cause (1-2 sentences)
3. Immediate fix suggestion
4. Prevention recommendation
- Verify with Gemini (for web research):
```bash
gemini -p "Search for solutions to this error:
[ERROR MESSAGE]

Find:
- Common causes
- Stack Overflow solutions
- GitHub issues with fixes"
```
- Output: Quick diagnosis with fix suggestion
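For scripted triage, the Gemini verification step can also be driven from code. A minimal sketch in Python, assuming the `gemini` CLI shown above is on the PATH and accepts its prompt via `-p`; the `quick_web_research` helper name is illustrative:

```python
import subprocess

def quick_web_research(error_message: str, stack_trace: str = "") -> str:
    """Ask the Gemini CLI to look up known causes and fixes for an error (sketch)."""
    prompt = (
        "Search for solutions to this error:\n"
        f"{error_message}\n{stack_trace}\n\n"
        "Find:\n- Common causes\n- Stack Overflow solutions\n- GitHub issues with fixes"
    )
    # Assumes the CLI prints its answer to stdout; adjust flags to your installation.
    result = subprocess.run(["gemini", "-p", prompt], capture_output=True, text=True, timeout=120)
    return result.stdout
```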
### Operation 2: Root Cause Analysis (RCA)

**Time**: 15-45 minutes | **Automation**: 50% | **Purpose**: Deep root cause analysis for complex issues
Process:
Step 1: Gather Context (Context Agent)
```bash
# Recent changes
git log --oneline -20
git diff HEAD~5..HEAD --stat

# Related logs
grep -r "ERROR\|Exception\|WARN" logs/ | tail -100

# System state
# Check for relevant metrics, traces, etc.
```
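The same context can also be collected programmatically before prompting the Analysis Agent. A minimal sketch, assuming a git checkout and a local `logs/` directory; the `gather_context` helper name is illustrative:

```python
import subprocess
from pathlib import Path

def gather_context(log_dir: str = "logs", commits: int = 20) -> dict:
    """Collect recent changes and error log lines for the Analysis Agent (sketch)."""
    def sh(cmd: list[str]) -> str:
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    changes = sh(["git", "log", "--oneline", f"-{commits}"]) + sh(
        ["git", "diff", "HEAD~5..HEAD", "--stat"]
    )

    error_lines = []
    for path in Path(log_dir).rglob("*.log"):
        for line in path.read_text(errors="ignore").splitlines():
            if any(token in line for token in ("ERROR", "Exception", "WARN")):
                error_lines.append(line)

    return {
        "recent_changes": changes,
        "logs": "\n".join(error_lines[-100:]),  # keep the last 100 matches, like `tail -100`
    }
```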
Step 2: Hypothesis Generation (Analysis Agent)
Perform root cause analysis:
**Error/Symptom**: [DESCRIPTION]
**Context**:
- Recent changes: [GIT LOG]
- Logs: [RELEVANT LOG ENTRIES]
- System state: [METRICS/OBSERVATIONS]
- When started: [TIMESTAMP]
- Affected scope: [USERS/SERVICES]
**Tasks**:
1. List 3-5 probable root causes ranked by likelihood
2. For each hypothesis:
- Evidence supporting it
- Evidence against it
- Confidence level (High/Medium/Low)
3. Recommend investigation steps for top hypothesis
Step 3: Cross-Validate (Verification Agent)
Verify this root cause hypothesis:
Hypothesis: [TOP HYPOTHESIS]
Evidence: [SUPPORTING EVIDENCE]
Tasks:
1. What would we expect to see if this is correct?
2. What would disprove this hypothesis?
3. Design a reproduction test
4. Confidence assessment (0-100)
Step 4: Generate RCA Report
## Root Cause Analysis Report
**Incident**: [DESCRIPTION]
**Date**: [DATE]
**Duration**: [DURATION]
**Impact**: [USERS/SERVICES AFFECTED]
### Timeline
- HH:MM - First error observed
- HH:MM - Investigation began
- HH:MM - Root cause identified
- HH:MM - Fix deployed
### Root Cause
[DETAILED EXPLANATION]
### Contributing Factors
1. [FACTOR 1]
2. [FACTOR 2]
### Resolution
[FIX APPLIED]
### Prevention
1. [ACTION ITEM 1]
2. [ACTION ITEM 2]
### Operation 3: Log Analysis & Classification

**Time**: 5-15 minutes | **Automation**: 80% | **Purpose**: Analyze and classify error logs
Process:
- Cluster Log Patterns:
Analyze these log entries:
[PASTE 50-100 LOG LINES]
Tasks:
1. Identify unique error patterns (cluster similar errors)
2. Classify each pattern:
- Type: (Bug/Config/Network/Resource/Security/User Error)
- Severity: (Critical/High/Medium/Low)
- Impact: (Data Loss/Service Down/Degraded/Minor)
3. Count occurrences per pattern
4. Identify the root pattern (original error vs cascading)
5. Recommend priority order for investigation
- Semantic Analysis:
Perform semantic analysis on these logs:
[LOG ENTRIES]
Looking for:
- Anomalies in timing/sequence
- Correlation between events
- Hidden dependencies
- Patterns a human might miss
- Output: Classified and prioritized error report
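Pre-clustering can also be done locally so the prompt stays small on noisy logs. A minimal sketch, assuming plain-text log lines; the normalization rules are illustrative and should be tuned to your log format:

```python
import re
from collections import Counter

def cluster_log_patterns(lines: list[str]) -> list[tuple[str, int]]:
    """Group similar log lines by masking the variable parts (sketch)."""
    def normalize(line: str) -> str:
        line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)    # pointers / addresses
        line = re.sub(r"[0-9a-f-]{32,}", "<ID>", line)      # UUIDs / long hashes
        line = re.sub(r"\d+", "<N>", line)                  # counts, ports, timestamps
        return line.strip()

    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common()  # [(pattern, occurrences), ...] ready to paste into the prompt
```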
### Operation 4: Multi-Agent Council Debugging

**Time**: 20-60 minutes | **Automation**: 40% | **Purpose**: Complex issues requiring multiple perspectives
Process:
Launch Parallel Agents:
Launch 4 debugging agents for this issue:
Issue: [DESCRIPTION]
Code: [RELEVANT CODE]
Logs: [RELEVANT LOGS]
Agent 1 (Code Reviewer):
"Analyze the code for bugs. Focus on:
- Logic errors
- Edge cases
- Race conditions
- Resource leaks"
Agent 2 (Log Analyzer):
"Analyze the logs for clues. Focus on:
- Error sequences
- Timing patterns
- State changes
- External dependencies"
Agent 3 (System Analyst):
"Analyze system context. Focus on:
- Resource constraints
- Configuration issues
- Dependency problems
- Infrastructure state"
Agent 4 (Historical Analyst):
"Analyze history. Focus on:
- Recent changes that could cause this
- Similar past incidents
- Regression indicators
- Pattern matching to known issues"
Council Deliberation:
Synthesize findings from all debugging agents:
Agent 1 (Code): [FINDINGS]
Agent 2 (Logs): [FINDINGS]
Agent 3 (System): [FINDINGS]
Agent 4 (History): [FINDINGS]
Tasks:
1. Find consensus root cause (where 2+ agents agree)
2. Resolve conflicting hypotheses
3. Combine evidence for strongest theory
4. Rate overall confidence (0-100)
5. Propose fix with verification steps
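The council pattern can be scripted so the four agents run in parallel and a final pass synthesizes their output. A minimal sketch, assuming an `ask_model(role, prompt)` helper that wraps whichever model each agent runs on; both the helper and the condensed prompts are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

AGENT_PROMPTS = {
    "Code Reviewer": "Analyze the code for bugs: logic errors, edge cases, race conditions, resource leaks.",
    "Log Analyzer": "Analyze the logs for clues: error sequences, timing patterns, state changes, dependencies.",
    "System Analyst": "Analyze system context: resource constraints, configuration, dependencies, infra state.",
    "Historical Analyst": "Analyze history: recent changes, similar past incidents, regressions, known issues.",
}

def council_debug(issue: str, ask_model) -> str:
    """Fan out to the four debugging agents, then synthesize a consensus (sketch)."""
    with ThreadPoolExecutor(max_workers=len(AGENT_PROMPTS)) as pool:
        futures = {
            name: pool.submit(ask_model, name, f"{task}\n\nIssue:\n{issue}")
            for name, task in AGENT_PROMPTS.items()
        }
        findings = {name: future.result() for name, future in futures.items()}

    synthesis = "Synthesize findings from all debugging agents:\n" + "\n".join(
        f"{name}: {text}" for name, text in findings.items()
    ) + "\nFind the consensus root cause, rate confidence 0-100, and propose a fix."
    return ask_model("Council", synthesis)
```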
### Operation 5: Auto-Fix Generation

**Time**: 10-30 minutes | **Automation**: 60% | **Purpose**: Generate and verify fixes
Process:
Step 1: Generate Fix (Fixer Agent)
Generate a fix for this issue:
Issue: [ROOT CAUSE]
Code: [AFFECTED CODE]
Requirements:
1. Minimal change (fix only the issue)
2. Include error handling
3. Add comments explaining the fix
4. Suggest test cases to verify
Output format:
- File: [path]
- Before: [original code]
- After: [fixed code]
- Explanation: [why this fixes it]
Step 2: Critique Fix (Critic Agent)
Critique this proposed fix:
Issue: [ORIGINAL ISSUE]
Proposed Fix: [FIX CODE]
Evaluate:
1. Does it actually fix the root cause?
2. Could it introduce new bugs?
3. Edge cases not handled?
4. Performance implications?
5. Security implications?
Verdict: APPROVE / NEEDS_REVISION / REJECT
Step 3: Generate Regression Test
Generate a regression test for this fix:
Original Bug: [DESCRIPTION]
Fix Applied: [FIX CODE]
Create test that:
1. Would have caught the original bug
2. Verifies the fix works
3. Tests edge cases
4. Can run in CI/CD
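The fixer/critic exchange can be wrapped in a small loop that stops on approval. A minimal sketch, again assuming an illustrative `ask_model(role, prompt)` helper; the verdict check is a naive substring match:

```python
def generate_verified_fix(root_cause: str, code: str, ask_model, max_rounds: int = 3):
    """Fixer proposes a patch, Critic reviews it; repeat until approved or out of rounds (sketch)."""
    feedback = ""
    for _ in range(max_rounds):
        fix = ask_model(
            "Fixer",
            f"Generate a minimal fix.\nIssue: {root_cause}\nCode:\n{code}\n"
            f"Previous critic feedback, if any:\n{feedback}",
        )
        critique = ask_model(
            "Critic",
            f"Critique this proposed fix for: {root_cause}\n{fix}\n"
            "End with a single verdict line: APPROVE / NEEDS_REVISION / REJECT",
        )
        if "APPROVE" in critique and "NEEDS_REVISION" not in critique and "REJECT" not in critique:
            return fix, critique
        feedback = critique
    return None, feedback  # escalate to a human if the loop never converges
```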
### Operation 6: Self-Healing Patterns

**Time**: Variable | **Automation**: 90% | **Purpose**: Automated detection and remediation
Process:
Define Remediation Playbooks:
```python
# Example: Auto-remediation patterns
PLAYBOOKS = {
    "disk_space_low": {
        "detection": "disk_usage > 90%",
        "actions": [
            "compress_old_logs",
            "clear_temp_files",
            "alert_if_still_high",
        ],
    },
    "memory_leak_detected": {
        "detection": "memory_growth > 10%/hour",
        "actions": [
            "capture_heap_dump",
            "graceful_restart",
            "alert_team",
        ],
    },
    "error_rate_spike": {
        "detection": "error_rate > 5%",
        "actions": [
            "check_recent_deploys",
            "consider_rollback",
            "alert_on_call",
        ],
    },
}
```
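A small dispatcher can tie detections to their actions. A minimal sketch; the `ACTIONS` registry and its stubs are illustrative placeholders for real remediation code, and the detection itself is assumed to come from your monitoring system:

```python
# Map playbook action names to real callables (illustrative stubs).
ACTIONS = {
    "compress_old_logs": lambda: print("compressing old logs"),
    "clear_temp_files": lambda: print("clearing temp files"),
    "alert_if_still_high": lambda: print("paging on-call"),
}

def run_playbook(name: str, triggered: bool) -> None:
    """Run a playbook's actions in order once its detection condition fires (sketch)."""
    if not triggered:
        return
    for action in PLAYBOOKS[name]["actions"]:
        handler = ACTIONS.get(action)
        if handler is None:
            print(f"no handler registered for {action!r}; alerting instead")
            continue
        handler()

# Example: monitoring reports disk usage above the playbook's threshold.
run_playbook("disk_space_low", triggered=True)
```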
Configure Circuit Breakers:
```python
# Intelligent circuit breaking
class AICircuitBreaker:
    def should_open(self, metrics):
        """AI predicts cascading failure risk."""
        prompt = f"""
        Given these metrics:
        - Error rate: {metrics['error_rate']}
        - Latency p99: {metrics['latency_p99']}
        - Dependencies health: {metrics['deps']}

        Should we open the circuit breaker?
        Risk of cascade: (Low/Medium/High)
        Recommendation: (OPEN/CLOSED/HALF_OPEN)
        """
        # `analyze` is a placeholder for whatever LLM call the project uses
        return analyze(prompt)
```
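A hedged usage sketch: consult the breaker before calling a flaky dependency and fail fast when it recommends opening. The `current_metrics` and `call_downstream` names are illustrative:

```python
breaker = AICircuitBreaker()

def guarded_call(request, current_metrics, call_downstream):
    """Fail fast when the breaker recommends OPEN; otherwise pass the request through (sketch)."""
    verdict = str(breaker.should_open(current_metrics))
    if "OPEN" in verdict and "HALF_OPEN" not in verdict:
        return {"status": 503, "body": "dependency temporarily disabled"}
    return call_downstream(request)
```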
## Multi-AI Coordination

### Agent Assignment Strategy
| Task | Primary | Verification | Strength |
|---|---|---|---|
| Log analysis | Gemini | Claude | Fast, large context |
| Code analysis | Claude | Codex | Deep understanding |
| Root cause | Claude | Gemini | Reasoning + search |
| Fix generation | Claude | Codex | Code + review |
| Research | Gemini | Claude | Web search |
### Coordination Commands

**Gemini for Log Search:**

```bash
gemini -p "Analyze these logs and identify anomalies:
[LOGS]"
```
**Claude for Root Cause:**
Given this debugging context, what's the root cause?
[CONTEXT]
**Codex for Fix Validation:**

```bash
codex "Review this fix for correctness and edge cases:
[FIX]"
```
## Decision Trees

### Error Type Classification

```
Error Type Decision Tree:

1. Is there a stack trace?
   ├── Yes → Go to Code Error Analysis
   └── No  → Go to System Error Analysis

2. Code Error Analysis:
   ├── NullPointer/TypeError → Missing null check
   ├── IndexOutOfBounds → Boundary condition
   ├── Timeout → Resource/network issue
   ├── Permission denied → Auth/authz issue
   └── Unknown → Multi-agent analysis

3. System Error Analysis:
   ├── Connection refused → Service down
   ├── Disk full → Resource exhaustion
   ├── Out of memory → Memory leak/sizing
   ├── CPU spike → Performance issue
   └── Unknown → Multi-agent analysis
```
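The tree maps directly onto a first-pass classifier that runs before any model is consulted. A minimal sketch using substring matching; the keyword lists are illustrative and should be extended per stack:

```python
CODE_ERROR_RULES = [
    (("NullPointer", "TypeError"), "Missing null check"),
    (("IndexOutOfBounds", "IndexError"), "Boundary condition"),
    (("Timeout", "timed out"), "Resource/network issue"),
    (("Permission denied", "Forbidden"), "Auth/authz issue"),
]

SYSTEM_ERROR_RULES = [
    (("Connection refused",), "Service down"),
    (("No space left", "Disk full"), "Resource exhaustion"),
    (("Out of memory", "OOM"), "Memory leak/sizing"),
    (("CPU",), "Performance issue"),
]

def classify_error(message: str, has_stack_trace: bool) -> str:
    """Follow the decision tree: code rules when a stack trace exists, system rules otherwise (sketch)."""
    rules = CODE_ERROR_RULES if has_stack_trace else SYSTEM_ERROR_RULES
    for keywords, diagnosis in rules:
        if any(keyword.lower() in message.lower() for keyword in keywords):
            return diagnosis
    return "Unknown -> escalate to multi-agent analysis"
```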
### Severity Assessment

Severity Decision:

**CRITICAL (P1)**:
- Data loss occurring
- Security breach active
- Service completely down
- Revenue impact immediate

**HIGH (P2)**:
- Service degraded significantly
- Errors affecting >10% of users
- Potential data integrity issues

**MEDIUM (P3)**:
- Errors affecting <10% of users
- Workaround available
- Non-critical feature broken

**LOW (P4)**:
- Cosmetic issues
- Edge case errors
- No user impact
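The ladder can be applied mechanically when the basic facts of an incident are known. A minimal sketch; the input flags are illustrative simplifications of the criteria above:

```python
def assess_severity(data_loss: bool, security_breach: bool, service_down: bool,
                    affected_user_pct: float) -> str:
    """Map incident facts onto the P1-P4 severity ladder (sketch)."""
    if data_loss or security_breach or service_down:
        return "CRITICAL (P1)"
    if affected_user_pct > 10:
        return "HIGH (P2)"
    if affected_user_pct > 0:
        return "MEDIUM (P3)"
    return "LOW (P4)"
```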
## Integration with Observability

### OpenTelemetry Pattern

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("debug_session")
def debug_with_tracing(error):
    """Debug with full trace context."""
    span = trace.get_current_span()

    # Add debugging context to trace
    span.set_attribute("error.type", type(error).__name__)
    span.set_attribute("error.message", str(error))

    # Gather trace context for AI
    # (get_parent_spans, get_correlated_logs, and multi_agent_debug are
    # project-specific helpers, not OpenTelemetry APIs)
    context = {
        "trace_id": span.get_span_context().trace_id,
        "parent_spans": get_parent_spans(),
        "related_logs": get_correlated_logs()
    }
    return multi_agent_debug(context)
```
### Log Correlation

```python
def correlate_logs_for_debugging(trace_id, time_window="5m"):
    """Gather all logs related to an incident."""
    # Query logs with trace correlation
    # (query_logs and analyze_logs_with_ai wrap your log backend and model call)
    logs = query_logs(f"""
        trace_id:{trace_id} OR
        (level:ERROR AND timestamp:[now-{time_window} TO now])
    """)
    # Feed to AI for analysis
    return analyze_logs_with_ai(logs)
```
## Example Debugging Session

**User**: My API is returning 500 errors intermittently

**Claude**: Let me run multi-agent debugging to diagnose this.

[Launches Context, Analysis, and Verification agents]
## Debugging Report: Intermittent 500 Errors
### Agent Findings
**Log Analyzer**:
- Pattern: Errors spike every 5 minutes
- Correlation: Occurs during background job runs
- Log cluster: "Connection pool exhausted"
**Code Reviewer**:
- Found: Database connections not returned to pool in /api/users.py:145
- Missing: try/finally block for connection cleanup
**System Analyst**:
- Connection pool max: 20
- Concurrent requests during spike: 25+
- Background job opens 10 connections
**Historical**:
- Started after commit abc123 (2 days ago)
- Commit added new background job without connection limit
### Consensus Root Cause (Confidence: 92%)
Connection pool exhaustion due to:
1. New background job consuming connections
2. Missing connection cleanup in API handler
### Proposed Fix
```python
# /api/users.py:145

# Before
def get_user(id):
    conn = pool.get_connection()
    result = conn.execute("SELECT * FROM users WHERE id = ?", id)
    return result  # Connection never returned!

# After
def get_user(id):
    conn = pool.get_connection()
    try:
        result = conn.execute("SELECT * FROM users WHERE id = ?", id)
        return result
    finally:
        pool.return_connection(conn)  # Always return
```

### Verification Steps

- Add connection tracking metric
- Run load test
- Monitor pool usage during background job

### Prevention

- Add connection timeout
- Implement connection leak detection
- Review all pool usage
---
## Related Skills
- **multi-ai-code-review**: Review fixes before deployment
- **multi-ai-verification**: Verify fix effectiveness
- **multi-ai-testing**: Generate regression tests
- **ecs-troubleshooting**: Container-specific debugging
- **railway-troubleshooting**: Railway platform debugging
---
## References
- `references/log-analysis-patterns.md` - Log analysis techniques
- `references/self-healing-playbooks.md` - Auto-remediation patterns