Troubleshooting Guide Creator

Эксперт по созданию структурированных руководств по диагностике и устранению проблем.

Core Principles

Problem-Centric Structure

troubleshooting_principles:
  - principle: "Start with clear problem statements and symptoms"
    reason: "Users need to quickly identify if guide applies to their issue"

  - principle: "Use If-Then logic flows for decision trees"
    reason: "Systematic elimination of possible causes"

  - principle: "Organize solutions by likelihood and impact"
    reason: "Try simple/common fixes first, escalate to complex"

  - principle: "Follow logical diagnostic sequence (simple to complex)"
    reason: "Minimize time to resolution"

  - principle: "Include verification steps after each fix"
    reason: "Confirm the issue is actually resolved"

  - principle: "Provide rollback instructions"
    reason: "Allow safe recovery if fix causes new issues"

User Experience Focus

Пиши для целевого уровня аудитории
Используй консистентное форматирование
Указывай оценочное время для каждого шага
Включай скриншоты и примеры где возможно

Standard Guide Template

# Troubleshooting: [Problem Title]

**Last Updated:** [Date]
**Applies To:** [Product/Service/Version]
**Difficulty:** Beginner | Intermediate | Advanced
**Time Estimate:** X-Y minutes

---

## Problem Statement

### Symptoms
Users experiencing this issue will observe:
- [ ] Symptom 1 (observable behavior)
- [ ] Symptom 2 (error message or code)
- [ ] Symptom 3 (system state)

### Error Messages

[Exact error message or code]


### Affected Components
- Component A
- Component B

### Impact
- **Severity:** Critical | High | Medium | Low
- **Affected Users:** All users | Specific group | Single user
- **Business Impact:** [Description]

---

## Quick Checks (2-5 minutes)

Before diving into detailed troubleshooting, verify these common causes:

### Check 1: [Most Common Cause]
**Time:** 30 seconds

```bash
# Command to verify
[diagnostic command]

Expected Output: [What you should see] If this fails: Continue to Check 2

Check 2: [Second Most Common Cause]

Time: 1 minute

[Steps to verify]

Diagnostic Steps

Step 1: Gather Information

Collect the following before proceeding:

Error logs from [location]
System configuration from [location]
User actions that triggered the issue

# Commands to gather diagnostic info
[command 1]
[command 2]

Step 2: Identify the Root Cause

Use this decision tree to identify the cause:

Start
  │
  ├─ Is [condition A] true?
  │    ├─ YES → Go to Solution A
  │    └─ NO → Continue
  │
  ├─ Is [condition B] true?
  │    ├─ YES → Go to Solution B
  │    └─ NO → Continue
  │
  └─ None of the above → Escalate to Support

Solutions

Solution A: [Fix Name]

Difficulty: Easy Time: 5 minutes Risk: Low

Prerequisites

[Prerequisite 1]
[Prerequisite 2]

Steps

Step 1 Title
```
[command]
```
Expected output: [description]
Step 2 Title [Instructions]
Step 3 Title [Instructions]

Verify Fix

[verification command]

Success Indicator: [What to look for]

Rollback (if needed)

[rollback command]

Solution B: [Fix Name]

Difficulty: Medium Time: 15 minutes Risk: Medium

[Same structure as Solution A]

Prevention

To prevent this issue from recurring:

Monitoring: Set up alerts for [metric]
Configuration: Ensure [setting] is properly configured
Process: Follow [procedure] when making changes
Training: Educate team on [best practice]

Escalation

If the above solutions don't resolve the issue:

When to Escalate

Issue persists after trying all solutions
Data loss or security concern identified
Multiple users affected simultaneously

Information to Provide

Time issue started
Steps already attempted
Diagnostic logs collected
Business impact assessment

Contact

Support Team: [Contact info]
Escalation Path: [Who to contact]
SLA: [Expected response time]

Related Resources

[Link to related guide]
[Link to documentation]
[Link to FAQ]

Revision History

Date	Author	Changes
[Date]	[Name]	Initial version
[Date]	[Name]	Added Solution C


---

## Diagnostic Patterns

### Layer-by-Layer Approach

```markdown
## Network Connectivity Troubleshooting

### Layer 1: Physical
- [ ] Check cable connections
- [ ] Verify link lights are active
- [ ] Test with known-good cable

### Layer 2: Data Link
- [ ] Verify MAC address is learned
- [ ] Check for VLAN misconfigurations
- [ ] Review spanning tree state

### Layer 3: Network
- [ ] Verify IP configuration
- [ ] Test ping to gateway
- [ ] Check routing table

### Layer 4: Transport
- [ ] Verify service is listening on correct port
- [ ] Check firewall rules
- [ ] Test with telnet/nc to port

### Layer 7: Application
- [ ] Check application logs
- [ ] Verify configuration files
- [ ] Test with curl/wget

Binary Elimination Method

## Identifying Faulty Component

Use binary search to isolate the issue:

### Step 1: Test Midpoint
Test the system at the midpoint of the data flow:

[Client] → [Load Balancer] → [App Server] → [Database] ↑ Test here first


**If working at midpoint:** Issue is between midpoint and client
**If failing at midpoint:** Issue is between midpoint and database

### Step 2: Narrow Down
Repeat the process, testing the midpoint of the remaining segment.

### Step 3: Isolate
Continue until you've identified the specific failing component.

Symptom-Based Decision Tree

## Application Not Responding

┌─ Can you reach the server at all? │ ├─ NO → Network/DNS Issue │ └─ Go to: Network Troubleshooting Guide │ └─ YES → Continue │ ├─ Does the service port respond? │ ├─ NO → Service Not Running │ └─ Go to: Service Restart Procedure │ └─ YES → Continue │ ├─ Are there errors in application logs? │ ├─ YES → Application Error │ └─ Go to: Log Analysis Guide │ └─ NO → Resource Exhaustion └─ Go to: Performance Troubleshooting

Log Analysis Guide

Common Log Locations

linux_logs:
  system:
    - /var/log/syslog
    - /var/log/messages
    - journalctl -xe

  application:
    - /var/log/[app-name]/
    - ~/.pm2/logs/
    - docker logs [container]

  web_server:
    nginx:
      - /var/log/nginx/error.log
      - /var/log/nginx/access.log
    apache:
      - /var/log/apache2/error.log
      - /var/log/httpd/error_log

  database:
    postgresql:
      - /var/log/postgresql/
    mysql:
      - /var/log/mysql/error.log

Log Analysis Commands

# Find errors in last 100 lines
tail -100 /var/log/app.log | grep -i error

# Find errors with timestamp
grep -i error /var/log/app.log | tail -50

# Watch log in real-time
tail -f /var/log/app.log | grep --line-buffered -i error

# Count errors by type
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -20

# Find entries around specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log

# Extract specific fields (JSON logs)
cat /var/log/app.json | jq 'select(.level == "error") | {time, message}'

# Search compressed logs
zgrep -i error /var/log/app.log.*.gz

Error Pattern Recognition

## Common Error Patterns

### Connection Errors

Pattern: "Connection refused" | "ECONNREFUSED" | "Connection timed out" Cause: Service not running or firewall blocking Fix: Check service status, verify port, check firewall rules


### Memory Errors

Pattern: "Out of memory" | "OOM" | "Cannot allocate memory" Cause: Process exhausting available RAM Fix: Increase memory, optimize application, add swap


### Disk Errors

Pattern: "No space left on device" | "ENOSPC" | "Disk full" Cause: Filesystem at capacity Fix: Clean old files, increase disk, enable log rotation


### Permission Errors

Pattern: "Permission denied" | "EACCES" | "Operation not permitted" Cause: Insufficient file/directory permissions Fix: Check ownership, verify permissions, check SELinux/AppArmor


### Database Errors

Pattern: "Too many connections" | "Connection pool exhausted" Cause: Connection leak or undersized pool Fix: Close unused connections, increase pool size, fix leaks

Specific Problem Templates

API Not Responding

# Troubleshooting: API Not Responding

## Quick Diagnosis Script

```bash
#!/bin/bash
# api-health-check.sh

API_URL="${1:-http://localhost:8080}"
TIMEOUT=5

echo "=== API Health Check ==="
echo "Target: $API_URL"
echo

# 1. DNS Resolution
echo "1. DNS Resolution..."
if host=$(dig +short $(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1) 2>/dev/null); then
    echo "   ✅ DNS resolves to: $host"
else
    echo "   ❌ DNS resolution failed"
fi

# 2. Port Connectivity
echo "2. Port Connectivity..."
PORT=$(echo $API_URL | grep -oP ':\K[0-9]+' || echo "80")
HOST=$(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
if nc -z -w $TIMEOUT $HOST $PORT 2>/dev/null; then
    echo "   ✅ Port $PORT is open"
else
    echo "   ❌ Port $PORT is not reachable"
fi

# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
    echo "   ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
    echo "   ⚠️  Health endpoint returns $HTTP_CODE"
else
    echo "   ❌ No HTTP response"
fi

# 4. Response Time
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if (( $(echo "$RESPONSE_TIME < 1" | bc -l) )); then
    echo "   ✅ Response time: ${RESPONSE_TIME}s"
else
    echo "   ⚠️  Slow response: ${RESPONSE_TIME}s"
fi

echo
echo "=== Check Complete ==="

Decision Tree

API Not Responding
       │
       ├─ Can you ping the server?
       │   ├─ NO → Check network/DNS
       │   └─ YES ↓
       │
       ├─ Is the service running?
       │   ├─ NO → Start/restart service
       │   └─ YES ↓
       │
       ├─ Is the port listening?
       │   ├─ NO → Check service configuration
       │   └─ YES ↓
       │
       ├─ Does health check pass?
       │   ├─ NO → Check dependencies (DB, cache)
       │   └─ YES ↓
       │
       └─ Check application logs for errors


### Database Connection Issues

```markdown
# Troubleshooting: Database Connection Failed

## Symptoms
- Application shows "Connection refused" or "Connection timed out"
- Error: "FATAL: too many connections for role"
- Error: "FATAL: password authentication failed"

## Quick Checks

### 1. Verify Database is Running
```bash
# PostgreSQL
sudo systemctl status postgresql
pg_isready -h localhost -p 5432

# MySQL
sudo systemctl status mysql
mysqladmin -u root -p ping

2. Test Connection

# PostgreSQL
psql -h localhost -U username -d database -c "SELECT 1"

# MySQL
mysql -h localhost -u username -p -e "SELECT 1"

3. Check Connection Count

-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT max_connections FROM pg_settings WHERE name = 'max_connections';

-- MySQL
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';

Solutions

Solution 1: Restart Connection Pool

# If using PgBouncer
sudo systemctl restart pgbouncer

# Application restart
sudo systemctl restart myapp

Solution 2: Clear Idle Connections

-- PostgreSQL: Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes';

Solution 3: Increase Max Connections

-- PostgreSQL (requires restart)
ALTER SYSTEM SET max_connections = 200;

-- MySQL (can be done live)
SET GLOBAL max_connections = 200;


---

## Quality Assurance Checklist

### Pre-Publication Review

```markdown
## Troubleshooting Guide Quality Checklist

### Accuracy
- [ ] All commands tested and verified
- [ ] Output examples are accurate
- [ ] Links to resources are valid
- [ ] Version numbers are current

### Completeness
- [ ] All common causes covered
- [ ] Rollback instructions provided
- [ ] Escalation path defined
- [ ] Prevention tips included

### Usability
- [ ] Clear success/failure criteria
- [ ] Time estimates accurate
- [ ] Difficulty levels appropriate
- [ ] Tested by someone unfamiliar with issue

### Formatting
- [ ] Consistent heading structure
- [ ] Code blocks properly formatted
- [ ] Decision trees clear
- [ ] Screenshots/diagrams where helpful

### Maintenance
- [ ] Last updated date included
- [ ] Revision history maintained
- [ ] Owner/contact identified
- [ ] Review schedule established

Лучшие практики

Начинай с симптомов — пользователь должен быстро понять, подходит ли гайд
Простое решение первым — проверь очевидные причины до сложной диагностики
Включай verification steps — как понять, что проблема решена
Документируй rollback — возможность отката если fix не помог
Указывай время — пользователь должен знать сколько займёт каждый шаг
Тестируй на новичках — гайд должен работать для тех, кто не знает систему
Обновляй регулярно — устаревший гайд хуже чем его отсутствие
Включай escalation path — когда и к кому обращаться

troubleshooting-guide

Troubleshooting Guide Creator

Core Principles

Problem-Centric Structure

User Experience Focus

Standard Guide Template

Check 2: [Second Most Common Cause]

Diagnostic Steps

Step 1: Gather Information

Step 2: Identify the Root Cause

Solutions

Solution A: [Fix Name]

Prerequisites

Steps

Verify Fix

Rollback (if needed)

Solution B: [Fix Name]

Prevention

Escalation

When to Escalate

Information to Provide

Contact

Related Resources

Revision History

Binary Elimination Method

Symptom-Based Decision Tree

Log Analysis Guide

Common Log Locations

Log Analysis Commands

Error Pattern Recognition

Specific Problem Templates

API Not Responding

Decision Tree

2. Test Connection

3. Check Connection Count

Solutions

Solution 1: Restart Connection Pool

Solution 2: Clear Idle Connections

Solution 3: Increase Max Connections

Лучшие практики