sysmedic — System Performance Diagnosis & Optimization
Diagnose and optimize Linux server performance using a Medical Model pipeline. Four phases: Vital Signs (quick dashboard), Examination (deep scan of flagged subsystems), Diagnosis (root cause correlation), and Treatment (safe optimization with backup-first protocol).
Why This Skill Exists
Server performance issues present as symptoms (504 errors, slow pages, high load) but the root cause is often in a different subsystem entirely. A slow website might trace back through: MySQL slow queries → PHP-FPM workers blocked waiting → web server connection queue → user-facing 504 errors. Fixing the web server config alone would miss the real problem. This skill follows the same diagnostic method a doctor uses: check vital signs first, examine only what looks abnormal, diagnose the root cause by correlating evidence across subsystems, then treat with safety-first protocol.
Core Pipeline
Phase 1: VITAL SIGNS → 10-second CTO dashboard, traffic-light per subsystem
Phase 2: EXAMINATION → Deep scan ONLY subsystems flagged yellow/red
Phase 3: DIAGNOSIS → Cross-subsystem root cause correlation
Phase 4: TREATMENT → Backup → Recommend → Manual flags → Auto-fix w/ confirmation
Phase 1: Vital Signs (CTO Dashboard)
Run a quick health check across all 8 subsystems. Each check should complete in under 2 seconds. The goal is triage — identify which subsystems need deeper investigation.
Read references/diagnostic-commands.md Section 1 (Vital Signs Commands) to get the exact commands for each subsystem. Before running commands, detect what is installed:
```shell
# Stack detection (run first)
CORES=$(nproc)
WEB=$(command -v nginx >/dev/null 2>&1 && echo nginx || (command -v httpd >/dev/null 2>&1 && echo apache || (test -f /usr/local/lsws/bin/litespeed && echo litespeed || echo none)))
DB=$(command -v mysql >/dev/null 2>&1 && echo mysql || (command -v psql >/dev/null 2>&1 && echo postgresql || echo none))
PHPFPM=$(pgrep -x "php-fpm" >/dev/null 2>&1 && echo yes || echo no)
DOCKER=$(command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1 && echo yes || echo no)
RUNCLOUD=$(test -d /home/runcloud && echo yes || echo no)
```
Threshold Table
| Subsystem | Metric | Green | Yellow | Red |
|---|---|---|---|---|
| CPU | Load avg / cores | < 0.7 | 0.7–1.0 | > 1.0 |
| CPU | %steal (cloud VPS) | < 3% | 3–10% | > 10% |
| Memory | Used % (excl cache) | < 70% | 70–85% | > 85% |
| Memory | Swap used | < 50MB | 50–500MB | > 500MB |
| Disk | Partition usage | < 75% | 75–90% | > 90% |
| Disk | IO wait % | < 10% | 10–25% | > 25% |
| Network | Established connections | < 500 | 500–2000 | > 2000 |
| Network | TIME_WAIT count | < 1000 | 1000–5000 | > 5000 |
| Web Server | Active workers / max | < 70% | 70–90% | > 90% |
| Web Server | 5xx errors / min | < 1 | 1–10 | > 10 |
| Database | Connections / max | < 60% | 60–80% | > 80% |
| Database | Slow queries / min | < 1 | 1–10 | > 10 |
| PHP-FPM | Active / max_children | < 70% | 70–85% | > 85% |
| PHP-FPM | Listen queue length | 0 | 1–10 | > 10 |
| Docker | Container restarts (24h) | 0 | 1–3 | > 3 |
| Docker | Unhealthy containers | none | — | any |
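As a sketch of how one row of this table can be evaluated automatically, the helper below classifies the 1-minute load average against core count using the CPU thresholds above (the function name `classify_load` is our own):

```shell
# Classify the 1-minute load average against core count.
# Thresholds mirror the CPU row of the table: <0.7 green, 0.7-1.0 yellow, >1.0 red.
classify_load() {
  load=$1 cores=$2
  awk -v l="$load" -v c="$cores" 'BEGIN {
    r = l / c
    s = (r < 0.7) ? "GREEN" : (r <= 1.0) ? "YELLOW" : "RED"
    printf "%s %.2f\n", s, r
  }'
}

# On a live host: the first field of /proc/loadavg is the 1-minute average
if [ -r /proc/loadavg ]; then
  classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

The same ratio-plus-thresholds shape applies to most rows of the table (connections/max, workers/max_children), so one small helper per row keeps Phase 1 under the 2-second budget.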
Dashboard Output Format
Present results in this format (skip subsystems that are not installed):
```
=== SYSMEDIC VITAL SIGNS === [hostname] [date/time] ===
CPU       [GREEN]  Load: 1.2/4 cores (0.30) | User: 32% | Steal: 0%
Memory    [YELLOW] Used: 78% (3.1/4.0 GB) | Swap: 120MB active
Disk      [GREEN]  / 45% | /var 62% | IO wait: 0.3%
Network   [GREEN]  Conns: 234 established | TIME_WAIT: 89
Web       [RED]    LiteSpeed workers: 48/50 (96%) | 5xx: 23/min
Database  [YELLOW] MySQL conns: 120/151 (79%) | Slow: 8/min
PHP-FPM   [RED]    Workers: 30/30 (100%) | Queue: 47 waiting
Docker    [GREEN]  Containers: 3/3 healthy | Restarts: 0

FLAGGED FOR EXAMINATION: Web Server, Database, PHP-FPM, Memory
```
After presenting the dashboard, ask the user which flagged subsystem to examine first, or recommend starting with the most critical (red) one.
Phase 2: Examination (Deep Scan)
Only scan subsystems that were flagged yellow or red in Phase 1. This is the key efficiency gain — do not waste time scanning green subsystems unless the user explicitly requests it.
Read references/diagnostic-commands.md Sections 2–9 — but only the sections matching the flagged subsystems:
- CPU flagged → read Section 2 (CPU Deep)
- Memory flagged → read Section 3 (Memory Deep)
- Disk flagged → read Section 4 (Disk Deep)
- Network flagged → read Section 5 (Network Deep)
- Web Server flagged → read Section 6 (Web Server Deep)
- Database flagged → read Section 7 (Database Deep)
- PHP-FPM flagged → read Section 8 (PHP-FPM Deep)
- Docker flagged → read Section 9 (Docker Deep)
If the environment is RunCloud, Docker Compose, or WordPress/WooCommerce, also read references/stack-profiles.md for stack-specific diagnostic paths and file locations.
Examination Output Format
For each examined subsystem, present findings in this structure:
```
=== EXAMINATION: [Subsystem Name] ===

Evidence:
- [metric]: [value] ([what this means])
- [metric]: [value] ([what this means])

Top Offenders:
1. [process/query/connection]: [resource usage]
2. [process/query/connection]: [resource usage]

Log Signals:
- [timestamp] [source]: [relevant line]
- [timestamp] [source]: [relevant line]
```
Present numbers with context — "78% memory used" is meaningless without knowing total RAM and what is consuming it. Always answer: what is happening, how severe is it, and what is causing it.
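For example, the memory figure can be given context straight from `/proc/meminfo` (a sketch; `MemAvailable` is the kernel's own estimate of reclaimable memory, so `total - available` approximates "used excluding cache"):

```shell
# Memory used % excluding reclaimable cache, plus the absolute figures,
# so "78% used" always arrives with "of how much".
mem_context() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {
    printf "Used: %d%% (%.1f/%.1f GB, excl. cache)\n",
           (t - a) * 100 / t, (t - a) / 1048576, t / 1048576
  }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  mem_context
fi
```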
Phase 3: Diagnosis (Root Cause Analysis)
This is the most valuable phase. The goal is not to list problems — it is to find the root cause that, once fixed, resolves the cascade of symptoms.
Correlation Patterns
Cross-reference findings from Phase 2 against these common patterns:
| Pattern | Symptom Chain | Likely Root Cause |
|---|---|---|
| Memory cascade | High swap → high IO wait → everything slow | Memory leak or undersized RAM |
| Query cascade | Slow queries → PHP worker wait → web queue → 504s | Missing index or unoptimized query |
| Worker exhaustion | PHP-FPM queue → web 502 → user timeouts | max_children too low OR slow upstream (DB) |
| Disk pressure | Disk > 90% → DB write stall → cascade failure | Unrotated logs or temp file accumulation |
| Noisy neighbor | High %steal → inconsistent latency spikes | Cloud hypervisor oversubscription |
| Connection storm | TIME_WAIT flood → port exhaustion → refused connections | Missing connection reuse settings |
The pattern table is a starting point — real systems often combine multiple patterns. Trace the evidence through the chain and identify the single change that would have the most impact.
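As an illustration (not a substitute for judgment), the first rows of the table can be pre-screened mechanically from Phase 1 numbers; the helper names are our own and the thresholds come from the vital-signs table:

```shell
# Flag the "memory cascade" pattern: red swap AND red IO wait together.
memory_cascade_suspected() {
  swap_mb=$1 iowait_pct=$2
  if [ "$swap_mb" -gt 500 ] && [ "$iowait_pct" -gt 25 ]; then
    echo "SUSPECTED: memory pressure is likely driving the IO wait"
  else
    echo "not indicated"
  fi
}

# Flag the "worker exhaustion" pattern: full PHP-FPM pool with a backed-up queue.
worker_exhaustion_suspected() {
  active=$1 max_children=$2 queue=$3
  if [ "$active" -ge "$max_children" ] && [ "$queue" -gt 10 ]; then
    echo "SUSPECTED: check whether workers are slow (upstream DB) or too few"
  else
    echo "not indicated"
  fi
}
```

A positive flag only narrows the search; the causal chain still has to be confirmed with the Phase 2 evidence.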
Diagnosis Output Format
```
=== DIAGNOSIS ===
Root Cause: [one clear sentence]
Confidence: HIGH / MEDIUM / LOW
Category: [memory | cpu | disk | network | config | query | capacity]

Causal Chain:
[Root Cause]
  → [Intermediate effect 1]
  → [Intermediate effect 2]
  → [User-visible symptom]

Supporting Evidence:
1. [specific metric with number from Phase 2]
2. [specific metric with number from Phase 2]
3. [log line or observation that confirms the chain]
```
If confidence is MEDIUM or LOW, explain what additional data would increase confidence and suggest specific commands to gather it.
Phase 4: Treatment (Optimize)
Step 1: Backup Checkpoint (MANDATORY)
Before making ANY changes, create a safety checkpoint. This is non-negotiable — even "safe" changes can have unexpected effects on production systems.
```shell
# Database backup (adjust for what's installed)
mysqldump --all-databases --single-transaction > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql 2>/dev/null
# or: pg_dumpall > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql

# Config backup (include only detected services)
tar czf /root/sysmedic-config-$(date +%Y%m%d-%H%M%S).tar.gz \
  /etc/nginx/ /etc/mysql/ /etc/php/ /etc/redis/ \
  /usr/local/lsws/conf/ /etc/nginx-rc/ \
  2>/dev/null

# Save current vital signs for before/after comparison
# (reuse Phase 1 output, save to /root/sysmedic-vitals-before.txt)
```
Tell the user what was backed up, where it is, and how large the backup is.
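Before reporting success, verify the dump is non-empty (a sketch; a zero-byte `mysqldump` output usually means authentication failed and the error message went to `/dev/null`):

```shell
# Refuse to proceed to Treatment if a backup file is empty or missing.
backup_ok() {
  f=$1
  if [ -s "$f" ]; then
    echo "ok: $f ($(du -h "$f" | cut -f1))"
  else
    echo "FAILED: $f is empty or missing - do not apply fixes"
    return 1
  fi
}
```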
Step 2: Present Recommendations
Read references/optimization-playbooks.md to get specific fix recipes for the diagnosed root cause category.
Sort recommendations by impact and present in this format:
```
=== TREATMENT PLAN ===

[HIGH IMPACT] Fix: [description]
  Why: [links to root cause from Phase 3]
  Command: [exact command or config change]
  Risk: LOW / MEDIUM / HIGH
  Reversible: YES / NO (rollback: [how])
  Auto-apply: YES (safe config change) / NO (needs manual action)

[MEDIUM IMPACT] Fix: [description]
  ...

MANUAL-ONLY (cannot auto-apply):
⚠ [description] — Reason: [requires hardware/vendor/architecture change]
```
Step 3: Apply Fixes
Rules for applying changes:
- ALWAYS ask for user confirmation before applying ANY change
- Apply one fix at a time — never batch multiple changes
- After each fix, verify it took effect (re-check the specific metric)
- For service restarts: warn about brief downtime, confirm timing with user
- For sysctl changes: apply live with `sysctl -w` and persist to `/etc/sysctl.d/99-sysmedic.conf`
- For config file changes: always create a dated backup of the specific file first
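For the sysctl rule, a dry-run generator keeps the apply, persist, and rollback commands together for the user's confirmation (a sketch; `plan_sysctl_change` is our own name, and `net.ipv4.tcp_fin_timeout 30` is only an illustrative change):

```shell
# Emit the apply, persist, and rollback commands for one sysctl change,
# taking the current value as an argument so the rollback line is concrete.
plan_sysctl_change() {
  key=$1 new=$2 old=$3
  echo "sysctl -w $key=$new"
  echo "echo '$key = $new' >> /etc/sysctl.d/99-sysmedic.conf"
  echo "# rollback: sysctl -w $key=$old"
}

# On a live host, read the current value first (needs root only to apply):
#   plan_sysctl_change net.ipv4.tcp_fin_timeout 30 "$(sysctl -n net.ipv4.tcp_fin_timeout)"
```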
Step 4: Verify Improvements
After all approved fixes are applied, re-run Phase 1 vital signs and present a before/after comparison:
```
=== BEFORE / AFTER COMPARISON ===
             BEFORE           AFTER             CHANGE
CPU          [YELLOW] 82%     [GREEN] 45%       -37% ✓
Memory       [RED] 94%        [GREEN] 62%       -32% ✓
Database     [RED] 12 slow/m  [GREEN] 0 slow/m  Resolved ✓
PHP-FPM      [RED] Queue: 47  [GREEN] Queue: 0  Resolved ✓
Web Server   [RED] 96% busy   [YELLOW] 72% busy Improved (monitor)
```
If any subsystem is still yellow or red after treatment, explain why and suggest next steps (further investigation, monitoring period, or manual action needed).
Reference Files
references/diagnostic-commands.md
Read when: Phase 1 (always — Section 1) and Phase 2 (only the sections matching flagged subsystems).
Contains: Complete command catalog organized by subsystem. Each command includes what it returns, how to interpret the output, and fallback commands for minimal installs.
Sections: 1-Vital Signs, 2-CPU Deep, 3-Memory Deep, 4-Disk Deep, 5-Network Deep, 6-Web Server Deep, 7-Database Deep, 8-PHP-FPM Deep, 9-Docker Deep.
references/optimization-playbooks.md
Read when: Phase 4 (Treatment), after the root cause category is identified.
Contains: Fix recipes organized by root cause — memory pressure, CPU saturation, disk I/O, slow queries, worker exhaustion, connection management, cache optimization, kernel tuning. Each recipe includes exact commands, risk level, reversibility, and verification steps.
Read Section 1 (Safety Protocol) before applying any fix.
references/stack-profiles.md
Read when: Phase 2 or Phase 4, if the environment is RunCloud, Docker Compose, or WordPress/WooCommerce.
Contains: Stack-specific diagnostic paths, file locations, and optimization recipes: RunCloud per-app attribution, Docker host-vs-container diagnosis, WordPress/WooCommerce database bloat and caching.
Section 4 (Detection Logic) helps identify the full stack automatically.
Important Reminders
- NEVER make changes without a backup checkpoint. Even sysctl changes should be preceded by saving the current value.
- NEVER restart a production database or web server without explicit user confirmation and agreement on timing.
- Phase 2 is SELECTIVE. Only scan flagged subsystems. Scanning everything wastes time and produces noise that obscures the real issue.
- Handle missing tools gracefully. Not every server has `iostat`, `iotop`, or `nethogs`. Every diagnostic has a fallback using `/proc` or basic coreutils (`ps`, `free`, `df`, `ss`). Check with `command -v` before running optional tools.
- Work across distros. Commands must work on both Ubuntu/Debian AND CentOS/RHEL. Use existence checks before distro-specific tools (e.g., `apt` vs `yum`).
- RunCloud servers have non-standard paths. Check for `/home/runcloud/` and use `/etc/nginx-rc/` instead of `/etc/nginx/` when RunCloud is detected.
- For Docker environments, diagnose both host and containers. High host CPU with low container CPU means the problem is outside the containers.
- The diagnosis is the most valuable output. A correct root cause saves hours of guessing. Invest time in correlating evidence, not just listing metrics.
- If unsure, say so. A MEDIUM or LOW confidence diagnosis with a suggestion for further investigation is more useful than a wrong HIGH confidence diagnosis.
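As one example of the `/proc` fallback rule above, established TCP connections can be counted without `ss` or `netstat` (a sketch; state `01` is ESTABLISHED in `/proc/net/tcp*`):

```shell
# Count established TCP connections straight from /proc (IPv4 + IPv6).
count_established() {
  awk 'FNR > 1 && $4 == "01"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l
}
count_established
```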