sysmedic — System Performance Diagnosis & Optimization
Diagnose and optimize Linux server performance using a Medical Model pipeline. Four phases: Vital Signs (quick dashboard), Examination (deep scan of flagged subsystems), Diagnosis (root cause correlation), and Treatment (safe optimization with backup-first protocol).
Why This Skill Exists
Server performance issues present as symptoms (504 errors, slow pages, high load) but the root cause is often in a different subsystem entirely. A slow website might trace back through: MySQL slow queries → PHP-FPM workers blocked waiting → web server connection queue → user-facing 504 errors. Fixing the web server config alone would miss the real problem. This skill follows the same diagnostic method a doctor uses: check vital signs first, examine only what looks abnormal, diagnose the root cause by correlating evidence across subsystems, then treat with safety-first protocol.
Core Pipeline
Phase 1: VITAL SIGNS → 10-second CTO dashboard, traffic-light per subsystem
Phase 2: EXAMINATION → Deep scan ONLY subsystems flagged yellow/red
Phase 3: DIAGNOSIS → Cross-subsystem root cause correlation
Phase 4: TREATMENT → Backup → Recommend → Manual flags → Auto-fix w/ confirmation
Phase 1: Vital Signs (CTO Dashboard)
Run a quick health check across all 8 subsystems. Each check should complete in under 2 seconds. The goal is triage — identify which subsystems need deeper investigation.
Read references/diagnostic-commands.md Section 1 (Vital Signs Commands) to get the exact commands for each subsystem. Before running commands, detect what is installed:
```shell
# Stack detection (run first)
CORES=$(nproc)
WEB=$(command -v nginx >/dev/null 2>&1 && echo nginx || (command -v httpd >/dev/null 2>&1 && echo apache || (test -f /usr/local/lsws/bin/litespeed && echo litespeed || echo none)))
DB=$(command -v mysql >/dev/null 2>&1 && echo mysql || (command -v psql >/dev/null 2>&1 && echo postgresql || echo none))
PHPFPM=$(pgrep -x "php-fpm" >/dev/null 2>&1 && echo yes || echo no)
DOCKER=$(command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1 && echo yes || echo no)
RUNCLOUD=$(test -d /home/runcloud && echo yes || echo no)
```
Threshold Table
| Subsystem | Metric | Green | Yellow | Red |
|---|---|---|---|---|
| CPU | Load avg / cores | < 0.7 | 0.7–1.0 | > 1.0 |
| CPU | %steal (cloud VPS) | < 3% | 3–10% | > 10% |
| Memory | Used % (excl cache) | < 70% | 70–85% | > 85% |
| Memory | Swap used | < 50MB | 50–500MB | > 500MB |
| Disk | Partition usage | < 75% | 75–90% | > 90% |
| Disk | IO wait % | < 10% | 10–25% | > 25% |
| Network | Established connections | < 500 | 500–2000 | > 2000 |
| Network | TIME_WAIT count | < 1000 | 1000–5000 | > 5000 |
| Web Server | Active workers / max | < 70% | 70–90% | > 90% |
| Web Server | 5xx errors / min | < 1 | 1–10 | > 10 |
| Database | Connections / max | < 60% | 60–80% | > 80% |
| Database | Slow queries / min | < 1 | 1–10 | > 10 |
| PHP-FPM | Active / max_children | < 70% | 70–85% | > 85% |
| PHP-FPM | Listen queue length | 0 | 1–10 | > 10 |
| Docker | Container restarts (24h) | 0 | 1–3 | > 3 |
| Docker | Unhealthy containers | none | — | any |
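As a sketch of how one row of this table can be evaluated automatically, the helper below classifies the 1-minute load average against core count using the CPU thresholds above (the function name `classify_load` is our own):

```shell
# Classify the 1-minute load average against core count.
# Thresholds mirror the CPU row of the table: <0.7 green, 0.7-1.0 yellow, >1.0 red.
classify_load() {
  load=$1 cores=$2
  awk -v l="$load" -v c="$cores" 'BEGIN {
    r = l / c
    s = (r < 0.7) ? "GREEN" : (r <= 1.0) ? "YELLOW" : "RED"
    printf "%s %.2f\n", s, r
  }'
}

# On a live host: the first field of /proc/loadavg is the 1-minute average
if [ -r /proc/loadavg ]; then
  classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

The same ratio-plus-thresholds shape applies to most rows of the table (connections/max, workers/max_children), so one small helper per row keeps Phase 1 under the 2-second budget.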
Dashboard Output Format
Present results in this format (skip subsystems that are not installed):
```
=== SYSMEDIC VITAL SIGNS === [hostname] [date/time] ===
CPU       [GREEN]  Load: 1.2/4 cores (0.30) | User: 32% | Steal: 0%
Memory    [YELLOW] Used: 78% (3.1/4.0 GB) | Swap: 120MB active
Disk      [GREEN]  / 45% | /var 62% | IO wait: 0.3%
Network   [GREEN]  Conns: 234 established | TIME_WAIT: 89
Web       [RED]    LiteSpeed workers: 48/50 (96%) | 5xx: 23/min
Database  [YELLOW] MySQL conns: 120/151 (79%) | Slow: 8/min
PHP-FPM   [RED]    Workers: 30/30 (100%) | Queue: 47 waiting
Docker    [GREEN]  Containers: 3/3 healthy | Restarts: 0

FLAGGED FOR EXAMINATION: Web Server, Database, PHP-FPM, Memory
```
After presenting the dashboard, ask the user which flagged subsystem to examine first, or recommend starting with the most critical (red) one.
Phase 2: Examination (Deep Scan)
Only scan subsystems that were flagged yellow or red in Phase 1. This is the key efficiency gain — do not waste time scanning green subsystems unless the user explicitly requests it.
Read references/diagnostic-commands.md Sections 2–9 — but only the sections matching the flagged subsystems:
- CPU flagged → read Section 2 (CPU Deep)
- Memory flagged → read Section 3 (Memory Deep)
- Disk flagged → read Section 4 (Disk Deep)
- Network flagged → read Section 5 (Network Deep)
- Web Server flagged → read Section 6 (Web Server Deep)
- Database flagged → read Section 7 (Database Deep)
- PHP-FPM flagged → read Section 8 (PHP-FPM Deep)
- Docker flagged → read Section 9 (Docker Deep)
If the environment is RunCloud, Docker Compose, or WordPress/WooCommerce, also read references/stack-profiles.md for stack-specific diagnostic paths and file locations.
Examination Output Format
For each examined subsystem, present findings in this structure:
```
=== EXAMINATION: [Subsystem Name] ===

Evidence:
- [metric]: [value] ([what this means])
- [metric]: [value] ([what this means])

Top Offenders:
1. [process/query/connection]: [resource usage]
2. [process/query/connection]: [resource usage]

Log Signals:
- [timestamp] [source]: [relevant line]
- [timestamp] [source]: [relevant line]
```
Present numbers with context — "78% memory used" is meaningless without knowing total RAM and what is consuming it. Always answer: what is happening, how severe is it, and what is causing it.
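For example, the memory figure can be given context straight from `/proc/meminfo` (a sketch; `MemAvailable` is the kernel's own estimate of reclaimable memory, so `total - available` approximates "used excluding cache"):

```shell
# Memory used % excluding reclaimable cache, plus the absolute figures,
# so "78% used" always arrives with "of how much".
mem_context() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {
    printf "Used: %d%% (%.1f/%.1f GB, excl. cache)\n",
           (t - a) * 100 / t, (t - a) / 1048576, t / 1048576
  }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  mem_context
fi
```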
Phase 3: Diagnosis (Root Cause Analysis)
This is the most valuable phase. The goal is not to list problems — it is to find the root cause that, once fixed, resolves the cascade of symptoms.
Correlation Patterns
Cross-reference findings from Phase 2 against these common patterns:
| Pattern | Symptom Chain | Likely Root Cause |
|---|---|---|
| Memory cascade | High swap → high IO wait → everything slow | Memory leak or undersized RAM |
| Query cascade | Slow queries → PHP worker wait → web queue → 504s | Missing index or unoptimized query |
| Worker exhaustion | PHP-FPM queue → web 502 → user timeouts | max_children too low OR slow upstream (DB) |
| Disk pressure | Disk > 90% → DB write stall → cascade failure | Unrotated logs or temp file accumulation |
| Noisy neighbor | High %steal → inconsistent latency spikes | Cloud hypervisor oversubscription |
| Connection storm | TIME_WAIT flood → port exhaustion → refused connections | Missing connection reuse settings |
The pattern table is a starting point — real systems often combine multiple patterns. Trace the evidence through the chain and identify the single change that would have the most impact.
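As an illustration (not a substitute for judgment), the first rows of the table can be pre-screened mechanically from Phase 1 numbers; the helper names are our own and the thresholds come from the vital-signs table:

```shell
# Flag the "memory cascade" pattern: red swap AND red IO wait together.
memory_cascade_suspected() {
  swap_mb=$1 iowait_pct=$2
  if [ "$swap_mb" -gt 500 ] && [ "$iowait_pct" -gt 25 ]; then
    echo "SUSPECTED: memory pressure is likely driving the IO wait"
  else
    echo "not indicated"
  fi
}

# Flag the "worker exhaustion" pattern: full PHP-FPM pool with a backed-up queue.
worker_exhaustion_suspected() {
  active=$1 max_children=$2 queue=$3
  if [ "$active" -ge "$max_children" ] && [ "$queue" -gt 10 ]; then
    echo "SUSPECTED: check whether workers are slow (upstream DB) or too few"
  else
    echo "not indicated"
  fi
}
```

A positive flag only narrows the search; the causal chain still has to be confirmed with the Phase 2 evidence.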
Diagnosis Output Format
```
=== DIAGNOSIS ===
Root Cause: [one clear sentence]
Confidence: HIGH / MEDIUM / LOW
Category: [memory | cpu | disk | network | config | query | capacity]

Causal Chain:
[Root Cause]
  → [Intermediate effect 1]
  → [Intermediate effect 2]
  → [User-visible symptom]

Supporting Evidence:
1. [specific metric with number from Phase 2]
2. [specific metric with number from Phase 2]
3. [log line or observation that confirms the chain]
```
If confidence is MEDIUM or LOW, explain what additional data would increase confidence and suggest specific commands to gather it.
Phase 4: Treatment (Optimize)
Step 1: Backup Checkpoint (MANDATORY)
Before making ANY changes, create a safety checkpoint. This is non-negotiable — even "safe" changes can have unexpected effects on production systems.
```shell
# Database backup (adjust for what's installed)
mysqldump --all-databases --single-transaction > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql 2>/dev/null
# or: pg_dumpall > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql

# Config backup (include only detected services)
tar czf /root/sysmedic-config-$(date +%Y%m%d-%H%M%S).tar.gz \
  /etc/nginx/ /etc/mysql/ /etc/php/ /etc/redis/ \
  /usr/local/lsws/conf/ /etc/nginx-rc/ \
  2>/dev/null

# Save current vital signs for before/after comparison
# (reuse Phase 1 output, save to /root/sysmedic-vitals-before.txt)
```
Tell the user what was backed up, where it is, and how large the backup is.
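Before reporting success, verify the dump is non-empty (a sketch; a zero-byte `mysqldump` output usually means authentication failed and the error message went to `/dev/null`):

```shell
# Refuse to proceed to Treatment if a backup file is empty or missing.
backup_ok() {
  f=$1
  if [ -s "$f" ]; then
    echo "ok: $f ($(du -h "$f" | cut -f1))"
  else
    echo "FAILED: $f is empty or missing - do not apply fixes"
    return 1
  fi
}
```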
Step 2: Present Recommendations
Read references/optimization-playbooks.md to get specific fix recipes for the diagnosed root cause category.
Sort recommendations by impact and present in this format:
```
=== TREATMENT PLAN ===

[HIGH IMPACT] Fix: [description]
  Why: [links to root cause from Phase 3]
  Command: [exact command or config change]
  Risk: LOW / MEDIUM / HIGH
  Reversible: YES / NO (rollback: [how])
  Auto-apply: YES (safe config change) / NO (needs manual action)

[MEDIUM IMPACT] Fix: [description]
  ...

MANUAL-ONLY (cannot auto-apply):
⚠ [description] — Reason: [requires hardware/vendor/architecture change]
```
Step 3: Apply Fixes
Rules for applying changes:
- ALWAYS ask for user confirmation before applying ANY change
- Apply one fix at a time — never batch multiple changes
- After each fix, verify it took effect (re-check the specific metric)
- For service restarts: warn about brief downtime, confirm timing with user
- For sysctl changes: apply live with `sysctl -w` and persist to `/etc/sysctl.d/99-sysmedic.conf`
- For config file changes: always create a dated backup of the specific file first
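For the sysctl rule, a dry-run generator keeps the apply, persist, and rollback commands together for the user's confirmation (a sketch; `plan_sysctl_change` is our own name, and `net.ipv4.tcp_fin_timeout 30` is only an illustrative change):

```shell
# Emit the apply, persist, and rollback commands for one sysctl change,
# taking the current value as an argument so the rollback line is concrete.
plan_sysctl_change() {
  key=$1 new=$2 old=$3
  echo "sysctl -w $key=$new"
  echo "echo '$key = $new' >> /etc/sysctl.d/99-sysmedic.conf"
  echo "# rollback: sysctl -w $key=$old"
}

# On a live host, read the current value first (needs root only to apply):
#   plan_sysctl_change net.ipv4.tcp_fin_timeout 30 "$(sysctl -n net.ipv4.tcp_fin_timeout)"
```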
Step 4: Verify Improvements
After all approved fixes are applied, re-run Phase 1 vital signs and present a before/after comparison:
```
=== BEFORE / AFTER COMPARISON ===
             BEFORE           AFTER             CHANGE
CPU          [YELLOW] 82%     [GREEN] 45%       -37% ✓
Memory       [RED] 94%        [GREEN] 62%       -32% ✓
Database     [RED] 12 slow/m  [GREEN] 0 slow/m  Resolved ✓
PHP-FPM      [RED] Queue: 47  [GREEN] Queue: 0  Resolved ✓
Web Server   [RED] 96% busy   [YELLOW] 72% busy Improved (monitor)
```
If any subsystem is still yellow or red after treatment, explain why and suggest next steps (further investigation, monitoring period, or manual action needed).
Reference Files
references/diagnostic-commands.md
Read when: Phase 1 (always — Section 1) and Phase 2 (only the sections matching flagged subsystems).
Contains: Complete command catalog organized by subsystem. Each command includes what it returns, how to interpret the output, and fallback commands for minimal installs.
Sections: 1-Vital Signs, 2-CPU Deep, 3-Memory Deep, 4-Disk Deep, 5-Network Deep, 6-Web Server Deep, 7-Database Deep, 8-PHP-FPM Deep, 9-Docker Deep.
references/optimization-playbooks.md
Read when: Phase 4 (Treatment), after the root cause category is identified.
Contains: Fix recipes organized by root cause — memory pressure, CPU saturation, disk I/O, slow queries, worker exhaustion, connection management, cache optimization, kernel tuning. Each recipe includes exact commands, risk level, reversibility, and verification steps.
Read Section 1 (Safety Protocol) before applying any fix.
references/stack-profiles.md
Read when: Phase 2 or Phase 4, if the environment is RunCloud, Docker Compose, or WordPress/WooCommerce.
Contains: Stack-specific diagnostic paths, file locations, and optimization recipes: RunCloud per-app attribution, Docker host-vs-container diagnosis, WordPress/WooCommerce database bloat and caching.
Section 4 (Detection Logic) helps identify the full stack automatically.
Important Reminders
- NEVER make changes without a backup checkpoint. Even sysctl changes should be preceded by saving the current value.
- NEVER restart a production database or web server without explicit user confirmation and agreement on timing.
- Phase 2 is SELECTIVE. Only scan flagged subsystems. Scanning everything wastes time and produces noise that obscures the real issue.
- Handle missing tools gracefully. Not every server has `iostat`, `iotop`, or `nethogs`. Every diagnostic has a fallback using `/proc` or basic coreutils (`ps`, `free`, `df`, `ss`). Check with `command -v` before running optional tools.
- Work across distros. Commands must work on both Ubuntu/Debian AND CentOS/RHEL. Use existence checks before distro-specific tools (e.g., `apt` vs `yum`).
- RunCloud servers have non-standard paths. Check for `/home/runcloud/` and use `/etc/nginx-rc/` instead of `/etc/nginx/` when RunCloud is detected.
- For Docker environments, diagnose both host and containers. High host CPU with low container CPU means the problem is outside the containers.
- The diagnosis is the most valuable output. A correct root cause saves hours of guessing. Invest time in correlating evidence, not just listing metrics.
- If unsure, say so. A MEDIUM or LOW confidence diagnosis with a suggestion for further investigation is more useful than a wrong HIGH confidence diagnosis.
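As one example of the `/proc` fallback rule above, established TCP connections can be counted without `ss` or `netstat` (a sketch; state `01` is ESTABLISHED in `/proc/net/tcp*`):

```shell
# Count established TCP connections straight from /proc (IPv4 + IPv6).
count_established() {
  awk 'FNR > 1 && $4 == "01"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l
}
count_established
```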