sysmedic — System Performance Diagnosis & Optimization

Diagnose and optimize Linux server performance using a Medical Model pipeline. Four phases: Vital Signs (quick dashboard), Examination (deep scan of flagged subsystems), Diagnosis (root cause correlation), and Treatment (safe optimization with backup-first protocol).

Why This Skill Exists

Server performance issues present as symptoms (504 errors, slow pages, high load) but the root cause is often in a different subsystem entirely. A slow website might trace back through: MySQL slow queries → PHP-FPM workers blocked waiting → web server connection queue → user-facing 504 errors. Fixing the web server config alone would miss the real problem. This skill follows the same diagnostic method a doctor uses: check vital signs first, examine only what looks abnormal, diagnose the root cause by correlating evidence across subsystems, then treat with safety-first protocol.

Core Pipeline

Phase 1: VITAL SIGNS   → 10-second CTO dashboard, traffic-light per subsystem
Phase 2: EXAMINATION   → Deep scan ONLY subsystems flagged yellow/red
Phase 3: DIAGNOSIS     → Cross-subsystem root cause correlation
Phase 4: TREATMENT     → Backup → Recommend → Manual flags → Auto-fix w/ confirmation

Phase 1: Vital Signs (CTO Dashboard)

Run a quick health check across all 8 subsystems. Each check should complete in under 2 seconds. The goal is triage — identify which subsystems need deeper investigation.

Read references/diagnostic-commands.md Section 1 (Vital Signs Commands) to get the exact commands for each subsystem. Before running commands, detect what is installed:

# Stack detection (run first)
CORES=$(nproc)
WEB=$(command -v nginx >/dev/null 2>&1 && echo nginx || (command -v httpd >/dev/null 2>&1 && echo apache || (test -f /usr/local/lsws/bin/litespeed && echo litespeed || echo none)))
DB=$(command -v mysql >/dev/null 2>&1 && echo mysql || (command -v psql >/dev/null 2>&1 && echo postgresql || echo none))
PHPFPM=$(pgrep -x "php-fpm" >/dev/null 2>&1 && echo yes || echo no)
DOCKER=$(command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1 && echo yes || echo no)
RUNCLOUD=$(test -d /home/runcloud && echo yes || echo no)

Threshold Table

Subsystem    Metric                     Green     Yellow       Red
----------   ------------------------   -------   ----------   -------
CPU          Load avg / cores           < 0.7     0.7–1.0      > 1.0
CPU          %steal (cloud VPS)         < 3%      3–10%        > 10%
Memory       Used % (excl. cache)       < 70%     70–85%       > 85%
Memory       Swap used                  < 50MB    50–500MB     > 500MB
Disk         Partition usage            < 75%     75–90%       > 90%
Disk         IO wait %                  < 10%     10–25%       > 25%
Network      Established connections    < 500     500–2000     > 2000
Network      TIME_WAIT count            < 1000    1000–5000    > 5000
Web Server   Active workers / max       < 70%     70–90%       > 90%
Web Server   5xx errors / min           < 1       1–10         > 10
Database     Connections / max          < 60%     60–80%       > 80%
Database     Slow queries / min         < 1       1–10         > 10
PHP-FPM      Active / max_children      < 70%     70–85%       > 85%
PHP-FPM      Listen queue length        0         1–10         > 10
Docker       Container restarts (24h)   0         1–3          > 3
Docker       Unhealthy containers       none      -            any
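These thresholds can be applied mechanically. A minimal sketch (the `classify` helper is illustrative, not part of the skill's reference files) that grades the CPU load-per-core ratio against the 0.7 / 1.0 thresholds:

```shell
# classify <value> <yellow_threshold> <red_threshold> -> GREEN/YELLOW/RED
classify() {
  awk -v v="$1" -v y="$2" -v r="$3" 'BEGIN {
    if (v >= r)      print "RED"
    else if (v >= y) print "YELLOW"
    else             print "GREEN"
  }'
}

# Example: 1-minute load average per core vs the CPU thresholds
CORES=$(nproc)
LOAD=$(awk '{print $1}' /proc/loadavg)
RATIO=$(awk -v l="$LOAD" -v c="$CORES" 'BEGIN {printf "%.2f", l / c}')
echo "CPU $(classify "$RATIO" 0.7 1.0) load/core=$RATIO"
```

The same helper works for any numeric row of the table; percentage metrics just pass 70/85 (or similar) as the thresholds.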

Dashboard Output Format

Present results in this format (skip subsystems that are not installed):

=== SYSMEDIC VITAL SIGNS === [hostname] [date/time] ===

CPU       [GREEN]  Load: 1.2/4 cores (0.30) | User: 32% | Steal: 0%
Memory    [YELLOW] Used: 78% (3.1/4.0 GB) | Swap: 120MB active
Disk      [GREEN]  / 45% | /var 62% | IO wait: 0.3%
Network   [GREEN]  Conns: 234 established | TIME_WAIT: 89
Web       [RED]    LiteSpeed workers: 48/50 (96%) | 5xx: 23/min
Database  [YELLOW] MySQL conns: 120/151 (79%) | Slow: 8/min
PHP-FPM   [RED]    Workers: 30/30 (100%) | Queue: 47 waiting
Docker    [GREEN]  Containers: 3/3 healthy | Restarts: 0

FLAGGED FOR EXAMINATION: Web Server, Database, PHP-FPM, Memory

After presenting the dashboard, ask the user which flagged subsystem to examine first, or recommend starting with the most critical (red) one.


Phase 2: Examination (Deep Scan)

Only scan subsystems that were flagged yellow or red in Phase 1. This is the key efficiency gain — do not waste time scanning green subsystems unless the user explicitly requests it.

Read references/diagnostic-commands.md Sections 2–9 — but only the sections matching the flagged subsystems:

  • CPU flagged → read Section 2 (CPU Deep)
  • Memory flagged → read Section 3 (Memory Deep)
  • Disk flagged → read Section 4 (Disk Deep)
  • Network flagged → read Section 5 (Network Deep)
  • Web Server flagged → read Section 6 (Web Server Deep)
  • Database flagged → read Section 7 (Database Deep)
  • PHP-FPM flagged → read Section 8 (PHP-FPM Deep)
  • Docker flagged → read Section 9 (Docker Deep)

If the environment is RunCloud, Docker Compose, or WordPress/WooCommerce, also read references/stack-profiles.md for stack-specific diagnostic paths and file locations.

Examination Output Format

For each examined subsystem, present findings in this structure:

=== EXAMINATION: [Subsystem Name] ===

Evidence:
  - [metric]: [value] ([what this means])
  - [metric]: [value] ([what this means])

Top Offenders:
  1. [process/query/connection]: [resource usage]
  2. [process/query/connection]: [resource usage]

Log Signals:
  - [timestamp] [source]: [relevant line]
  - [timestamp] [source]: [relevant line]

Present numbers with context — "78% memory used" is meaningless without knowing total RAM and what is consuming it. Always answer: what is happening, how severe is it, and what is causing it.
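For instance, the memory figure can be produced with its context directly from /proc/meminfo. A sketch (relies on MemAvailable, present since kernel 3.14):

```shell
# Used % excluding cache: (MemTotal - MemAvailable) / MemTotal
mem_used_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END {printf "%.0f", 100 * (t - a) / t}' /proc/meminfo
}

# Always report the percentage together with the absolute total
TOTAL_GB=$(awk '/^MemTotal:/ {printf "%.1f", $2 / 1048576}' /proc/meminfo)
echo "Memory used: $(mem_used_pct)% of ${TOTAL_GB} GB (MemAvailable-based)"
```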


Phase 3: Diagnosis (Root Cause Analysis)

This is the most valuable phase. The goal is not to list problems — it is to find the root cause that, once fixed, resolves the cascade of symptoms.

Correlation Patterns

Cross-reference findings from Phase 2 against these common patterns:

Pattern             Symptom Chain                                        Likely Root Cause
-----------------   --------------------------------------------------   ------------------------------------------
Memory cascade      High swap → high IO wait → everything slow           Memory leak or undersized RAM
Query cascade       Slow queries → PHP worker wait → web queue → 504s    Missing index or unoptimized query
Worker exhaustion   PHP-FPM queue → web 502 → user timeouts              max_children too low OR slow upstream (DB)
Disk pressure       Disk > 90% → DB write stall → cascade failure        Unrotated logs or temp file accumulation
Noisy neighbor      High %steal → inconsistent latency spikes            Cloud hypervisor oversubscription
Connection storm    TIME_WAIT flood → port exhaustion → refused conns    Missing connection reuse settings

The pattern table is a starting point — real systems often combine multiple patterns. Trace the evidence through the chain and identify the single change that would have the most impact.
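As an illustration, the connection-storm pattern can be checked directly with ss (falling back to /proc/net/tcp on minimal installs), comparing the TIME_WAIT count against the ephemeral port range:

```shell
# Count TIME_WAIT sockets (the connection-storm signature)
if command -v ss >/dev/null 2>&1; then
  TW=$(ss -tan state time-wait 2>/dev/null | tail -n +2 | wc -l)
else
  # Fallback: state code 06 in /proc/net/tcp is TIME_WAIT
  TW=$(awk '$4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l)
fi

# How many ephemeral ports exist before exhaustion becomes possible
read -r LO HI < /proc/sys/net/ipv4/ip_local_port_range
echo "TIME_WAIT: $TW of ~$((HI - LO)) usable ephemeral ports"
```

A TIME_WAIT count approaching the port-range size on a single upstream address is strong evidence for this pattern.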

Diagnosis Output Format

=== DIAGNOSIS ===

Root Cause: [one clear sentence]
Confidence: HIGH / MEDIUM / LOW
Category: [memory | cpu | disk | network | config | query | capacity]

Causal Chain:
  [Root Cause]
    → [Intermediate effect 1]
      → [Intermediate effect 2]
        → [User-visible symptom]

Supporting Evidence:
  1. [specific metric with number from Phase 2]
  2. [specific metric with number from Phase 2]
  3. [log line or observation that confirms the chain]

If confidence is MEDIUM or LOW, explain what additional data would increase confidence and suggest specific commands to gather it.


Phase 4: Treatment (Optimize)

Step 1: Backup Checkpoint (MANDATORY)

Before making ANY changes, create a safety checkpoint. This is non-negotiable — even "safe" changes can have unexpected effects on production systems.

# Database backup (adjust for what's installed)
mysqldump --all-databases --single-transaction > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql 2>/dev/null
# or: pg_dumpall > /root/sysmedic-backup-$(date +%Y%m%d-%H%M%S).sql

# Config backup (include only detected services)
tar czf /root/sysmedic-config-$(date +%Y%m%d-%H%M%S).tar.gz \
  /etc/nginx/ /etc/mysql/ /etc/php/ /etc/redis/ \
  /usr/local/lsws/conf/ /etc/nginx-rc/ \
  2>/dev/null

# Save current vital signs for before/after comparison
# (reuse Phase 1 output, save to /root/sysmedic-vitals-before.txt)

Tell the user what was backed up, where it is, and how large the backup is.
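One way to produce that report is a small helper over the checkpoint paths used above (a sketch; `report_backup` is a hypothetical name, and the glob patterns assume the filenames from the backup step):

```shell
# Print size and path for each checkpoint artifact that actually exists
report_backup() {
  for f in "$@"; do
    [ -e "$f" ] || continue
    printf '%s\t%s\n' "$(du -h "$f" | cut -f1)" "$f"
  done
}

report_backup /root/sysmedic-backup-*.sql /root/sysmedic-config-*.tar.gz
```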

Step 2: Present Recommendations

Read references/optimization-playbooks.md to get specific fix recipes for the diagnosed root cause category.

Sort recommendations by impact and present in this format:

=== TREATMENT PLAN ===

[HIGH IMPACT] Fix: [description]
  Why: [links to root cause from Phase 3]
  Command: [exact command or config change]
  Risk: LOW / MEDIUM / HIGH
  Reversible: YES / NO (rollback: [how])
  Auto-apply: YES (safe config change) / NO (needs manual action)

[MEDIUM IMPACT] Fix: [description]
  ...

MANUAL-ONLY (cannot auto-apply):
  ⚠ [description] — Reason: [requires hardware/vendor/architecture change]

Step 3: Apply Fixes

Rules for applying changes:

  • ALWAYS ask for user confirmation before applying ANY change
  • Apply one fix at a time — never batch multiple changes
  • After each fix, verify it took effect (re-check the specific metric)
  • For service restarts: warn about brief downtime, confirm timing with user
  • For sysctl changes: apply live with sysctl -w and persist to /etc/sysctl.d/99-sysmedic.conf
  • For config file changes: always create a dated backup of the specific file first
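The sysctl rule above can be sketched as a dry run that prints the rollback line, the live apply, and the persistence step for the user to review before anything is executed (the key and value shown are illustrative only, not a recommendation):

```shell
# Generate (not execute) the change sequence for one sysctl key:
# rollback comment first, then live apply, then persist-to-file.
plan_sysctl() {
  key=$1; val=$2
  # Save the current value so the change is reversible
  old=$(cat "/proc/sys/$(echo "$key" | tr . /)" 2>/dev/null)
  echo "# rollback: sysctl -w $key=$old"
  echo "sysctl -w $key=$val"
  echo "echo '$key = $val' >> /etc/sysctl.d/99-sysmedic.conf"
}

plan_sysctl net.core.somaxconn 1024   # hypothetical tunable and value
```

Printing the plan first keeps the "confirm before applying" rule enforceable: the user sees the exact commands, including the rollback value, before anything runs as root.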

Step 4: Verify Improvements

After all approved fixes are applied, re-run Phase 1 vital signs and present a before/after comparison:

=== BEFORE / AFTER COMPARISON ===

             BEFORE             AFTER              CHANGE
CPU          [YELLOW] 82%       [GREEN] 45%        -37% ✓
Memory       [RED] 94%          [GREEN] 62%        -32% ✓
Database     [RED] 12 slow/m    [GREEN] 0 slow/m   Resolved ✓
PHP-FPM      [RED] Queue: 47    [GREEN] Queue: 0   Resolved ✓
Web Server   [RED] 96% busy     [YELLOW] 72% busy  Improved (monitor)

If any subsystem is still yellow or red after treatment, explain why and suggest next steps (further investigation, monitoring period, or manual action needed).


Reference Files

references/diagnostic-commands.md

Read when: Phase 1 (always — Section 1) and Phase 2 (only sections matching flagged subsystems)
Contains: Complete command catalog organized by subsystem. Each command includes what it returns, how to interpret the output, and fallback commands for minimal installs.
Sections: 1-Vital Signs, 2-CPU Deep, 3-Memory Deep, 4-Disk Deep, 5-Network Deep, 6-Web Server Deep, 7-Database Deep, 8-PHP-FPM Deep, 9-Docker Deep.

references/optimization-playbooks.md

Read when: Phase 4 (Treatment), after the root cause category is identified
Contains: Fix recipes organized by root cause — memory pressure, CPU saturation, disk I/O, slow queries, worker exhaustion, connection management, cache optimization, kernel tuning. Each recipe includes exact commands, risk level, reversibility, and verification steps.
Section 1 (Safety Protocol) should be read before applying any fix.

references/stack-profiles.md

Read when: Phase 2 or Phase 4, if the environment is RunCloud, Docker Compose, or WordPress/WooCommerce
Contains: Stack-specific diagnostic paths, file locations, and optimization recipes: RunCloud per-app attribution, Docker host-vs-container diagnosis, WordPress/WooCommerce database bloat and caching.
Section 4 (Detection Logic) helps identify the full stack automatically.


Important Reminders

  • NEVER make changes without a backup checkpoint. Even sysctl changes should be preceded by saving the current value.
  • NEVER restart a production database or web server without explicit user confirmation and agreement on timing.
  • Phase 2 is SELECTIVE. Only scan flagged subsystems. Scanning everything wastes time and produces noise that obscures the real issue.
  • Handle missing tools gracefully. Not every server has iostat, iotop, or nethogs. Every diagnostic has a fallback using /proc or basic coreutils (ps, free, df, ss). Check with command -v before running optional tools.
  • Work across distros. Commands must work on both Ubuntu/Debian AND CentOS/RHEL. Use existence checks before distro-specific tools (e.g., apt vs yum).
  • RunCloud servers have non-standard paths. Check for /home/runcloud/ and use /etc/nginx-rc/ instead of /etc/nginx/ when RunCloud is detected.
  • For Docker environments, diagnose both host and containers. High host CPU with low container CPU means the problem is outside containers.
  • The diagnosis is the most valuable output. A correct root cause saves hours of guessing. Invest time in correlating evidence, not just listing metrics.
  • If unsure, say so. A MEDIUM or LOW confidence diagnosis with a suggestion for further investigation is more useful than a wrong HIGH confidence diagnosis.
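As an example of the missing-tools rule, IO wait can be derived without sysstat from two /proc/stat samples (a sketch; `iowait_pct` is an illustrative helper, not a reference-file command):

```shell
# Fallback for "iostat is not installed": IO wait % from /proc/stat.
# Takes two samples 1s apart and computes the iowait delta over total.
iowait_pct() {
  read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
  sleep 1
  read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
  awk -v w=$((w2 - w1)) \
      -v t=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) )) \
      'BEGIN { printf "%.1f", (t > 0) ? 100 * w / t : 0 }'
}

if command -v iostat >/dev/null 2>&1; then
  iostat -c 1 2                     # preferred when sysstat is installed
else
  echo "IO wait: $(iowait_pct)%"    # coreutils + awk fallback
fi
```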