# Service Health Check Skill

## Overview
This skill provides deterministic service health monitoring using the Discover-Check-Report pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services.
**Core principle:** Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional; always cross-check against health files and port binding.
## Instructions

### Phase 1: DISCOVER

**Goal:** Identify all services to check before running any health probes.

#### Step 1: Locate service definitions
Search for service configuration in this order:
- `services.json` in project root
- Docker/docker-compose files for service definitions
- systemd unit files or process manager configs
- User-provided service specification
#### Step 2: Build service manifest
For each service, establish:
## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |
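The skill does not mandate a fixed `services.json` schema. A minimal example consistent with the manifest columns above might look like this (the field names are illustrative assumptions, not a required format):

```json
{
  "services": [
    {
      "name": "api-server",
      "pattern": "gunicorn.*app:app",
      "health_file": "/tmp/api_health.json",
      "port": 8000,
      "stale_threshold_seconds": 300
    },
    {
      "name": "cache",
      "pattern": "redis-server",
      "port": 6379
    }
  ]
}
```

Omitted fields (no `health_file`, no `port`) simply mean that signal is not checked for the service.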
**Validation constraints:**
- Each process pattern must be specific enough to avoid false matches (e.g., `python` matches every Python process); prefer narrow patterns with full command paths, distinguishing arguments, or specific binary names
- Health file paths must be absolute
- Port numbers must be valid (1-65535)
#### Step 3: Validate manifest

Confirm each entry passes the constraints above. If a pattern is too broad, use `ps aux | grep` to identify distinguishing arguments, then update the pattern. A validation sketch appears below.
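A minimal validation sketch, assuming manifest entries are dicts keyed like the table columns above (the key names and the broad-pattern list are assumptions of this sketch):

```python
from pathlib import PurePosixPath

# Example patterns far too broad to identify a single service.
OVERLY_BROAD = {"python", "node", "java", "sh"}

def validate_entry(entry: dict) -> list[str]:
    """Return constraint violations for one manifest entry (empty list = valid)."""
    problems = []
    pattern = entry.get("pattern", "")
    if not pattern or pattern in OVERLY_BROAD:
        problems.append("process pattern missing or too broad")
    health_file = entry.get("health_file")
    if health_file and not PurePosixPath(health_file).is_absolute():
        problems.append("health file path must be absolute")
    port = entry.get("port")
    if port is not None and not 1 <= port <= 65535:
        problems.append("port must be within 1-65535")
    return problems
```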
**Gate:** Service manifest complete with at least one service. Proceed only when gate passes.
### Phase 2: CHECK

**Goal:** Gather health signals for every service in the manifest. Always check process status independently of health file content; a running process and a healthy health file are separate signals.

#### Step 1: Check process status
For each service, run the process check:

```bash
pgrep -f "<process_pattern>"
```
Record: running (true/false), PIDs, process count.
**Rationale:** Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient; the service may have crashed internally or failed to bind to its port.
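A sketch of the process check, wrapping `pgrep` as shown above (the return shape is an assumption reused by later sketches):

```python
import subprocess

def check_process(pattern: str) -> dict:
    """Gather process evidence for one service: running flag, PIDs, count."""
    # pgrep exits non-zero when nothing matches; that is not an error here.
    result = subprocess.run(
        ["pgrep", "-f", pattern],
        capture_output=True,
        text=True,
    )
    pids = [int(line) for line in result.stdout.split()]
    return {"running": bool(pids), "pids": pids, "count": len(pids)}
```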
#### Step 2: Parse health files (if configured)
Read and parse JSON health files. Evaluate:
- Does the file exist?
- Does it parse as valid JSON?
- How old is the timestamp (staleness)? Default stale threshold is 300 seconds.
- What status does the service self-report?
- What is the connection state?
**Critical constraint:** Never trust health file content alone. The file could be stale from before a process crash. Always verify (see the sketch after this list):
- Process is still running
- Health file timestamp is fresh (within configured threshold)
- Status field matches evidence (e.g., "error" requires restart)
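A parsing and staleness sketch (the timestamp handling assumes ISO8601 values as described in the reference section below):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STALE_THRESHOLD_S = 300  # default; configurable per service

def read_health_file(path: str) -> dict:
    """Parse a JSON health file and derive staleness.

    Returns evidence only; callers must still verify the process
    independently, since a plausible-looking file can outlive a crash.
    """
    p = Path(path)
    if not p.exists():
        return {"present": False}
    try:
        data = json.loads(p.read_text())
    except (json.JSONDecodeError, OSError) as exc:
        return {"present": True, "parse_error": str(exc)}
    if "timestamp" not in data:
        return {"present": True, "parse_error": "missing timestamp field"}
    # fromisoformat needs an explicit offset before Python 3.11, so map "Z".
    ts = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assume UTC for naive timestamps
    age = (datetime.now(timezone.utc) - ts).total_seconds()
    return {
        "present": True,
        "age_seconds": age,
        "stale": age > STALE_THRESHOLD_S,
        "status": data.get("status"),
        "connection": data.get("connection"),
    }
```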
#### Step 3: Probe ports (if configured)

Check if expected ports are listening:

```bash
ss -tlnp "sport = :<port>"
```
**Rationale:** Verify ports are actually bound. A process can start but fail to bind to its configured port; that is effectively a DOWN state, not HEALTHY.
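Where parsing `ss` output is inconvenient, a plain connect probe is one substitute; this sketch assumes the service listens on localhost:

```python
import socket

def port_listening(port: int, host: str = "127.0.0.1") -> bool:
    """Probe whether anything accepts connections on the given port."""
    try:
        # A successful TCP connect implies the port is bound and listening.
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False
```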
#### Step 4: Evaluate health per service

Apply this decision tree top to bottom; the first matching rule wins (a sketch follows the list):
- Process not running → DOWN (definitive)
- Process running + health file missing → WARNING (limited visibility, but process is alive)
- Process running + health file stale (> threshold) → WARNING (file hasn't updated in configured time, suggests no activity or crash recovery in progress)
- Process running + status=error → ERROR (restart recommended immediately)
- Process running + disconnected > 30 minutes → WARNING (long disconnect suggests stuck state, restart recommended)
- Process running + disconnected < 30 minutes → DEGRADED (allow reconnection window, monitor)
- Process running + port not listening (when port is configured) → ERROR (process running but failed to bind port)
- Process running + healthy → HEALTHY (all checks pass)
- Process running + no health file configured → RUNNING (limited visibility, process verified only)
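A sketch of the tree, reusing the evidence shapes from the earlier sketches (the argument layout is an assumption; the port rule is checked early here because a failed bind is an ERROR regardless of file content):

```python
def evaluate(proc: dict, health: dict | None, port_ok: bool | None,
             disconnect_minutes: float | None = None) -> str:
    """First-match-wins status evaluation over the gathered signals.

    health is None when no health file is configured; port_ok is None
    when no port is configured.
    """
    if not proc["running"]:
        return "DOWN"                # definitive
    if port_ok is False:
        return "ERROR"               # process alive but port not bound
    if health is None:
        return "RUNNING"             # process verified only, limited visibility
    if not health.get("present"):
        return "WARNING"             # configured health file is missing
    if health.get("parse_error") or health.get("stale"):
        return "WARNING"
    if health.get("status") == "error":
        return "ERROR"               # restart recommended immediately
    if health.get("connection") == "disconnected":
        if disconnect_minutes is not None and disconnect_minutes > 30:
            return "WARNING"         # stuck state, restart recommended
        return "DEGRADED"            # allow reconnection window, monitor
    return "HEALTHY"                 # all checks pass
```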
**Gate:** All services evaluated with evidence-based status. No status is determined without a concrete signal (process check, health file, or port probe). Proceed only when gate passes.
### Phase 3: REPORT

**Goal:** Produce a structured, actionable health report with specific remediation commands.

#### Step 1: Generate summary
```
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name       [OK  ]  HEALTHY  PID 12345, uptime 2d 4h
  background-worker  [WARN]  WARNING  Health file stale (15 min)
  cache-service      [DOWN]  DOWN     Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service
```
#### Step 2: Set exit status
- All HEALTHY/RUNNING → exit 0
- Any WARNING/DEGRADED/ERROR/DOWN → exit 1
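A minimal mapping, following the rules above:

```python
OK_STATUSES = {"HEALTHY", "RUNNING"}

def exit_code(statuses: list[str]) -> int:
    """0 only when every service is HEALTHY or RUNNING; 1 otherwise."""
    return 0 if all(s in OK_STATUSES for s in statuses) else 1
```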
#### Step 3: Present to user
- Lead with the summary line (X/N healthy)
- Highlight any services needing action
- Provide copy-pasteable commands for remediation
- Never auto-restart without an explicit user flag. Always report findings first and let the user decide.
**Gate:** Report delivered with actionable recommendations for all non-healthy services.
## Examples

### Example 1: Routine Health Check

**User says:** "Are all services up?"
**Actions:**
- Locate services.json, build manifest (DISCOVER)
- Check each process, parse health files, probe ports (CHECK)
- Output structured report showing 3/3 healthy (REPORT)

**Result:** Clean report, no action needed
### Example 2: Stale Worker Detection

**User says:** "The background worker seems stuck"
**Actions:**
- Identify worker service from config (DISCOVER)
- Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree
- Report WARNING with restart recommendation (REPORT)

**Result:** Specific diagnosis with actionable command
## Error Handling

### Error: "No Service Configuration Found"

**Cause:** No services.json, docker-compose, or systemd units discovered
**Solution:**
- Ask user for service name and process pattern
- Build minimal manifest from user input
- Proceed with manual configuration
Error: "Process Pattern Matches Too Many PIDs"
Cause: Pattern too broad (e.g., "python" matches all Python processes) Solution:
- Narrow pattern with full command path or arguments
- Use `ps aux | grep` to identify distinguishing arguments
- Update manifest with a more specific pattern
- **Rationale:** False positives hide real failures. Specificity is required to avoid misdiagnosis.
Error: "Health File Exists But Cannot Parse"
Cause: Malformed JSON, permissions issue, or file being written during read Solution:
- Check file permissions with
ls -la - Attempt raw read to inspect content
- If mid-write, retry after 2-second delay
- Report as WARNING with parse error details
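A retry sketch for the mid-write case, using the 2-second delay above:

```python
import json
import time
from pathlib import Path

def read_with_retry(path: str, delay_s: float = 2.0) -> dict | None:
    """Retry once after a short delay in case the file was mid-write."""
    for attempt in range(2):
        try:
            return json.loads(Path(path).read_text())
        except json.JSONDecodeError:
            if attempt == 0:
                time.sleep(delay_s)
    return None  # still unparseable: report WARNING with parse details
```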
## References

### Health File Format Reference
Services should write health files as:
```json
{
  "timestamp": "ISO8601, updated every 30-60s",
  "status": "healthy|degraded|error",
  "connection": "connected|disconnected|reconnecting",
  "last_activity": "ISO8601 of last meaningful action",
  "running": true,
  "uptime_seconds": 12345,
  "metrics": {}
}
```
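On the writer side, an atomic write (temp file plus rename) avoids the mid-write parse failures described under Error Handling; this is a sketch, not a required implementation:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def write_health_file(path: str, status: str = "healthy",
                      connection: str = "connected") -> None:
    """Write the health file atomically so readers never see partial JSON."""
    now = datetime.now(timezone.utc).isoformat()
    payload = {
        "timestamp": now,
        "status": status,
        "connection": connection,
        "last_activity": now,
        "running": True,
        "uptime_seconds": 0,  # fill from the service's own start time
        "metrics": {},
    }
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.chmod(tmp, 0o644)  # mkstemp defaults to 0600; monitors may need read access
    os.replace(tmp, path)  # atomic on POSIX within one filesystem
```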
## Key Constraints Summary
| Constraint | Rationale | Application |
|---|---|---|
| Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file |
| Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold |
| Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified |
| No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it |
| Narrow process patterns required | "python" matches all processes, giving false matches | Use full paths or specific args; validate with `ps aux \| grep` |
| Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) |