Server Management

Server management principles for production operations. Learn to THINK, not memorize commands.


1. Process Management Principles

Tool Selection

| Scenario | Tool |
|---|---|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |

Process Management Goals

| Goal | What It Means |
|---|---|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
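
To make the "restart on crash" goal concrete, here is a minimal sketch of what a process manager does on your behalf. It is not how PM2 or systemd are implemented; the names `supervise`, `max_restarts`, and `base_delay` are hypothetical and chosen only for illustration. In production you would normally rely on one of the tools above rather than a hand-written loop.

```python
import subprocess
import sys
import time

# Hypothetical example command: a child process that always crashes.
CRASHING_COMMAND = [sys.executable, "-c", "raise SystemExit(1)"]


def supervise(command, max_restarts=3, base_delay=0.1):
    """Run `command`, restarting it after every non-zero exit.

    Returns the number of restarts performed. Stops when the process
    exits cleanly or once `max_restarts` restarts have been used up.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover from
        if restarts >= max_restarts:
            return restarts  # give up instead of crash-looping forever
        # Exponential backoff so a broken process does not spin the CPU.
        time.sleep(base_delay * (2 ** restarts))
        restarts += 1
```

The restart cap and the backoff are the two design choices worth noticing: without them, a process that fails on startup would be restarted in a tight loop and hide the real problem.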

2. Monitoring Principles

What to Monitor

| Category | Key Metrics |
|---|---|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
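
As an illustration of the performance and error metrics above, the following sketch derives throughput, error rate, and a 95th-percentile response time from a window of request samples. The function name `summarize` and the `(duration_ms, status_code)` sample format are assumptions made for this example, and counting only 5xx responses as errors is one possible convention among several.

```python
def summarize(samples, window_seconds):
    """Summarize request samples collected over `window_seconds`.

    `samples` is a list of (duration_ms, status_code) tuples.
    """
    total = len(samples)
    if total == 0:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_ms": None}

    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for _, status in samples if status >= 500)

    # Nearest-rank percentile, computed with integer arithmetic:
    # rank = ceil(0.95 * total), converted to a zero-based index.
    durations = sorted(duration for duration, _ in samples)
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1]

    return {
        "throughput_rps": total / window_seconds,
        "error_rate": errors / total,
        "p95_ms": p95,
    }
```

A percentile is used instead of an average because a handful of very slow requests can hide behind a healthy-looking mean.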

Alert Severity Strategy

| Level | Response |
|---|---|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
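
One way to apply these severity levels is to map each metric onto them with explicit thresholds. The sketch below assumes a hypothetical `Thresholds` type, and the disk usage numbers are placeholders rather than recommendations; tune them to your own system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    critical: float


# Placeholder values for disk usage in percent; adjust for your environment.
DISK_USAGE = Thresholds(warning=80.0, critical=95.0)


def classify(value, thresholds):
    """Return the alert level for a metric where higher values are worse."""
    if value >= thresholds.critical:
        return "critical"  # immediate action
    if value >= thresholds.warning:
        return "warning"   # investigate soon
    return "info"          # review daily
```

Keeping the thresholds in data rather than scattered through `if` statements makes them easy to review and change.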

Monitoring Tool Selection

| Need | Options |
|---|---|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

3. Log Management Principles

Log Strategy

| Log Type | Purpose |
|---|---|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |

Log Principles

  1. Rotate logs so they cannot fill the disk
  2. Use structured logging (JSON) so logs can be parsed by machines
  3. Use appropriate levels (error/warn/info/debug)
  4. Keep sensitive data out of logs
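
The four principles above can be sketched with Python's standard `logging` module. This is a minimal example, not a complete logging setup: the `JsonFormatter` class, the `build_logger` helper, the `context` field, and the list of sensitive keys are all assumptions introduced for illustration.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical deny-list; extend it to match your own data.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"context": {...}}.
        context = getattr(record, "context", {})
        for key, value in context.items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(path, max_bytes=1_000_000, backups=5):
    """Create an INFO-level logger that rotates its file by size."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]  # replace handlers to avoid duplicate lines
    return logger
```

A call such as `logger.info("login", extra={"context": {"user": "alice", "password": "hunter2"}})` would write the user name but replace the password with `[REDACTED]`. Redacting by key name is only a safety net; the more reliable habit is never to pass secrets to the logger in the first place.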

4. Scaling Decisions

When to Scale

| Symptom | Solution |
|---|---|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |

Scaling Strategy

| Type | When to Use |
|---|---|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
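
To show the kind of decision an auto-scaler makes, here is a sketch of a simple proportional rule: scale the instance count by the ratio of observed CPU to target CPU, then clamp the result. The function name `desired_instances`, the 60% target, and the bounds are assumptions for illustration; real auto-scalers add cooldowns and smoothing on top of a rule like this.

```python
import math


def desired_instances(current, cpu_percent, target_percent=60.0,
                      minimum=1, maximum=10):
    """Suggest an instance count that brings average CPU near the target.

    `current` is the number of running instances and `cpu_percent`
    their average CPU utilization.
    """
    if current < 1 or target_percent <= 0:
        raise ValueError("current must be >= 1 and target_percent > 0")
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Clamp so a metrics glitch cannot scale to zero or without bound.
    return max(minimum, min(maximum, wanted))
```

For example, four instances running at 90% CPU with a 60% target suggest six instances. The upper bound matters as much as the formula: it caps cost when a spike turns out to be a bug or an attack.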

5. Health Check Principles

What Constitutes Healthy

| Check | Meaning |
|---|---|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |

Health Check Implementation

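  • Simple: return 200 whenever the process can answer requests
  • Deep: verify every dependency before reporting healthy
  • Choose based on what the load balancer needs to know before routing traffic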
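
The difference between a simple and a deep check can be sketched with the standard library's HTTP server. The paths `/healthz` and `/readyz`, the `deep_health` helper, and the `checks` mapping are assumptions made for this example; your framework or load balancer may expect different names.

```python
import json
from http.server import BaseHTTPRequestHandler


def deep_health(checks):
    """Run every dependency check; return (HTTP status, per-check results).

    `checks` maps a dependency name to a zero-argument callable that
    returns a truthy value when the dependency is usable.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical registry, e.g. {"database": ping_db, "cache": ping_cache}.
    checks = {}

    def do_GET(self):
        if self.path == "/healthz":   # simple: the process is answering
            status, body = 200, {"status": "ok"}
        elif self.path == "/readyz":  # deep: dependencies are verified
            status, body = deep_health(self.checks)
        else:
            status, body = 404, {"error": "not found"}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()` from `http.server` would serve both endpoints. Keep the deep check cheap, since a load balancer may call it every few seconds.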

6. Security Principles

| Area | Principle |
|---|---|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment variables or a secrets manager, never files committed to the repo |
| Audit | Log access and changes |
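
For the secrets principle, a small sketch of reading a secret from the environment and failing fast when it is missing. The names `require_secret`, `MissingSecret`, and `mask` are hypothetical helpers, and `DB_PASSWORD` in the usage note is only an example variable name.

```python
import os


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def require_secret(name, environ=None):
    """Return the secret stored under `name`, or raise MissingSecret.

    `environ` defaults to the process environment; passing a dict
    makes the function easy to test.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "")
    if not value:
        raise MissingSecret(f"required secret {name} is not set")
    return value


def mask(secret, visible=4):
    """Hide all but the last `visible` characters for safe display."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `require_secret("DB_PASSWORD")` during startup turns a missing secret into an immediate, obvious failure instead of a confusing error on the first database query.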

7. Troubleshooting Priority

When something's wrong:

  1. Check if running (process status)
  2. Check logs (error messages)
  3. Check resources (disk, memory, CPU)
  4. Check network (ports, DNS)
  5. Check dependencies (database, APIs)
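
The ordering above matters because each step is cheaper and more likely to explain the problem than the next. As a sketch, the checks can be written as small functions and run in priority order, stopping at the first one that fails. All function names here are hypothetical, and the checks are deliberately basic.

```python
import shutil
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dns_resolves(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def disk_percent_used(path="/"):
    """Return the percentage of disk space used at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def first_failure(steps):
    """Run (name, check) pairs in order; return the first failing name.

    Returns None when every check passes.
    """
    for name, check in steps:
        if not check():
            return name
    return None
```

A run might pass steps such as `("disk", lambda: disk_percent_used("/") < 90)` followed by `("network", lambda: port_open("127.0.0.1", 5432))`; the port number and the 90% limit are placeholders, not recommendations.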

8. Anti-Patterns

āŒ Don't āœ… Do
Run as root Use non-root user
Ignore logs Set up log rotation
Skip monitoring Monitor from day one
Manual restarts Auto-restart config
No backups Regular backup schedule

Remember: A well-managed server is boring. That's the goal.
