Monitoring & Observability Skill
Overview
Master the three pillars of observability: metrics, logs, and traces.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| pillar | string | No | all | Observability pillar to focus on (metrics, logs, traces, or all) |
| tool | string | No | prometheus | Tool to focus on |
Core Topics
MANDATORY
- Prometheus metrics and PromQL
- Grafana dashboards
- ELK Stack basics
- SLIs, SLOs, error budgets
- Alerting rules
OPTIONAL
- Distributed tracing
- OpenTelemetry
- Custom exporters
- Log correlation
ADVANCED
- High cardinality handling
- Recording rules
- Federation
- Continuous profiling
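The relationship between an SLI, an SLO, and an error budget listed above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based SLO measured over a single window; the function name `error_budget` and its parameters are hypothetical and chosen only for illustration.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget usage for a request-based SLO over one window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": 1.0 - failed_requests / total_requests,  # measured success ratio
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,                     # 1.0 means fully spent
        "budget_remaining": 1.0 - consumed,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
```

With a 99.9% target and one million requests, about 1,000 failures are tolerated, so 250 failures consume roughly a quarter of the budget.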
Quick Reference
# PromQL
sum(rate(http_requests_total[5m])) by (service)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Prometheus API
curl http://localhost:9090/api/v1/targets
curl 'http://localhost:9090/api/v1/query?query=up'
curl -X POST http://localhost:9090/-/reload  # requires --web.enable-lifecycle
# Alertmanager
amtool silence add alertname="HighLatency" --duration=2h --comment="investigating"
amtool alert
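The `/api/v1/query` call above can also be scripted. The following is a minimal sketch using only the Python standard library, assuming a Prometheus server at the same `http://localhost:9090` address; the helper names `build_query_url`, `parse_instant_vector`, and `instant_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL, URL-encoding the PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list[tuple[dict, float]]:
    """Run an instant query against a live server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `instant_query("http://localhost:9090", "up")` against a running server would return one pair per scraped target. Encoding the expression matters for queries that contain brackets, quotes, or spaces.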
SRE Golden Signals
| Signal | Metric |
|---|---|
| Latency | histogram_quantile(0.99, ...) |
| Traffic | sum(rate(requests_total[5m])) |
| Errors | rate(errors_total[5m]) |
| Saturation | node_memory_MemAvailable_bytes |
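The latency row relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, assuming non-negative bucket bounds and a `+Inf` bucket; it is an illustration, not Prometheus's actual implementation, and the sample bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf upper bound")

    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile

    prev_bound, prev_count = 0.0, 0.0  # assumption: the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the last finite bound, so report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets, in seconds.
sample_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
median = histogram_quantile(0.5, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest. A quantile that lands in the `+Inf` bucket is reported as the highest finite bound.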
Troubleshooting
Common Failures
| Symptom | Root Cause | Solution |
|---|---|---|
| No data | Scrape failing | Check targets page |
| Alert not firing | PromQL error | Test in UI |
| High cardinality | Too many labels | Reduce labels |
| Slow queries | Too much data | Add aggregation |
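For the high-cardinality row, it helps to see which label is responsible for the series count. The following is a minimal sketch that counts distinct values per label name, assuming the label sets have already been collected (for example, exported from the server); the function name `label_cardinality` and the sample data are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series: list[dict[str, str]]) -> dict[str, int]:
    """Count distinct values per label name, highest count first."""
    values: dict[str, set] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    ranked = sorted(((n, len(v)) for n, v in values.items()), key=lambda kv: kv[1], reverse=True)
    return dict(ranked)


# Made-up label sets for a single metric.
sample_series = [
    {"service": "api", "status": "200", "user_id": "u1"},
    {"service": "api", "status": "200", "user_id": "u2"},
    {"service": "api", "status": "500", "user_id": "u3"},
]
worst_offenders = label_cardinality(sample_series)
```

A label with an unbounded value set, such as the `user_id` label in this sample, is usually the first candidate to drop or aggregate away.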
Debug Checklist
- Check targets: /targets
- Test query in UI
- Check logs: journalctl -u prometheus
- Verify time sync (NTP)
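The first checklist item can be automated by filtering the JSON returned by `/api/v1/targets`. The following is a minimal sketch, assuming the response has already been fetched and decoded into a dictionary; the function name `unhealthy_targets` and the sample payload are hypothetical.

```python
def unhealthy_targets(payload: dict) -> list[dict]:
    """Return job, scrape URL, and last error for every target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": t.get("labels", {}).get("job", ""),
            "url": t.get("scrapeUrl", ""),
            "error": t.get("lastError", ""),
        }
        for t in targets
        if t.get("health") != "up"
    ]


# Made-up response with one healthy and one failing target.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus"}, "scrapeUrl": "http://localhost:9090/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "node"}, "scrapeUrl": "http://localhost:9100/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
failing = unhealthy_targets(sample_payload)
```

The `lastError` text is often enough to tell a network problem from a bad metrics path.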
Recovery Procedures
Prometheus OOM
- Check cardinality
- Reduce retention
- Add federation