# Alerting & On-Call
Configure effective alerting and on-call management for production systems.
## When to Use This Skill
Use this skill when:
- Setting up alerting rules and thresholds
- Configuring on-call rotations and schedules
- Implementing alert routing and escalation
- Reducing alert fatigue
- Managing incident response workflows
## Prerequisites
- Monitoring system (Prometheus, Datadog, etc.)
- On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
- Communication channels (Slack, email)
## Alerting Best Practices

### Alert Categories
```yaml
# Severity levels
critical:
  examples:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up
high:
  examples:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours
medium:
  examples:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day
low:
  examples:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
### Alert Design Principles
```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps
  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior
  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance
  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
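
To make "alert on symptoms, not causes" concrete, here is a minimal sketch contrasting the two: a symptom-level availability alert that pages, and a cause-level disk alert that only notifies. The `checkout` service label, the thresholds, and the node_exporter filesystem metrics are assumptions for illustration, not part of this skill's canonical rules.

```yaml
groups:
  - name: symptom_vs_cause_example
    rules:
      # Symptom: users are failing to check out -> page someone
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
      # Cause: disk filling up -> useful context, but not a page by itself
      - alert: RootDiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: medium
```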
## Prometheus Alerting

### Alert Rules
```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h
    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours
    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'
  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
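
If planned maintenance regularly produces noise, routes can also be muted for a window using `mute_time_intervals`. A sketch of the extra fragments that would be merged into the config above; the `maintenance="true"` label, window name, and times are assumptions.

```yaml
route:
  routes:
    # Silence alerts explicitly labelled as maintenance during the window
    - match:
        maintenance: "true"
      receiver: 'default-receiver'
      mute_time_intervals:
        - maintenance-window

time_intervals:
  - name: maintenance-window
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
```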
## PagerDuty Integration

### Service Configuration
```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
### Schedule Configuration
resource "pagerduty_schedule" "primary" {
name = "Primary On-Call"
time_zone = "America/New_York"
layer {
name = "Weekly Rotation"
start = "2024-01-01T00:00:00-05:00"
rotation_virtual_start = "2024-01-01T00:00:00-05:00"
rotation_turn_length_seconds = 604800 # 1 week
users = [for user in pagerduty_user.oncall : user.id]
}
# Override layer for holidays
layer {
name = "Holiday Coverage"
start = "2024-01-01T00:00:00-05:00"
rotation_virtual_start = "2024-01-01T00:00:00-05:00"
rotation_turn_length_seconds = 86400
users = [pagerduty_user.holiday_coverage.id]
restriction {
type = "daily_restriction"
start_time_of_day = "00:00:00"
duration_seconds = 86400
start_day_of_week = 0 # Sunday
}
}
}
## Grafana OnCall

### Integration Setup
```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
### Escalation Chain
```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```
## Alert Templates

### Slack Alert Template
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
### PagerDuty Details Template
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
## On-Call Best Practices

### Rotation Guidelines
```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"
  responsibilities:
    - Monitor alerts during shift
    - "Respond within SLA (critical: 5 min, high: 15 min)"
    - Document incidents
    - Hand off unresolved issues
  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents
  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```
### Runbook Template

````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation

### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If database related:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````
## Alert Fatigue Reduction

### Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation
  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows
  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling
  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
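
To turn "base thresholds on SLOs" into an actual rule, a common pattern is a multi-window burn-rate alert that only pages while the error budget is being consumed much faster than the SLO allows. A sketch assuming a 99.9% availability SLO and the `http_requests_total` metric used earlier; the 14.4x factor and window sizes follow the widely used multi-window approach and should be tuned to your own SLO.

```yaml
groups:
  - name: slo_burn_rate_example
    rules:
      # Fires only when errors burn a 99.9% budget ~14x faster than allowed,
      # sustained over both a long (1h) and a short (5m) window to ignore brief blips.
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget for the 99.9% availability SLO is burning too fast"
```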
## Common Issues

### Issue: Alert Storm

- **Problem:** Too many alerts firing simultaneously
- **Solution:** Implement proper grouping and inhibition rules

### Issue: Missed Alerts

- **Problem:** Critical alerts not reaching on-call
- **Solution:** Test escalation policies, verify contact methods

### Issue: False Positives

- **Problem:** Alerts firing without actual issues
- **Solution:** Tune thresholds, increase evaluation windows
## Best Practices
- Define clear severity levels
- Every alert needs a runbook
- Test on-call notifications regularly
- Review and tune alerts weekly
- Implement proper escalation paths
- Use alert grouping and inhibition
- Track alert metrics such as MTTR and alert frequency (see the recording-rule sketch after this list)
- Practice incident response regularly
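
For tracking alert frequency, a minimal sketch of a recording rule built on Prometheus's built-in `ALERTS` series; the rule name and grouping labels are arbitrary choices.

```yaml
groups:
  - name: alerting_meta_metrics
    rules:
      # Number of currently firing alerts, broken down by alert name and severity.
      # Graph this over weeks to spot the noisiest rules.
      - record: alerting:alerts_firing:count
        expr: count(ALERTS{alertstate="firing"}) by (alertname, severity)
```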
## Related Skills
- `prometheus-grafana` - Monitoring setup
- `incident-response` - Incident handling
- `runbook-creation` - Writing and maintaining runbooks