# Datadog Monitors

Create, manage, and maintain monitors for alerting.
## Prerequisites

This skill requires `pup` in your `PATH`. See Setup Pup.
## Quick Start

```bash
pup auth login
```
## Common Operations
### List Monitors

```bash
pup monitors list

# Filter by tag
pup monitors list --tags "team:platform"

# Filter by status
pup monitors list --status "Alert"
```
### Get Monitor

```bash
pup monitors get <id> --json
```
### Create Monitor

```bash
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```
### Mute/Unmute

```bash
# Mute with a duration
pup monitors mute --id 12345 --duration 1h

# Or mute with a specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```
## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue
| Rule | Why |
|---|---|
| No flapping alerts | Use `last_Xm`, not `last_1m` |
| Meaningful thresholds | Base them on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | `@runbook-url` in the message |
```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```
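To see why a one-minute window flaps, here is a minimal, self-contained simulation (not part of the skill; the CPU series and threshold are synthetic) comparing a single-sample check against a five-sample rolling average:

```python
from collections import deque

def alert_transitions(samples, window, threshold):
    """Count OK<->ALERT state changes when alerting on the rolling mean."""
    buf = deque(maxlen=window)
    alerting, flips = False, 0
    for s in samples:
        buf.append(s)
        now_alerting = sum(buf) / len(buf) > threshold
        if now_alerting != alerting:
            flips += 1
            alerting = now_alerting
    return flips

# Spiky CPU series that mostly sits below the threshold (synthetic data)
cpu = [60, 62, 95, 61, 63, 96, 60, 62, 94, 61]

print(alert_transitions(cpu, window=1, threshold=80))  # → 6 (flaps on every spike)
print(alert_transitions(cpu, window=5, threshold=80))  # → 0 (smoothing absorbs the spikes)
```

The wider window trades a little detection latency for a dramatic drop in noise, which is exactly the `last_1m` vs `last_5m` trade-off above.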
### 2. Use Proper Scoping
```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```
### 3. Set Recovery Thresholds
```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50,
        }
    }
}
```
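A recovery threshold adds hysteresis: the monitor only returns to OK once the value drops below `critical_recovery`, not the moment it dips back under `critical`. A minimal sketch of that state machine (illustrative only, not Datadog's actual implementation):

```python
def evaluate(values, critical, critical_recovery):
    """Track ALERT/OK state with hysteresis, returning the state after each value."""
    alerting = False
    states = []
    for v in values:
        if not alerting and v > critical:
            alerting = True       # crossed the critical threshold -> ALERT
        elif alerting and v < critical_recovery:
            alerting = False      # must fall below the *recovery* threshold to clear
        states.append("ALERT" if alerting else "OK")
    return states

# Values oscillate just under critical: without hysteresis this would flap.
print(evaluate([85, 79, 82, 78, 65], critical=80, critical_recovery=70))
# → ['ALERT', 'ALERT', 'ALERT', 'ALERT', 'OK']
```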
### 4. Include Context in Messages
```python
message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
```
## ⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (same as dashboards):
```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark the monitor for deletion instead of deleting it."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```
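The `client` here is whatever wrapper you use around the Datadog API; its `get_monitor`/`update_monitor` methods are assumed by the skill, not defined in it. One way to sanity-check the marking logic locally is with an in-memory stub (the helper is repeated below so the sketch runs standalone; the monitor ID and name are made up):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark the monitor for deletion instead of deleting it (copy of the skill's helper)."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        return False
    client.update_monitor(monitor_id, {"name": f"[MARKED FOR DELETION] {name}"})
    return True

class StubClient:
    """In-memory stand-in for a real Datadog client, for local testing only."""
    def __init__(self):
        self.monitors = {"12345": {"name": "High CPU on web servers"}}

    def get_monitor(self, monitor_id):
        return self.monitors[monitor_id]

    def update_monitor(self, monitor_id, fields):
        self.monitors[monitor_id].update(fields)

client = StubClient()
print(safe_mark_monitor_for_deletion("12345", client))  # → True: monitor gets renamed
print(safe_mark_monitor_for_deletion("12345", client))  # → False: already marked, no-op
```

The second call returning `False` is the point of the workflow: marking is idempotent, so re-running an audit script never stacks prefixes.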
## Monitor Types

| Type | Use Case |
|---|---|
| `metric alert` | CPU, memory, custom metrics |
| `query alert` | Complex metric queries |
| `service check` | Agent check status |
| `event alert` | Event stream patterns |
| `log alert` | Log pattern matching |
| `composite` | Combine multiple monitors |
| `apm` | APM metrics |
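A composite monitor's query is boolean logic (`&&`, `||`) over the IDs of other monitors. The evaluation can be sketched as follows (the monitor IDs and states are made up, and this toy evaluator handles only a single binary operator, unlike real composite queries):

```python
def eval_composite(query, states):
    """Evaluate a two-term composite query like '12345 && 67890' against child states."""
    left, op, right = query.split()
    a, b = states[left], states[right]
    return (a and b) if op == "&&" else (a or b)

# Hypothetical child monitors: 12345 is alerting, 67890 is not
states = {"12345": True, "67890": False}

print(eval_composite("12345 && 67890", states))  # → False: both children must alert
print(eval_composite("12345 || 67890", states))  # → True: either child alerting suffices
```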
## Audit Monitors

```bash
# Find monitors without a team tag (owner)
pup monitors list --json | jq '.[] | select((.tags // []) | contains(["team:"]) | not) | {id, name}'

# Find monitors whose state changed most recently (flapping candidates)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
```
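The same ownership audit can be done in Python, assuming each monitor in the JSON output carries `id`, `name`, and a `tags` list (field names taken from the jq filter above; the payload below is synthetic):

```python
import json

def unowned_monitors(monitors_json):
    """Return {id, name} for every monitor lacking a team:* tag."""
    return [
        {"id": m["id"], "name": m["name"]}
        for m in json.loads(monitors_json)
        if not any(t.startswith("team:") for t in m.get("tags") or [])
    ]

# Example payload shaped like `pup monitors list --json` output (synthetic)
payload = json.dumps([
    {"id": 1, "name": "cpu", "tags": ["team:platform"]},
    {"id": 2, "name": "orphan", "tags": ["env:prod"]},
    {"id": 3, "name": "untagged", "tags": None},
])
print(unowned_monitors(payload))
# → [{'id': 2, 'name': 'orphan'}, {'id': 3, 'name': 'untagged'}]
```

The `or []` guard mirrors the jq `// []` fallback: monitors with a `null` tags field count as unowned instead of crashing the audit.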
## Downtime vs Muting

| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring windows |
```bash
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```
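If you script downtimes, the `--start`/`--end` values are UTC ISO 8601 timestamps. A small helper to build a window of a given length (the 4-hour window matches the example above; the helper itself is illustrative, not part of the skill):

```python
from datetime import datetime, timedelta, timezone

def downtime_window(start: datetime, hours: int) -> tuple[str, str]:
    """Return (start, end) as UTC ISO 8601 strings with a trailing 'Z'."""
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)

start = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
print(downtime_window(start, 4))
# → ('2024-01-15T02:00:00Z', '2024-01-15T06:00:00Z')
```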
## Failure Handling

| Problem | Fix |
|---|---|
| Alert not firing | Check the query returns data; verify thresholds |
| Too many alerts | Increase the window; add a recovery threshold |
| "No data" alerts | Check agent connectivity; confirm the metric exists |
| Auth error | `pup auth refresh` |