# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

This skill requires `pup` on your `PATH`. See Setup Pup.

## Quick Start

```shell
pup auth login
```

## Common Operations

### List Monitors

```shell
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
```

### Get Monitor

```shell
pup monitors get <id> --json
```

### Create Monitor

```shell
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```
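The query string above has a fixed anatomy: a time aggregation and window (`avg(last_5m)`), a space aggregation and metric (`avg:system.cpu.user`), a scope (`{env:prod}`), an optional `by {...}` grouping, and a comparison threshold (`> 80`). As a rough sketch of that structure, here is a hypothetical parser (not part of `pup` or the Datadog API) that splits a metric-alert query into its parts:

```python
import re

# One regex per query segment; matches simple metric-alert queries only.
QUERY_RE = re.compile(
    r"(?P<time_agg>\w+)\((?P<window>last_\w+)\):"    # e.g. avg(last_5m)
    r"(?P<space_agg>\w+):(?P<metric>[\w.]+)"         # e.g. avg:system.cpu.user
    r"\{(?P<scope>[^}]*)\}"                          # e.g. {env:prod}
    r"(?:\s*by\s*\{(?P<group>[^}]*)\})?"             # optional: by {host}
    r"\s*(?P<op>[><=]+)\s*(?P<threshold>[\d.]+)"     # e.g. > 80
)

def parse_query(query: str):
    """Return the query's named parts, or None if it doesn't match."""
    match = QUERY_RE.match(query)
    return match.groupdict() if match else None

parts = parse_query("avg(last_5m):avg:system.cpu.user{env:prod} > 80")
print(parts["window"], parts["metric"], parts["threshold"])
```

This covers only the simple shape shown in this document; real Datadog queries support arithmetic, functions, and multiple scopes that this sketch ignores.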

### Mute/Unmute

```shell
# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```

## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
|------|-----|
| No flapping alerts | Use `last_Xm`, not `last_1m` |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | `@runbook-url` in the message |

```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```

### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```

### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
```
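Why recovery thresholds matter: the monitor only returns to OK once the value drops below `critical_recovery`, not the moment it dips back under `critical`, so a value hovering near the threshold cannot flap. A minimal hysteresis sketch in plain Python (an illustration, not Datadog's actual evaluator):

```python
def evaluate(values, critical=80, critical_recovery=70):
    """Yield the alert state after each datapoint, with hysteresis."""
    alerting = False
    for v in values:
        if not alerting and v > critical:
            alerting = True    # crossed the critical threshold
        elif alerting and v < critical_recovery:
            alerting = False   # recover only below the recovery threshold
        yield "Alert" if alerting else "OK"

# 79 -> 81 -> 78 -> 69: without the recovery threshold, 78 would flip back to OK
print(list(evaluate([79, 81, 78, 69])))  # ['OK', 'Alert', 'Alert', 'OK']
```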

### 4. Include Context in Messages

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""

## ⚠️ NEVER Delete Monitors Directly

Use safe deletion workflow (same as dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False

    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```
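The workflow can be exercised against an in-memory stand-in before pointing it at a real client. `FakeClient` below is hypothetical; it only assumes the `get_monitor`/`update_monitor` shape used above, and the marking logic is restated here so the sketch is self-contained:

```python
MARKER = "[MARKED FOR DELETION]"

class FakeClient:
    """In-memory stand-in; a real client would call the Datadog API."""
    def __init__(self, monitors):
        self.monitors = monitors  # id -> monitor dict

    def get_monitor(self, monitor_id):
        return self.monitors[monitor_id]

    def update_monitor(self, monitor_id, body):
        self.monitors[monitor_id].update(body)

def safe_mark(monitor_id, client) -> bool:
    """Prefix the name instead of deleting; idempotent on repeat runs."""
    name = client.get_monitor(monitor_id).get("name", "")
    if MARKER in name:
        return False
    client.update_monitor(monitor_id, {"name": f"{MARKER} {name}"})
    return True

client = FakeClient({"123": {"name": "old cpu monitor"}})
assert safe_mark("123", client) is True
assert safe_mark("123", client) is False  # second run is a no-op
print(client.get_monitor("123")["name"])
```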

## Monitor Types

| Type | Use Case |
|------|----------|
| `metric alert` | CPU, memory, custom metrics |
| `query alert` | Complex metric queries |
| `service check` | Agent check status |
| `event alert` | Event stream patterns |
| `log alert` | Log pattern matching |
| `composite` | Combine multiple monitors |
| `apm` | APM metrics |

## Audit Monitors

```shell
# Find monitors without owners (no tag containing "team:")
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find monitors whose state changed most recently (a rough proxy for noisy monitors)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
```

## Downtime vs Muting

| Use | When |
|-----|------|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |

```shell
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```

## Failure Handling

| Problem | Fix |
|---------|-----|
| Alert not firing | Check that the query returns data; verify thresholds |
| Too many alerts | Increase the window; add a recovery threshold |
| "No data" alerts | Check agent connectivity; confirm the metric exists |
| Auth error | `pup auth refresh` |
