Datadog Operations

Broad Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. Roughly 73% platform coverage across 17 working scripts.

What This Skill Does

Investigation & Analysis:

  • Query APM traces to identify performance bottlenecks
  • Search logs for error patterns and anomalies
  • Detect security threats and attack attempts
  • Analyze Watchdog anomaly detection alerts
  • Query metrics with statistical analysis
  • Analyze Datadog usage and costs (FinOps)
  • Monitor LLM observability for GenAI applications
  • Query SLO status and error budgets
  • List services from service catalog
  • Analyze database query performance
  • Track frontend performance with RUM

Automation & Creation:

  • Create and manage monitors with alert thresholds
  • Generate dashboards for APM, security, costs, and LLM observability
  • Trigger Datadog workflows for incident response
  • Create and update incidents
  • Mute/unmute monitors during maintenance
  • Create synthetic uptime checks and browser tests

Prerequisites

Set environment variables:

export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com  # or datadoghq.eu, us3.datadoghq.com, etc.

Get keys from Datadog: Organization Settings > API Keys and Application Keys
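Before running any scripts, you can sanity-check the key against Datadog's key-validation endpoint (`GET /api/v1/validate`); `verify-setup.sh` below performs a fuller check. A minimal sketch, assuming `curl` and `jq` are installed:

```shell
# Validate the API key against Datadog's key-validation endpoint.
# Prints "true" for a valid key, "false" otherwise.
validate_dd_key() {
  curl -s -H "DD-API-KEY: ${DD_API_KEY}" \
    "https://api.${DD_SITE}/api/v1/validate" | jq -r '.valid // false'
}
```

Running `validate_dd_key` prints `true` only when the key and `DD_SITE` match.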

Working Scripts

1. Query APM Performance

Find slow endpoints and performance issues:

bash scripts/query-apm.sh --service my-service --duration 1h --limit 20

Returns:

  • Endpoints sorted by P95 latency
  • Request counts per endpoint
  • P50, P95, P99 latency
  • Problem endpoints (P95 > 500ms)
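The JSON output can be filtered directly. For example, to list only the problem endpoints (the field names `.data[].endpoint` and `.p95_ms` are assumptions here; adjust them to the script's actual output):

```shell
# List endpoints whose P95 latency breaches the 500 ms threshold.
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null \
  | jq -r '.data[] | select(.p95_ms > 500) | "\(.endpoint) \(.p95_ms)ms"'
```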

2. Query Security Signals

Find security threats and attack attempts:

bash scripts/query-security-signals.sh --service my-service --duration 24h

Returns:

  • Security signals by severity (critical, high, medium, low)
  • Attack types (SQL injection, XSS, auth failures)
  • Affected services and hosts
  • Recent security events with details

3. Query Watchdog Anomalies

Automated anomaly detection from Datadog Watchdog:

bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d

Returns:

  • Anomalies by type (latency, error_rate, traffic)
  • Affected services and resources
  • Start timestamps and severity
  • Baseline vs observed values

4. Search Logs

Search logs for error patterns:

bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h

Returns:

  • Error messages grouped by frequency
  • Associated trace IDs for investigation
  • Service and host breakdowns
  • Common error patterns

5. Query Metrics

Fetch metric data with statistical analysis:

bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h

Returns:

  • Time series data
  • Statistics (min, max, avg, p50, p95, p99)
  • Trend analysis (increasing, decreasing, stable)
  • Anomaly detection (values > 2 std dev)
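The "> 2 std dev" rule can be illustrated on a plain numeric series; this sketch mirrors the idea, not the script's exact implementation:

```shell
# Flag values more than two standard deviations from the mean of a
# whitespace-separated series read from stdin.
flag_anomalies() {
  awk '{
    for (i = 1; i <= NF; i++) { v[++n] = $i; sum += $i; sumsq += $i * $i }
  } END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    for (i = 1; i <= n; i++)
      if (sd > 0 && (v[i] - mean > 2 * sd || mean - v[i] > 2 * sd))
        printf "anomaly: value %s (mean %.1f, sd %.1f)\n", v[i], mean, sd
  }'
}
```

For example, `echo "10 11 9 10 12 10 50" | flag_anomalies` flags only the 50.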

6. Analyze Usage and Costs

FinOps cost analysis and optimization:

bash scripts/analyze-usage-cost.sh --duration 30d --product all

Returns:

  • APM span ingestion (indexed vs ingested)
  • Log volume breakdown
  • Infrastructure hosts and container hours
  • Custom metrics count
  • Estimated monthly costs by product
  • Cost optimization recommendations

7. Analyze LLM Performance

For GenAI applications, analyze LLM observability data:

bash scripts/analyze-llm.sh --service my-llm-app --duration 24h

Returns:

  • Token usage statistics (prompt + completion)
  • Cost estimates based on model pricing
  • Model latency (P50, P95, P99)
  • Error rates by model
  • Most expensive operations
  • Token usage trends

8. Manage Monitors

Create, list, mute, and manage Datadog monitors:

# List all monitors
bash scripts/manage-monitors.sh list

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "High Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
  --message "Error rate is high @slack-alerts"

# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2

# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345

Returns:

  • Monitor list with states (alert, warn, OK)
  • Created monitor ID and details
  • Mute/unmute confirmations

9. Create Dashboards

Generate dashboards from templates:

# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security

# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost

# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm

Dashboard types:

  • apm: Latency, errors, throughput by endpoint
  • logs: Log volume and error analysis
  • security: Security threats and attack patterns
  • cost: APM, logs, infrastructure costs
  • llm: Token usage, costs, model performance

10. Query SLOs

Check Service Level Objectives and error budgets:

# List all SLOs
bash scripts/query-slos.sh

# List SLOs for service
bash scripts/query-slos.sh --service payment-api

# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend

Returns:

  • SLO status (breaching, warning, OK)
  • Current value vs target threshold
  • Error budget remaining
  • Error budget status (exhausted, low, healthy)

11. Trigger Workflows

Execute Datadog workflow automation:

# List available workflows
bash scripts/trigger-workflow.sh list

# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123

# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'

Returns:

  • Workflow list with IDs and descriptions
  • Workflow instance ID when triggered
  • Execution status

12. Manage Incidents

Create and manage incident response:

# List active incidents
bash scripts/manage-incidents.sh list --status active

# Create critical incident
bash scripts/manage-incidents.sh create \
  --title "Payment API Down" \
  --service payment-api \
  --severity SEV-1

# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved

# Get incident details
bash scripts/manage-incidents.sh get --id abc123

Returns:

  • Incident list with status and severity
  • Created incident ID and details
  • Incident timeline and updates

13. Query Service Catalog

List services and ownership metadata:

# List all services
bash scripts/query-service-catalog.sh list

# List services for team
bash scripts/query-service-catalog.sh list --team backend

# Get service details
bash scripts/query-service-catalog.sh get --service payment-api

Returns:

  • Service metadata (kind, tier, lifecycle)
  • Team ownership and contacts
  • Repository links
  • Integration details

14. Manage Synthetic Tests

Create uptime checks and API tests:

# List all synthetic tests
bash scripts/manage-synthetics.sh list

# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
  --name "Payment API Uptime" \
  --url "https://api.example.com/health" \
  --method GET

# Create browser test
bash scripts/manage-synthetics.sh create-browser \
  --name "Login Flow" \
  --url "https://app.example.com/login"

# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def

Returns:

  • Test list with status (active, paused)
  • Created test ID and configuration
  • Test results and uptime status

15. Query Database Performance

Analyze database queries and performance:

# Query database performance
bash scripts/query-database.sh --host postgres-prod --duration 1h

# Get slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h

Returns:

  • Slow query patterns
  • P95/avg query duration
  • Connection metrics
  • Top queries by latency

16. Query RUM (Real User Monitoring)

Analyze frontend performance and user experience:

# Query RUM data for application
bash scripts/query-rum.sh --application abc-123-def --duration 1h

# Get page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h

Returns:

  • Page load times (avg, P95)
  • Frontend errors
  • Top pages by traffic
  • Error rate and types

17. Verify Setup

Validate Datadog configuration:

bash scripts/verify-setup.sh

Returns:

  • Environment variable validation
  • Agent connectivity check
  • Tracer installation detection

Incident Investigation Workflow

When investigating production issues:

1. Identify scope

# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h

# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h

2. Find performance issues

# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h

# Check error patterns
bash scripts/search-logs.sh --service affected-service --status error --duration 1h

3. Analyze metrics

# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h

# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h

4. Get specific traces

# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10

# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"

Security Analysis Workflow

Monitor and investigate security threats:

# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d

# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h

# Search for attack patterns in logs
bash scripts/search-logs.sh --query "sql injection OR xss OR authentication failed" --duration 24h

Cost Optimization Workflow

Analyze and optimize Datadog costs:

# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all

# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm

# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'

# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'

LLM Observability Workflow

For GenAI applications, monitor token usage and costs:

# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h

# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d

# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'

# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'

Deployment Impact Analysis

Compare metrics before/after deployment:

# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > before_metrics.json

# Deploy...

# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > after_metrics.json

# Compare latency (positive = P95 got worse after deploy)
jq -s '.[1].summary.avg_p95_ms - .[0].summary.avg_p95_ms' before.json after.json

# Check for new errors
bash scripts/search-logs.sh --service my-service --status error --duration 30m
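To turn the comparison into a CI gate, a relative threshold works better than an absolute difference. A sketch, assuming the `.summary.avg_p95_ms` field shown above and treating missing data as no regression:

```shell
# Flag the deploy when P95 latency grew by more than 10%.
REGRESSED=$(jq -s '(.[1].summary.avg_p95_ms / .[0].summary.avg_p95_ms) > 1.10' \
  before.json after.json 2>/dev/null || echo false)
if [ "$REGRESSED" = "true" ]; then
  echo "P95 latency regressed more than 10% after deploy" >&2
fi
```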

Monitor Creation Workflow

Set up monitoring for new services:

# Create latency monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - High Latency" \
  --query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
  --message "P95 latency above 500ms @slack-ops"

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
  --message "Error rate above 5% @pagerduty"

# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security

Incident Response Workflow

Automated incident management:

# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'

# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
  --title "Payment API SLO Breach" \
  --service payment-api \
  --severity SEV-2

# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'

# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
  jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
  xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2

# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved

SLO Monitoring Workflow

Track service level objectives:

# Check all SLOs
bash scripts/query-slos.sh

# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq '.summary.budget_exhausted')
if [ "$EXHAUSTED" -gt 0 ]; then
  bash scripts/manage-incidents.sh create \
    --title "Error Budget Exhausted" \
    --service affected-service \
    --severity SEV-3
fi

# Weekly SLO report
bash scripts/query-slos.sh | jq '{
  total: .total_slos,
  breaching: .summary.breaching,
  low_budget: .summary.budget_low,
  at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'

Output Format

All scripts return structured JSON for programmatic parsing:

{
  "status": "ok|warning|critical|error",
  "summary": {
    "...": "aggregated metrics"
  },
  "data": [...],
  "recommendations": [...]
}

Status messages go to stderr, JSON to stdout. This allows:

# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'

# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null

# Both
bash scripts/query-apm.sh --service my-service --duration 1h
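This separation makes the scripts easy to chain into automation. A sketch that branches on the top-level `status` field (the incident title and severity below are illustrative, not prescribed):

```shell
# Gate follow-up automation on the top-level "status" field.
STATUS=$(bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null \
  | jq -r '.status // "error"')
case "$STATUS" in
  critical) bash scripts/manage-incidents.sh create \
              --title "APM critical: my-service" --service my-service --severity SEV-2 ;;
  warning)  echo "Degraded performance on my-service, monitoring" ;;
  ok)       echo "my-service healthy" ;;
  *)        echo "Query failed or returned no data" >&2 ;;
esac
```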

Best Practices

Query Optimization

  • Use specific time ranges to reduce API calls
  • Filter by service/environment early
  • Paginate large result sets
  • Cache results when appropriate
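"Cache results when appropriate" can be as simple as a TTL file cache; a minimal sketch (the cache path and TTL are arbitrary choices):

```shell
# Minimal file cache: reuse a prior result if it is newer than TTL minutes.
cached() {
  ttl_min=$1; key=$2; shift 2
  f="/tmp/dd-cache-${key}.json"
  if [ -n "$(find "$f" -mmin "-${ttl_min}" 2>/dev/null)" ]; then
    cat "$f"          # cache hit: serve the stored JSON
  else
    "$@" | tee "$f"   # cache miss: run the command and store its output
  fi
}
```

Usage: `cached 10 apm bash scripts/query-apm.sh --service my-service --duration 1h | jq '.summary'` reuses results for 10 minutes.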

Alert Investigation

  • Start with Watchdog anomalies (automated detection)
  • Correlate security signals with application errors
  • Check metrics for trend confirmation
  • Search logs for detailed context

Cost Control

  • Run analyze-usage-cost.sh monthly
  • Implement high-priority recommendations first
  • Monitor sampling rates for high-volume services
  • Track custom metric growth
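The monthly cadence can be automated with cron; an illustrative crontab entry (the working directory and log path are assumptions):

```
# Run the full cost analysis at 09:00 on the 1st of each month.
0 9 1 * * cd /opt/datadog-operations && bash scripts/analyze-usage-cost.sh --duration 30d --product all > /var/log/dd-cost-report.json 2>&1
```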

Security Monitoring

  • Query security signals daily (automated check)
  • Filter by critical severity for alerting
  • Correlate with log patterns
  • Track attack trends over time
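The daily automated check above can be scripted end to end; the field name (`.summary.critical`) and the incident parameters are assumptions to adapt:

```shell
# Daily check: count critical signals and open an incident when any appear.
CRIT=$(bash scripts/query-security-signals.sh --severity critical --duration 24h 2>/dev/null \
  | jq -r '.summary.critical // 0')
if [ "${CRIT:-0}" -gt 0 ]; then
  bash scripts/manage-incidents.sh create \
    --title "Critical security signals in last 24h: ${CRIT}" \
    --service security-review \
    --severity SEV-2
fi
```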

Limitations

  • API rate limits apply (varies by endpoint)
  • Historical data retention depends on Datadog plan
  • Real-time queries have eventual consistency
  • Requires live Datadog data (APM, logs, security monitoring)

Notes

This skill provides comprehensive Datadog automation: query live data to investigate issues and create infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup; use the Datadog documentation for that.

Investigation: Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.

Automation: Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.

All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.
