Datadog Operations

Complete Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. 73% platform coverage with 17 working scripts.

What This Skill Does

Investigation & Analysis:

Query APM traces to identify performance bottlenecks
Search logs for error patterns and anomalies
Detect security threats and attack attempts
Analyze Watchdog anomaly detection alerts
Query metrics with statistical analysis
Analyze Datadog usage and costs (FinOps)
Monitor LLM observability for GenAI applications
Query SLO status and error budgets
List services from service catalog
Analyze database query performance
Track frontend performance with RUM

Automation & Creation:

Create and manage monitors with alert thresholds
Generate dashboards for APM, security, costs, and LLM observability
Trigger Datadog workflows for incident response
Create and update incidents
Mute/unmute monitors during maintenance
Create synthetic uptime checks and browser tests

Prerequisites

Set environment variables:

export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com  # or datadoghq.eu, us3.datadoghq.com, etc.

Get keys from Datadog: Organization Settings > API Keys and Application Keys

Working Scripts

1. Query APM Performance

Find slow endpoints and performance issues:

bash scripts/query-apm.sh --service my-service --duration 1h --limit 20

Returns:

Endpoints sorted by P95 latency
Request counts per endpoint
P50, P95, P99 latency
Problem endpoints (P95 > 500ms)

2. Query Security Signals

Find security threats and attack attempts:

bash scripts/query-security-signals.sh --service my-service --duration 24h

Returns:

Security signals by severity (critical, high, medium, low)
Attack types (SQL injection, XSS, auth failures)
Affected services and hosts
Recent security events with details

3. Query Watchdog Anomalies

Automated anomaly detection from Datadog Watchdog:

bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d

Returns:

Anomalies by type (latency, error_rate, traffic)
Affected services and resources
Start timestamps and severity
Baseline vs observed values

4. Search Logs

Search logs for error patterns:

bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h

Returns:

Error messages grouped by frequency
Associated trace IDs for investigation
Service and host breakdowns
Common error patterns

5. Query Metrics

Fetch metric data with statistical analysis:

bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h

Returns:

Time series data
Statistics (min, max, avg, p50, p95, p99)
Trend analysis (increasing, decreasing, stable)
Anomaly detection (values > 2 std dev)

6. Analyze Usage and Costs

FinOps cost analysis and optimization:

bash scripts/analyze-usage-cost.sh --duration 30d --product all

Returns:

APM span ingestion (indexed vs ingested)
Log volume breakdown
Infrastructure hosts and container hours
Custom metrics count
Estimated monthly costs by product
Cost optimization recommendations

7. Analyze LLM Performance

For GenAI applications, analyze LLM observability data:

bash scripts/analyze-llm.sh --service my-llm-app --duration 24h

Returns:

Token usage statistics (prompt + completion)
Cost estimates based on model pricing
Model latency (P50, P95, P99)
Error rates by model
Most expensive operations
Token usage trends

8. Manage Monitors

Create, list, mute, and manage Datadog monitors:

# List all monitors
bash scripts/manage-monitors.sh list

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "High Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
  --message "Error rate is high @slack-alerts"

# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2

# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345

Returns:

Monitor list with states (alert, warn, OK)
Created monitor ID and details
Mute/unmute confirmations

9. Create Dashboards

Generate dashboards from templates:

# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security

# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost

# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm

Dashboard types:

apm: Latency, errors, throughput by endpoint
logs: Log volume and error analysis
security: Security threats and attack patterns
cost: APM, logs, infrastructure costs
llm: Token usage, costs, model performance

10. Query SLOs

Check Service Level Objectives and error budgets:

# List all SLOs
bash scripts/query-slos.sh

# List SLOs for service
bash scripts/query-slos.sh --service payment-api

# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend

Returns:

SLO status (breaching, warning, OK)
Current value vs target threshold
Error budget remaining
Error budget status (exhausted, low, healthy)

11. Trigger Workflows

Execute Datadog workflow automation:

# List available workflows
bash scripts/trigger-workflow.sh list

# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123

# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'

Returns:

Workflow list with IDs and descriptions
Workflow instance ID when triggered
Execution status

12. Manage Incidents

Create and manage incident response:

# List active incidents
bash scripts/manage-incidents.sh list --status active

# Create critical incident
bash scripts/manage-incidents.sh create \
  --title "Payment API Down" \
  --service payment-api \
  --severity SEV-1

# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved

# Get incident details
bash scripts/manage-incidents.sh get --id abc123

Returns:

Incident list with status and severity
Created incident ID and details
Incident timeline and updates

13. Query Service Catalog

List services and ownership metadata:

# List all services
bash scripts/query-service-catalog.sh list

# List services for team
bash scripts/query-service-catalog.sh list --team backend

# Get service details
bash scripts/query-service-catalog.sh get --service payment-api

Returns:

Service metadata (kind, tier, lifecycle)
Team ownership and contacts
Repository links
Integration details

14. Manage Synthetic Tests

Create uptime checks and API tests:

# List all synthetic tests
bash scripts/manage-synthetics.sh list

# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
  --name "Payment API Uptime" \
  --url "https://api.example.com/health" \
  --method GET

# Create browser test
bash scripts/manage-synthetics.sh create-browser \
  --name "Login Flow" \
  --url "https://app.example.com/login"

# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def

Returns:

Test list with status (active, paused)
Created test ID and configuration
Test results and uptime status

15. Query Database Performance

Analyze database queries and performance:

# Query database performance
bash scripts/query-database.sh --host postgres-prod --duration 1h

# Get slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h

Returns:

Slow query patterns
P95/avg query duration
Connection metrics
Top queries by latency

16. Query RUM (Real User Monitoring)

Analyze frontend performance and user experience:

# Query RUM data for application
bash scripts/query-rum.sh --application abc-123-def --duration 1h

# Get page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h

Returns:

Page load times (avg, P95)
Frontend errors
Top pages by traffic
Error rate and types

17. Verify Setup

Validate Datadog configuration:

bash scripts/verify-setup.sh

Returns:

Environment variable validation
Agent connectivity check
Tracer installation detection

Incident Investigation Workflow

When investigating production issues:

1. Identify scope

# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h

# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h

2. Find performance issues

# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h

# Check error patterns
bash scripts/search-logs.sh --service affected-service --status error --duration 1h

3. Analyze metrics

# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h

# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h

4. Get specific traces

# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10

# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"

Security Analysis Workflow

Monitor and investigate security threats:

# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d

# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h

# Search for attack patterns in logs
bash scripts/search-logs.sh --query "sql injection OR xss OR authentication failed" --duration 24h

Cost Optimization Workflow

Analyze and optimize Datadog costs:

# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all

# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm

# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'

# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'

LLM Observability Workflow

For GenAI applications, monitor token usage and costs:

# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h

# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d

# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'

# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'

Deployment Impact Analysis

Compare metrics before/after deployment:

# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> before_metrics.json

# Deploy...

# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h >> after_metrics.json

# Compare latency
jq -s '.[0].summary.avg_p95_ms - .[1].summary.avg_p95_ms' before.json after.json

# Check for new errors
bash scripts/search-logs.sh --service my-service --status error --duration 30m

Monitor Creation Workflow

Set up monitoring for new services:

# Create latency monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - High Latency" \
  --query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
  --message "P95 latency above 500ms @slack-ops"

# Create error rate monitor
bash scripts/manage-monitors.sh create \
  --name "Payment API - Error Rate" \
  --query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
  --message "Error rate above 5% @pagerduty"

# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm

# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security

Incident Response Workflow

Automated incident management:

# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'

# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
  --title "Payment API SLO Breach" \
  --service payment-api \
  --severity SEV-2

# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'

# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
  jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
  xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2

# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved

SLO Monitoring Workflow

Track service level objectives:

# Check all SLOs
bash scripts/query-slos.sh

# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq '.summary.budget_exhausted')
if [ "$EXHAUSTED" -gt 0 ]; then
  bash scripts/manage-incidents.sh create \
    --title "Error Budget Exhausted" \
    --service affected-service \
    --severity SEV-3
fi

# Weekly SLO report
bash scripts/query-slos.sh | jq '{
  total: .total_slos,
  breaching: .summary.breaching,
  low_budget: .summary.budget_low,
  at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'

Output Format

All scripts return structured JSON for programmatic parsing:

{
  "status": "ok|warning|critical|error",
  "summary": {
    "...": "aggregated metrics"
  },
  "data": [...],
  "recommendations": [...]
}

Status messages go to stderr, JSON to stdout. This allows:

# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'

# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null

# Both
bash scripts/query-apm.sh --service my-service --duration 1h

Best Practices

Query Optimization

Use specific time ranges to reduce API calls
Filter by service/environment early
Paginate large result sets
Cache results when appropriate

Alert Investigation

Start with Watchdog anomalies (automated detection)
Correlate security signals with application errors
Check metrics for trend confirmation
Search logs for detailed context

Cost Control

Run analyze-usage-cost.sh monthly
Implement high-priority recommendations first
Monitor sampling rates for high-volume services
Track custom metric growth

Security Monitoring

Query security signals daily (automated check)
Filter by critical severity for alerting
Correlate with log patterns
Track attack trends over time

Limitations

API rate limits apply (varies by endpoint)
Historical data retention depends on Datadog plan
Real-time queries have eventual consistency
Requires live Datadog data (APM, logs, security monitoring)

Resources

Notes

This skill provides comprehensive Datadog automation: query live data to investigate issues AND create infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup - use Datadog documentation for that.

Investigation: Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.

Automation: Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.

All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.

datadog-operations

Datadog Operations

What This Skill Does

Prerequisites

Working Scripts

1. Query APM Performance

2. Query Security Signals

3. Query Watchdog Anomalies

4. Search Logs

5. Query Metrics

6. Analyze Usage and Costs

7. Analyze LLM Performance

8. Manage Monitors

9. Create Dashboards

10. Query SLOs

11. Trigger Workflows

12. Manage Incidents

13. Query Service Catalog

14. Manage Synthetic Tests

15. Query Database Performance

16. Query RUM (Real User Monitoring)

17. Verify Setup

Incident Investigation Workflow

Security Analysis Workflow

Cost Optimization Workflow

LLM Observability Workflow

Deployment Impact Analysis

Monitor Creation Workflow

Incident Response Workflow

SLO Monitoring Workflow

Output Format

Best Practices

Limitations

Resources

Notes